0703:q-bio0703053/simap.tex

1: \documentclass[floatfix,twocolumn,showpacs,preprintnumbers,amsmath,amssymb]{revtex4}

2:

3: \usepackage{graphicx,epsfig,amsfonts}% Include figure files

4: \usepackage{dcolumn}% Align table columns on decimal point

5: \usepackage{psfrag}

6: \usepackage{concmath,charter}

7: \usepackage{subfigure}

8: \begin{document}

9:

10: \newcommand{\e}[1]{\emph{#1}}

11: \newcommand{\avg}[1]{\langle #1 \rangle}

12: \newcommand{\va}[0]{{\mathbf a}}

13: \newcommand{\vb}[0]{{\mathbf b}}

14: \newcommand{\vc}[0]{{\mathbf c}}

15:

16:

17: \title{Global statistical analysis of the protein homology network}

18:

19: \author{C.~Miccio}

20: \email{miccio@mib.infn.it}

21: \affiliation{

22:   Dipartimento di Fisica G.Occhialini, Universit\`a di

23:   Milano--Bicocca and INFN, Sezione di Milano, Piazza della Scienza 3

24:   - I-20126 Milano, Italy}

25: \author{T.~Rattei}

26: \email{t.rattei@wzw.tum.de}

27: \affiliation{

28:   Department of Genome Oriented Bioinformatics, Technical University

29:   of Munich,

30:   Wissenschaftszentrum 5 Weihenstephan, 85350 Freising,

31:   Germany }

32:

33: \date{\today}

34:

35: \begin{abstract}

36:   The similarity between protein sequences is a directly and easily

37:   computed quantity from which to deduce information about their

38:   evolutionary distance and to detect homologous proteins. The

39:   \emph{SIMAP} database -- \emph{Similarity Matrix of Proteins} --

40:   provides a pre-computed similarity matrix covering the similarity

41:   space formed by almost all publicly available amino acid sequences

42:   from public databases and completely sequenced genomes.  From SIMAP

43:   we construct the protein homology network, where the proteins are

44:   the nodes and the links represent homology relationships.  With more

45:   than $5$ million nodes and about $70 \times 10^9$ edges it is the

46:   largest protein homology network ever built.  We

47:   describe its basic features and perform a global statistical

48:   analysis of the network. Starting from the Smith-Waterman similarity

49:   score, we define for each edge a weight $w$ to measure the

50:   similarity distance between two nodes. Keeping only edges with a

51:   weight greater than a minimal $\bar w$, and varying $\bar w$, we

52:   build a family of networks with different degrees of similarity. We

53:   investigate the distribution of connected components (clusters) of

54:   the networks at different $\bar w$ and in particular we find a

55:   behaviour similar to a phase transition guided by the formation of a

56:   giant component. Moreover we study selected sequence features and

57:   protein domains of protein pairs that connect different clusters in

58:   the networks at different levels of similarity.  We observed

59:   specific, non-random distributions of the protein features and

60:   domains for proteins connecting clusters at certain weight

61:   intervals.

62: \end{abstract}

63: %\pacs{87.10.+e, 05.10.Ln}

64:

65: \maketitle

66:

67: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

68: \section{Background}

69:

70: The number of known proteins is rapidly growing and the sequence of

71: amino acids is, at the moment, the main source of information for many

72: new proteins which still have unidentified functions. Protein sequence

73: analysis, and more specifically, the analysis of similarities among

74: protein sequences, is therefore the basis of studies trying to

75: understand protein evolutionary processes or to detect unknown

76: biological functions of new proteins. Proteins with similar sequences

77: can be found in different organisms and in a single organism

78: \footnote{Due to duplication and shuffling of coding segments in the

79:   DNA during evolution.}~\cite{revEvol}. By means of the

80: degree of similarity obtained by a pairwise sequence comparison it is

81: possible to deduce information about their evolutionary distance.

82: Specifically, two proteins are homologous if they evolved from a

83: common ancestral protein sequence and, in most cases, they have also

84: the same, or very similar, biological function. Homology can be

85: deduced from statistically significant sequence similarities. However,

86: new sequences often have only weak similarities to known proteins, and

87: single similarity searches are insufficient to assign validated

88: properties of characterized proteins to new sequences. Instead a graph

89: formed by all-against-all comparisons of a large amount of

90: protein-data could become useful. This is the case of {\bf SIMAP} --

91: \e{Similarity Matrix of Proteins} -- a database containing the

92: similarity space formed by almost all amino acid sequences, with

93: nearly 5.5 million non-redundant protein sequences drawn from

94: completely sequenced genomes and public databases. Moreover, the

95: pre-calculated similarity space allows very rapid access to

96: significant hits of interest and prevents time-consuming

97: re-computation. The algorithm that precomputes the sequence

98: similarities is based on the FASTA heuristic. First it compares

99: low-complexity masked proteins using FASTA and then it recalculates

100: the hits found using non-masked sequences and the Smith-Waterman

101: algorithm. In both phases of the alignment process the BLOSUM50 amino

102: acid substitution matrix is used. For each hit the Smith-Waterman

103: score, the identity, the gapped identity, the overlap and the start

104: and the stop coordinates of the alignment in

105: both proteins are stored. For more details see \cite{simap}.\\

106: Graphs formed by all-against-all sequence comparisons can be used to

107: derive inheritance patterns of proteins, to reconstruct the

108: evolutionary relationships between proteins and to classify them into

109: protein families by looking for dense clusters disconnected from the

110: rest of the network. To date, this approach has been carefully

111: evaluated by case studies targeted at selected protein families

112: \cite{phn}, but a global analysis of the complete homology network

113: formed by all publicly available proteins has not been published. The

114: aim of this work is to analyze global and local properties of the

115: graph forming the homology network.

116:

117:

118: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

119: \section{SIMAP graph representation}

120:

121: The information contained in the Simap database can be reorganized by

122: means of a weighted graph representation, $G(V, E, w)$, where $V$ is

123: the set of nodes, $E$ the set of edges, and $w$ a weight function on

124: the edges: $w : E \to [0,1]$. Each node, $\va \in V$, represents a

125: protein sequence and each edge, $e = \{ \va,\vb \} \in E$ between two

126: nodes $\va$, $\vb$ represents the stored alignment between the

127: respective protein sequences\footnote{For simplicity we will use the

128:   same notation to denote graph nodes and database proteins.}.  In

129: this way an undirected weighted graph can be obtained, since the

130: symmetry of the alignment procedure leads to undirected edges and the

131: score of the alignment allows the assignment of a suitable weight to

132: every edge. (Despite the possibility of making an alignment between a

133: protein sequence and itself, self-edges are not considered). More

134: specifically if $s(\va,\vb)$ is the Smith-Waterman (SW) optimal score

135: obtained with the FASTA algorithm between sequences $\va$ and $\vb$, a

136: suitable weight $w(\va,\vb) \in [0,1]$ for the edge $e = \{ \va,\vb

137: \}$ can be defined as follows:

138: \begin{equation}

139: \label{eq:weight}

140:   w(\va,\vb) = \frac{s(\va,\vb)}{ \sqrt{ \; s(\va,\va) \;

141:       s(\vb,\vb)}}.

142: \end{equation}

143: From $w(\va,\vb)$ one could define a distance function as $d(\va,\vb)

144: = 1 - w(\va,\vb)$, whose values lie in $[0, 1]$, like distance functions

145: usually defined on linear spaces. $d$ should satisfy the positivity, null

146: and symmetry properties for all pairs of protein sequences, and also

147: the triangle inequality, which is fully satisfied for the BLOSUM50

148: matrix.
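
As an illustration of Eq.~(\ref{eq:weight}), consider a purely hypothetical pair of sequences (the scores below are invented for the example and are not taken from Simap) with $s(\va,\vb) = 120$, $s(\va,\va) = 400$ and $s(\vb,\vb) = 900$. Then
\[
  w(\va,\vb) = \frac{120}{\sqrt{400 \cdot 900}} = \frac{120}{600} = 0.2 ,
  \qquad
  1 - w(\va,\vb) = 0.8 .
\]
By construction $w(\va,\va) = 1$ for every sequence, so a sequence always has zero distance from itself.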

149:

150: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

151: \section{Polishing procedure}

152: Strictly speaking, the set of all protein sequences of the Simap

153: database is not a good space over which to define the distance measure

154: $d$. There are, in fact, $1538$ pairs of sequences that have distance

155: equal to zero, although they are classified with a different sequence

156: id. However, they differ only in the presence of one or two `X' in

157: their amino acid sequence annotation, where `X' is the standard

158: symbol for an unknown amino acid residue in a protein sequence.  It is

159: therefore natural to remove, for each of these pairs of

160: sequences, the one that has the `X' in the sequence; this

161: procedure entails the removal, in the graph representation, of all

162: edges connected to the removed nodes. Another improvement for database

163: consistency is the checking of symmetry of all edges: every time a

164: directed edge is found, the inverse relation, if absent, is added.

165:

166: As a final result of these manipulations, a graph with $V = 5,489,907$

167: nodes and $E = 69,500,722,050$ edges can be constructed.

168:

169: Over the polished Simap protein sequences space the distance $d = 1 -

170: w(\va,\vb)$ violates the triangle inequality in a few cases (about

171: $0.2 \%$ of triangles). However, redefining the distance as, for instance,

172:

173: \begin{equation}

174: \label{eq:distance}

175: d(\va,\vb) =\sqrt{1 - w(\va,\vb)},

176: \end{equation}

177: the triangle inequality is satisfied for all triples of

178: linked proteins and (\ref{eq:distance}) has all the properties required

179: for a \e{distance measure}.
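
The effect of the square root can be seen on a hypothetical triple of proteins (illustrative weights, not Simap data) with $w(\va,\vb) = w(\vb,\vc) = 0.8$ and $w(\va,\vc) = 0.5$. With $d = 1 - w$ the triangle inequality is violated,
\[
  (1 - 0.8) + (1 - 0.8) = 0.4 \; < \; 0.5 = 1 - w(\va,\vc) ,
\]
whereas with the square-root form
\[
  \sqrt{0.2} + \sqrt{0.2} \approx 0.894 \; \ge \; \sqrt{0.5} \approx 0.707 .
\]
This single case only illustrates the mechanism; it is not a proof that the inequality holds in general.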

180:

181: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

182: \section{Characterization of Simap protein space}

183:

184: In the Simap database, protein sequences come from $104,560$ different

185: species.  There are, in particular, $3$ species (\e{Homo sapiens},

186: \e{Arabidopsis thaliana}, \e{Rice plants}) with more than $100,000$

187: protein sequences and $72$ with more than $10,000$.

188:

189: \begin{table}[!htb]

190:   \begin{center}

191:       \begin{tabular}{|c|c|c|}

192:         \vspace{-10pt} & & \\ \hline \textit{kingdoms} & & \textit{number of species} \\

193:         \vspace{-10pt} & & \\ \hline

194:         bacteria &                &  $11,130$    \\ \hline

195:         viruses  &  viruses       &  $13,708$    \\

196:         &  phages        &  $923$     \\ \hline

197:         plants   &                &  $31,232$   \\ \hline

198:         animalia & invertebrates  &  $25,951$   \\ %\cline{2-4}

199:         & vertebrates    &  $19,341$   \\

200:         & (rodents)      &  $(1,474)$  \\

201:         & (mammals)      &  $(1,854)$  \\

202:         & (primates)     &  $(393)$   \\ \hline

203:         environmental samples  &   &  $1,453$    \\ \hline

204:         synthetic              &   &  $822$      \\ \hline

205:       \end{tabular}

206:       \caption{\label{tab1} \small Number of species for each

207:         kingdom.}

208:   \end{center}

209: \end{table}

210:

211: A coarse subdivision of all species is shown in

212: Table~\ref{tab1}; it separates species into five (non-standard)

213: main kingdoms: bacteria, viruses, plants, invertebrates (animalia) and

214: vertebrates (animalia). The classification reveals the presence of

215: a great many animalia species, but only eight of these species

216: are present with their complete genome (the other animalia proteins

217: were imported from multiple species databases).

218: Figure~\ref{fig1} shows the protein distribution for

219: each kingdom. There is also a high number ($546,439$) of unassigned

220: protein sequences.\footnote{These sequences come from databases:

221:   \e{PDB proteins}, \e{mips non-redundant protein database},

222:   \e{UNIPROT SWISSPROT}, \e{UNIPROT-TrEMBL}, \e{PFAM sequences},

223:   \e{Eukaryotic signature proteins}.}

224:

225: \begin{figure}[!htb]

226:   \begin{center}

227:     \includegraphics[height=0.36\textwidth,angle=270]{f1.eps}

228:     \caption{\label{fig1} {\small Distribution of

229:         proteins for each kingdom. The inset shows the

230:         distribution within vertebrates.}}

231:   \end{center}

232: \end{figure}

233:

234: \subsection{Length and self-similarity distribution}

235:

236: \begin{figure}[!htb]

237:   \includegraphics[height=0.70\textwidth]{f2.eps}

238:   \caption{\label{fig2}{\small (a) Distribution of protein sequences'

239:       lengths. The inset shows an enlargement of the

240:       distribution. (b) Length distributions of protein sequences which

241:       belong to \e{bacteria} ($\avg{l} = 316.9$, $l_{max} = 36805$),

242:       \e{viruses} ($\avg{l} = 273.9$, $l_{max} = 7312$), \e{plants}

243:       ($\avg{l} = 314.5$, $l_{max} = 20925$), \e{invertebrates}

244:       ($\avg{l} = 416.1$, $l_{max} = 23015$), \e{vertebrates}

245:       ($\avg{l} = 397.1$, $l_{max} = 38031$).}}

246: \end{figure}

247:

248: The protein sequence space is characterized by the length

249: distribution shown in Figure~\ref{fig2}a; Figure~\ref{fig2}b

250: gives the length distributions for sequences belonging to bacteria,

251: viruses, plants, vertebrates and invertebrates.

252:

253: \begin{figure}[!htb]

254:   \vspace{0.2cm}

255:   \includegraphics[height=0.36\textwidth,angle=270]{f3.eps}

256:   \caption{\label{fig3}{\small Distribution of protein sequences'

257:       self-scores. The inset shows an enlargement of the

258:       distribution.}}

259: \end{figure}

260:

261: The distribution of the \e{self-similarity scores} of the protein sequences

262: appears in Figure~\ref{fig3}.  The self-similarity score distribution

263: is well reproduced by a mixture of normal distributions, one for each

264: sequence length. The self-similarity score $s(\va,\va)$ of a protein

265: sequence of length $l$ can be thought of as a sum of $l$ i.i.d.  random

266: variables, i.e. a sum of the self-similarity scores of random amino

267: acids. Knowing the amino acids background probabilities\footnote{ The

268:   values for background distribution of amino acids come from data

269:   used for the PAM matrix: $\;p_A=0.096;\; p_R=0.034;\; p_N=0.042;\;

270:   p_D=0.053;\; p_C=0.025;\; p_Q=0.032;\; p_E=0.053;\; p_G=0.090;\;

271:   p_H=0.034;\; p_I=0.035;\; p_L= 0.084;\; p_K=0.085;\; p_M=0.012;\;

272:   p_F=0.045;\; p_P=0.041;\; p_S=0.057;\; p_T=0.062;\; p_W=0.012;\;

273:   p_Y=0.030;\; p_V=0.078$.\\ They can be obtained from \e{{\small

274:       http://apps.bioneq.qc.ca/twiki/pub/Knowledgebase/PAM/}}

275:   \e{{\small PAM2.JPG}}} $p_{a}$ and the diagonal values of the

276: BLOSUM50 score matrix, $B_{aa}$, the self-similarity score of a random

277: amino acid will follow a distribution with mean $\avg{s} =

278: \sum_a p_a \, B_{aa} \;\; ( \approx 6.727)$ and standard deviation $ \sigma =

279: \sqrt{\sum_a p_a B_{aa}^2 - \avg{s} ^2} \;\;(\approx 2.067)$.

280: Self-similarity scores of random amino acid sequences of length $l$

281: will then have an approximately normal distribution $g(l,s)$ with mean $l\,\avg{s}$ and

282: standard deviation $\sqrt{l\,\sigma^2}$.  Finally, the self-similarity score

283: distribution is well approximated by the sum $\sum_{l} g(l,s) f(l)$,

284: where $f(l)$ is the observed length distribution (Figure~\ref{fig4}).
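
Written out explicitly, and assuming that $g(l,s)$ is the Gaussian density with the mean and width quoted above and that $f(l)$ is normalized to unit sum (both are assumptions made for this sketch), the mixture reads
\[
  \sum_{l} f(l)\, g(l,s) \;=\; \sum_{l} \frac{f(l)}{\sqrt{2 \pi \, l \, \sigma^2}}
  \exp\!\left[ - \frac{\left( s - l\,\avg{s} \right)^2}{2\, l\, \sigma^2} \right] ,
\]
where $\sigma \approx 2.067$ is read as the per-residue standard deviation (it is defined through a square root). For instance, the component with $l = 300$ is centred at $300 \times 6.727 \approx 2018$ and has width $\sqrt{300} \times 2.067 \approx 35.8$.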

285:

286: \begin{figure}[!htb]

287:   \vspace{0.2cm}

288:   \includegraphics[height=0.36\textwidth,angle=270]{f4.eps}

289:   \caption{\label{fig4}{\small Comparison between the distribution of protein sequences'

290:       self-scores and the curve obtained as a superposition of normal

291:       distributions suitably weighted by the protein sequence

292:       length distribution.}}

293: \end{figure}

294:

295: \subsection{Pairwise similarity distribution}

296:

297: The distribution of SW optimal similarity scores obtained from all FASTA

298: sequence alignments has a uniform lower cutoff equal to $80$, used

299: for storing hits in the Simap database. It was chosen independently of the

300: query and database lengths, as a compromise between

301: sensitivity and the need to store a manageable number of hits,

302: given the high number of protein sequences.

303:

304: \begin{figure}[!htb]

305:   \includegraphics[height=0.70\textwidth]{f5.eps}

306:   \caption{\label{fig5}{\small (a) Distribution of edges' weights $w$.

307:       The inset shows an enlargement of the distribution

308:       tail. (b) Repartition function of the edges' weight distribution.}}

309: \end{figure}

310:

311: In Figure~\ref{fig5}a the distribution of weights $w$ is shown, and in

312: Figure~\ref{fig5}b the corresponding repartition function $\rho(w)$. The

313: values of $\rho(w) \in [0,1]$ represent the fraction of edges which

314: have weight greater than or equal to $w$. From them we see that the

315: majority of the edges (about $80\%$ of the total number of edges) have a

316: very low value of $w$ ($\leq 0.2$).
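
In formulas, and with the convention just stated (edges of weight greater than or equal to the threshold are counted), the repartition function can be written as
\[
  \rho(w) = \frac{ \left| \{ e \in E \, : \, w(e) \ge w \} \right| }{ |E| } ,
\]
where $w(e)$ denotes the weight of edge $e$ and $|\cdot|$ the number of elements of a set (notation introduced here for illustration). It follows that $\rho(0) = 1$ and that $\rho$ is non-increasing in $w$. The observation that about $80\%$ of the edges have $w \leq 0.2$ then corresponds, roughly, to $\rho(0.2) \approx 0.2$.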

317:

318: \subsection{Coordination and cluster distribution}

319:

320: Weights $w$ can be used as a parameter to define a collection of

321: graphs. For a fixed value of $w = \bar{w}$ (or a value of $d = \bar{d}

322: = \sqrt{1 -\bar{w} }$ ), a graph is built keeping only edges with $w >

323: \bar{w}$ ($d < \bar{d}$). For high values of $\bar{w}$, i.e. at

324: small distances, nodes are linked if, and only if, the corresponding

325: protein sequences have a high degree of similarity; then it is

326: reasonable to expect graphs with many small connected components. By

327: decreasing $\bar{w}$, in other words by also linking proteins

328: having a lower degree of similarity, graphs with larger connected

329: components are expected. The graph obtained by considering all

330: possible edges (by fixing $\bar{w} = 0$) is not the complete graph,

331: due to the cutoff on the alignment score (it contains about $0.1 \%$ of

332: the edges of the corresponding complete graph).
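
The family of graphs described above can be summarized compactly as follows (the symbols $G_{\bar w}$ and $E_{\bar w}$ are shorthand introduced here for illustration only):
\[
  G_{\bar w} = (V, E_{\bar w}), \qquad
  E_{\bar w} = \{ e \in E \, : \, w(e) > \bar w \} .
\]
The edge sets are nested: $\bar w_1 < \bar w_2$ implies $E_{\bar w_2} \subseteq E_{\bar w_1}$, so every connected component at $\bar w_2$ is contained in a connected component at $\bar w_1$. Lowering $\bar w$ can therefore merge clusters but never split them. As a numerical check on the correspondence between the two thresholds, $\bar w = 0.975$ gives $\bar d = \sqrt{1 - 0.975} = \sqrt{0.025} \approx 0.158$.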

333:

334: We have built graphs for values of $\bar{w}$ equal to $0.975$, $0.95$,

335: $0.925$, $ 0.9$, $0.875$, $0.85$, $0.825$, $0.8$, $0.775$, $0.75$,

336: $0.725$, $0.7$, $0.675$, $0.65$, $0.625$, $0.6$, $0.575$, $0.55$,

337: $0.525$, $0.5$, $0.475$, $0.45$, $0.425$, $0.4$, $0.375$, $0.35$,

338: $0.325$, $0.3$, $0.275$, $0.25$, $0.225$, $0.2$, $0.175$, $0.15$,

339: $0.125$; for each of these values the set of protein

340: sequences splits into clusters, i.e. isolated connected components.

341: Linking proteins that have a greater and greater distance from each

342: other (decreasing $\bar{w}$), clusters merge to form larger clusters:

343: the number of isolated proteins and the number of components with a

344: very small size decrease, while the number of clusters of medium and

345: large size increases.

346:

347: \begin{figure}[!htb]

348:

349:     \subfigure[]{\label{fig6a}

350:       \includegraphics[width=0.34\textwidth, angle=270]{f6a.ps}

351:     } \vspace{-0.4cm}

352:     \subfigure[]{\label{fig6b}

353:       \includegraphics[width=0.40\textwidth, height=0.42\textwidth, angle=270]{f6b.ps}

354:     }

355:

356:   \caption{{\small (a) Distribution of size of connected

357:       components of the protein sequences graph built at $\bar{w} =

358:       0.975$ (red curve), $\bar{w} = 0.75$ (pink curve) and $\bar{w} =

359:       0.4$ (blue curve). It is evident that as the $\bar{w}$ value

360:       decreases, the number of connected components with small size

361:       decreases and the starting region of the power law behaviour

362:       shifts to higher values of size. (b) Distribution of

363:       coordination degree of the protein sequences graph built at

364:       $\bar{w} = 0.975$ (red curve), $\bar{w} = 0.75$ (pink curve) and

365:       $\bar{w} = 0.4$ (blue curve). As the $\bar{w}$ value decreases,

366:       the number of nodes with low coordination degree decreases and the

367:       starting region of the power law behaviour shifts to higher

368:       values of coordination degree.}}

369: \end{figure}

370:

371: Measuring the (not normalized) cluster distribution, we find that, for

372: each fixed value of $\bar{w}$, the number of clusters

373: $n_{\bar{w}}(s)$ of size $s$ follows, in a specific size range, a

374: power law behaviour, $n_{\bar{w}}(s) \sim s^{-\sigma(\bar{w})}$.

375: Fitted values of $\sigma(\bar{w})$ and fitting size ranges are

376: reported in Table~\ref{tab2} and a log-log plot of size

377: distribution $n_{\bar{w}}(s)$, for three different values of $\bar w$

378: is shown in Figure~\ref{fig6a}.  Also the (not normalized) coordination

379: degree distribution $f_{\bar w}(z)$ follows a power law distribution,

380: $f_{\bar w}(z) \sim z^{-\alpha(\bar w)}$, for each value of

381: $\bar{w}$. A log-log plot of coordination degree distribution

382: $f_{\bar{w}}(z)$, for three different values of $\bar w$ is shown in

383: Figure~\ref{fig6b}.  Fitted values of $\alpha(\bar{w})$ and the fitting

384: ranges of coordination degree are reported in Table~\ref{tab3}.
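
A power law appears as a straight line in a log-log plot, since
\[
  \log n_{\bar w}(s) = - \sigma(\bar w) \, \log s + \mathrm{const} ,
  \qquad
  \log f_{\bar w}(z) = - \alpha(\bar w) \, \log z + \mathrm{const} .
\]
Assuming that the exponents were obtained by a linear least-squares fit of the logarithm of the counts against the logarithm of the size (or degree) over the quoted ranges (the fitting procedure is not spelled out in the text, so this is an assumption), each exponent is minus the fitted slope. The correlation coefficient of such a fit is negative because the slope is negative, and a value close to $-1$ indicates a nearly perfect linear relation in the log-log plane.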

385:

386: \begin{table}[!htb]

387:   \begin{center}

388:       \begin{tabular}{|c|ccc|} \hline

389:         \vspace{-10pt} & & & \\ $\bar{w}$ & $\sigma$ & \quad component

390:         & \quad correlation \\ & & \quad size range & \quad coefficient \\

391:         \vspace{-10pt} & & & \\ \hline

392:         \vspace{-10pt} & & & \\

393:         $\;$ $0.95$ $\;$   &   $\;$ $2.70$  &  $10 - 60$  &  $-0.995$ \\

394:         $\;$ $0.90$ $\;$    &   $\;$ $2.70$  &  $10 - 60$  &  $-0.996$ \\

395:         $\;$ $0.85$ $\;$    &   $\;$ $2.69$  &  $10 - 60$  &  $-0.994$ \\

396:         $\;$ $0.80$ $\;$    &   $\;$ $2.62$  &  $10 - 80$  &  $-0.996$ \\

397:         $\;$ $0.75$ $\;$    &   $\;$ $2.52$  &  $10 - 80$  &  $-0.996$ \\

398:         $\;$ $0.70$ $\;$    &   $\;$ $2.40$  &  $10 - 80$  &  $-0.996$ \\

399:         $\;$ $0.65$ $\;$    &   $\;$ $2.32$  &  $10 - 100$  &  $-0.997$ \\

400:         $\;$ $0.60$ $\;$    &   $\;$ $2.21$  &  $10 - 100$  &  $-0.996$ \\

401:         $\;$ $0.55$ $\;$    &   $\;$ $2.17$  &  $10 - 100$  &  $-0.996$ \\

402:         $\;$ $0.50$ $\;$    &   $\;$ $2.07$  &  $10 - 100$  &  $-0.997$ \\

403:         $\;$ $0.45$ $\;$    &   $\;$ $2.01$  &  $10 - 100$  &  $-0.997$ \\

404:         $\;$ $0.40$ $\;$    &   $\;$ $2.00$  &  $10 - 100$  &  $-0.996$ \\

405:         $\;$ $0.35$ $\;$    &   $\;$ $1.98$  &  $10 - 100$  &  $-0.997$ \\

406:         $\;$ $0.30$ $\;$    &   $\;$ $1.98$  &  $10 - 100$  &  $-0.997$ \\

407:         $\;$ $0.25$ $\;$    &   $\;$ $2.01$  &  $10 - 100$  &  $-0.996$ \\ \hline

408:       \end{tabular}

409:     \caption{\label{tab2} \small Fitting values of exponent

410:       $\sigma$ of the power law distribution of connected components

411:       for selected values of $\bar{w}$. For each fitting the size

412:       range and its correlation coefficient are reported.}

413: \end{center}

414: \end{table}

415:

416:

417: \begin{table}[!htb]

418:   \begin{center}

419:     \begin{tabular}{|c|c|c|ccc|} \hline

420:       \vspace{-10pt} & & & & &\\

421:

422:       $\bar{w}$ &  $\avg{z}$ & max $z$ & $\alpha$ & \quad coordination & \quad

423:       correlation \\

424:

425:       & & & & \quad degree range & \quad coefficient \\

426:       \vspace{-10pt} & & & & &\\ \hline

427:       \vspace{-10pt} & & & & &\\

428:

429:       $0.95$   & $14.4$    & $5735$  & $1.59$  & $25 - 100$    &  $-0.990$ \\

430:       &           &         & $1.46$  &  $100 - 500$  &  $-0.953$ \\ \hline

431:

432:       $0.90$   & $73.1$    & $10794$  & $1.58$  & $25 - 100$   &  $-0.988$ \\

433:       &           &          & $1.51$  & $100 - 500$  &  $-0.939$ \\ \hline

434:

435:         $0.85$   & $138.3$   & $16500$  & $1.68$  & $25 - 100$   &  $-0.993$ \\

436:         &           &          & $1.42$  & $100 - 800$  &  $-0.964$ \\ \hline

437:

438:         $0.80$   & $207.2$   & $ 23726$ & $1.73$  & $25 - 100$   &  $-0.994$ \\

439:         &           &          & $1.29$  & $100 - 800$  &  $-0.941$ \\ \hline

440:

441:         $0.75$   & $294.0$   & $33265$  & $1.79$  & $25 - 100$   &  $-0.997$ \\

442:         &           &          & $1.22$  & $100 - 1000$ &  $-0.956$ \\ \hline

443:

444:         $0.70$   & $395.3$   & $35202$  & $1.74$  & $25 - 100$   &  $-0.996$ \\

445:         &           &          & $1.28$  & $100 - 1000$ &  $-0.946$ \\ \hline

446:

447:         $0.65$   & $507.8$   & $36333$  & $1.71$  & $25 - 100$   &  $-0.998$ \\

448:         &           &          & $1.39$  & $100 - 1000$ &  $-0.950$ \\ \hline

449:

450:         $0.60$   & $622.3$   & $37729$  & $1.63$  & $25 - 100$   &  $-0.999$ \\

451:         &           &          & $1.32$  & $100 - 1500$ &  $-0.930$ \\ \hline

452:

453:         $0.55$   & $745.3$   & $41871$  & $1.54$  & $25 - 100$   &  $-0.998$ \\

454:         &           &          & $1.44$  & $100 - 1500$ &  $-0.927$ \\ \hline

455:

456:         $0.50$   & $911.7$   & $49895$  & $1.44$  & $25 - 100$   &  $-0.998$ \\

457:         &           &          & $1.56$  & $100 - 2000$ &  $-0.944$ \\ \hline

458:

459:         $0.45$   & $1108.1$  & $51309$  &  $1.38$  & $25 - 100$  &  $-0.998$ \\

460:         &           &          &  $1.62$  & $100 - 2000$&  $-0.951$ \\ \hline

461:

462:         $0.40$   & $1314.2$  & $51956$  & $1.28$  & $25 - 100$   &  $-0.998$ \\

463:         &           &          & $1.67$  & $100 - 2500$ &  $-0.946$ \\ \hline

464:

465:         $0.35$   & $1501.9$  & $52513$  & $1.19$  & $25 - 100$   &  $-0.998$ \\

466:         &           &          & $1.72$  & $100 - 2500$ &  $-0.961$ \\ \hline

467:

468:         $0.30$   & $1668.9$  & $60722$  & $1.08$  & $25 - 100$   &  $-0.997$ \\

469:         &           &          & $1.74$  & $100 - 3000$ &  $-0.969$ \\ \hline

470:

471:         $0.25$   & $1826.2$  & $64781$  & $0.97$  & $25 - 100$   &  $-0.997$ \\

472:         &           &          & $1.78$  & $100 - 3000$ &  $-0.969$ \\ \hline

473:       \end{tabular}

474:     \caption{\label{tab3} \small Fitting values of exponent $\alpha$ of

475:       the power law distribution of coordination degree for selected

476:       values of $\bar{w}$. We perform two linear fits that differ in the

477:       choice of the fitting range of coordination degree. For each fit the

478:       range of coordination degree and its correlation coefficient are

479:       reported. In the second column the average degree is shown; the

480:       third column gives the maximum value of the coordination degree. }

481:   \end{center}

482: \end{table}

483:

484: \section{Comparison with generalized random graphs}

485:

486: It is interesting to compare these behaviours with that of a

487: model of random graphs. It is well known that, in the classical model,

488: random graphs (where every pair of nodes is chosen to be an edge with

489: probability $p$, as introduced by Erd\H{o}s and R\'enyi

490: \cite{erdos_renyi}) have the same expected coordination degree at

491: every node, so they are characterized by a Poissonian coordination

492: degree distribution with mean value $\avg{z} \sim p V$. Furthermore, as

493: soon as $\avg{z}$ assumes a value greater than $1$, a giant connected

494: component appears, that is a component whose size is much greater than

495: the size of all other components, and that contains a significant

496: fraction of all the graph's nodes.
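
For reference, the standard results for the classical model, in the limit of large $V$ at fixed $\avg{z}$, can be written as
\[
  P(z) = e^{-\avg{z}} \, \frac{\avg{z}^{z}}{z!} ,
  \qquad
  S = 1 - e^{-\avg{z}\, S} ,
\]
where $P(z)$ is the probability that a node has coordination degree $z$ and $S$ is the fraction of nodes belonging to the giant component (both symbols are introduced here only for illustration). The second equation has a non-zero solution only for $\avg{z} > 1$; for example, $\avg{z} = 2$ gives $S \approx 0.80$.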

497:

498: A better theoretical comparison model could be represented by

499: generalized random graphs endowed with a specific degree-distribution.

500: These can be generated via a Monte Carlo algorithm (following the

501: work in \cite{burda} of Burda et al.). In particular, starting from a

502: random graph of $V$ nodes and $E$ edges, making local graph

503: transformations which leave the number of nodes and the number of

504: edges constant and accepting them with a probability which depends on

505: the desired equilibrium degree distribution (Metropolis algorithm), we

506: have generated a collection of random graphs with the same

507: coordination degree distribution and the same average degree as some

508: of our protein sequences graphs.
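
In generic form, a Metropolis step of this kind proposes a local move $G \to G'$ that preserves the number of nodes and the number of edges, and accepts it with probability
\[
  P_{\mathrm{acc}}(G \to G') = \min \left\{ 1, \; \frac{W(G')}{W(G)} \right\} ,
\]
where $W(G)$ is a statistical weight chosen so that the stationary ensemble has the desired degree distribution. This is only the textbook acceptance rule for symmetric proposals: the symbol $W$ and the form above are illustrative assumptions, and the specific weights and moves of \cite{burda} are not reproduced here.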

509:

510: For each of them we observe that the distribution of connected

511: components in the protein sequence graphs is fundamentally different

512: from that in the random graphs. In the latter model the power law behaviour

513: is absent, while there is always a dominant giant connected

514: component, much larger than the many other small components, whose

515: size distribution decreases exponentially (see Figure~\ref{fig7}).

516:

517: \begin{figure}[!htb]

518:   \includegraphics[height=0.36\textwidth,angle=270]{f7.eps}

519:   \caption{\label{fig7}{\small Top: coordination degree distribution

520:       of the collection of random graphs generated via Monte-Carlo

521:       algorithm fixing the equilibrium degree distribution equal to

522:       that observed in the protein sequences graph at $\bar{w} =

523:       0.99$ and fixing the average degree equal to $\avg{z} = 0.57$.

524:       Bottom: size distribution of connected components of the random

525:       graphs.}}

526: \end{figure}

527:

528: By comparison, in the Simap protein sequences space the coordination

529: degree distribution $f_{\bar w}(z)$ and the connected component

530: distribution $n_{\bar w}(s)$ are strongly correlated. The former, for

531: example, can be reproduced quite well by means of $n_{\bar w}(s)$. Let

532: the index $i$ label all connected components and suppose that all

533: possible edges between nodes belonging to a connected component of

534: size $s_i$ were present; then the cluster would be a complete subgraph and all its

535: $s_i$ nodes would have coordination degree equal to $z_i = s_i-1$. If

536: this were true for all connected components then all clusters would be

537: complete subgraphs and we would expect a coordination degree

538: distribution equal to $f_{\bar w}(z) \sim ( s \; n_{\bar w}(s) ) |_{s

539:   = z + 1} $. In our graphs, although complete connected components

540: are present, the majority of clusters only have a high average degree,

541: not equal to their size minus one as in complete graphs.

542: However, consider a component with size $s_i$ and a number of

543: edges equal to $m_i$; the quantity $\Delta_i = \frac{2 m_i}{s_i

544:   (s_i-1)}$ represents the fraction of edges that are present in the

545: $i$-th component with respect to the number of edges that would be present

546: if the component were a complete subgraph (i.e. $s_i (s_i-1)/2$).

547: Introducing $\Delta_i$ as a measure of edges' density for each

548: component we can approximate the coordination degree distribution

549: $f_{\bar w}(z)$ by means of the size connected component distribution

550: $n_{\bar w}(s)$ too. Specifically we find that the coordination degree

551: distribution behaves like $f_{\bar w}(z) \sim \bar{\Delta}(z+1) \;

552: (z+1) \; n_{\bar w}(z+1) $, where $\bar{\Delta}(s)$ is the edges'

553: density averaged over all components of size $s$: $\bar{\Delta}(s) =

554: \frac{\sum_{i} \delta_{s_i, s} \Delta_i}{\sum_{i} \delta_{s_i, s}}$.

555: Figure~\ref{fig8} shows both the observed degree distribution and the

556: approximated degree distribution obtained by means of $n(s)$ of the

557: graph at $\bar{w} = 0.95$.
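
To make the role of $\Delta_i$ concrete, take a hypothetical component (numbers chosen only for illustration) with $s_i = 5$ nodes and $m_i = 7$ edges. A complete subgraph on $5$ nodes has $5 \cdot 4 / 2 = 10$ edges, so
\[
  \Delta_i = \frac{2 \cdot 7}{5 \cdot 4} = 0.7 ,
\]
and the average coordination degree within the component is $2 m_i / s_i = 2.8 = \Delta_i \, (s_i - 1)$. For a complete component $\Delta_i = 1$ and every node has degree $s_i - 1$, which recovers the estimate $f_{\bar w}(z) \sim ( s \; n_{\bar w}(s) ) |_{s = z + 1}$.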

558:

559: \begin{figure}[!htb]

560:   \includegraphics[height=0.36\textwidth,angle=270]{f8.eps}

561:   \caption{\label{fig8}{\small Observed degree distribution (black

562:       curve) and the approximated degree distribution (red curve)

563:       obtained by means of $n(s)$ of the graph at $\bar{w} = 0.95$.}}

564: \end{figure}

565:

566: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

567: \section{Giant component}

568:

569: \begin{figure}[!htb]

570:   \includegraphics[height=0.70\textwidth]{f9.eps}

571:   \caption{\label{fig9}{\small (a) Fraction of nodes belonging to the

572:       largest cluster for each value of $\bar{w}$. (b) Fraction of

573:       species present in the largest cluster for each value of

574:       $\bar{w}$.}}

575: \end{figure}

576:

577: An interesting phenomenon occurs as the value of $\bar{w}$ decreases: we see

578: the formation of a giant component.  In Figure~\ref{fig9}a the

579: behaviour of the fraction of nodes belonging to the largest component

580: is shown.
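
The quantity plotted in Figure~\ref{fig9}a can be sketched as follows. This is a minimal Python example, assuming the graph is given as a list of weighted edges with nodes labelled by integers; it keeps only the edges with $w \geq \bar{w}$ and merges their endpoints with a union-find structure. The function name and the input format are hypothetical and do not describe the implementation actually used for the SIMAP data.

```python
def largest_component_fraction(num_nodes, weighted_edges, threshold):
    """Fraction of nodes in the largest connected component of the graph
    that keeps only the edges with weight >= threshold.

    weighted_edges: iterable of (u, v, w) with 0 <= u, v < num_nodes.
    """
    parent = list(range(num_nodes))
    size = [1] * num_nodes

    def find(x):
        # Find the root of x, halving the path on the way up.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, w in weighted_edges:
        if w < threshold:
            continue  # edge not present at this threshold
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        if size[ru] < size[rv]:
            ru, rv = rv, ru  # attach the smaller tree to the larger one
        parent[rv] = ru
        size[ru] += size[rv]

    return max(size[find(x)] for x in range(num_nodes)) / num_nodes
```

Calling the function for a decreasing sequence of thresholds gives one point of the curve per value of $\bar{w}$.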

581:

582: \begin{table*}[!htb]

583:   \begin{center}

584:   \begin{tabular}{|c|c|c|c|c|c|c|c|c|}\hline

585:     \vspace{-7pt} & & & & & & & & \\

586:

587:     $\bar{w}$ & $\bar{d}$ & size & bacteria & viruses & plants &

588:     invertebrates & vertebrates & number of different species \\

589:

590:     \vspace{-7pt} & & & & & & & & \\ \hline

591:

592:     $0.975$ & $0.1581$ & $8322$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $4$ \\

593:     $0.950$ & $0.2236$ & $15955$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $4$ \\

594:     $0.925$ & $0.2739$ & $47687$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $10$ \\

595:     $0.900$ & $0.3162$ & $50729$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\

596:     $0.875$ & $0.3536$ & $51028$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\

597:     $0.850$ & $0.3873$ & $51405$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $14$ \\

598:     $0.825$ & $0.4183$ & $51969$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\

599:     $0.800$ & $0.4472$ & $52097$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\

600:     $0.775$ & $0.4743$ & $52881$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $29$ \\

601:     $0.750$ & $0.5000$ & $63003$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $60$ \\

602:     $0.725$ & $0.5244$ & $118777$ & $0.000$ & $1.000$ & $0.000$ & $0.000$ & $0.000$ & $67$ \\

603:     $0.700$ & $0.5477$ & $120974$ & $0.000$ & $0.999$ & $0.000$ & $0.000$ & $0.000$ & $106$ \\

604:     $0.675$ & $0.5701$ & $145278$ & $0.002$ & $0.997$ & $0.000$ & $0.000$ & $0.000$ & $302$ \\

605:     $0.650$ & $0.5916$ & $224310$ & $0.002$ & $0.749$ & $0.001$ & $0.000$ & $0.248$ & $988$ \\

606:     $0.625$ & $0.6124$ & $272426$ & $0.014$ & $0.662$ & $0.010$ & $0.007$ & $0.306$ & $4384$ \\

607:     $0.600$ & $0.6325$ & $297280$ & $0.028$ & $0.643$ & $0.015$ & $0.011$ & $0.303$ & $7854$ \\

608:     $0.575$ & $0.6519$ & $318472$ & $0.032$ & $0.613$ & $0.027$ & $0.015$ & $0.313$ & $9668$ \\

609:     $0.550$ & $0.6708$ & $362379$ & $0.047$ & $0.554$ & $0.035$ & $0.024$ & $0.341$ & $11437$ \\

610:     $0.525$ & $0.6892$ & $404788$ & $0.049$ & $0.526$ & $0.047$ & $0.029$ & $0.349$ & $15593$ \\

611:     $0.500$ & $0.7071$ & $450072$ & $0.065$ & $0.482$ & $0.055$ & $0.033$ & $0.365$ & $16272$ \\

612:     $0.475$ & $0.7246$ & $584371$ & $0.084$ & $0.379$ & $0.151$ & $0.037$ & $0.349$ & $20957$ \\

613:     $0.450$ & $0.7416$ & $718286$ & $0.114$ & $0.312$ & $0.194$ & $0.041$ & $0.340$ & $35346$ \\

614:     $0.425$ & $0.7583$ & $975629$ & $0.151$ & $0.229$ & $0.184$ & $0.095$ & $0.341$ & $68338$ \\

615:     $0.400$ & $0.7746$ & $1202753$ & $0.181$ & $0.188$ & $0.209$ & $0.096$ & $0.326$ & $76230$ \\

616:     $0.375$ & $0.7906$ & $1435734$ & $0.210$ & $0.160$ & $0.224$ & $0.093$ & $0.312$ & $77970$ \\

617:     $0.350$ & $0.8062$ & $1739772$ & $0.254$ & $0.133$ & $0.236$ & $0.087$ & $0.291$ & $80100$ \\

618:     $0.325$ & $0.8216$ & $2059217$ & $0.288$ & $0.117$ & $0.239$ & $0.083$ & $0.273$ & $82714$ \\

619:     $0.300$ & $0.8367$ & $2383804$ & $0.316$ & $0.102$ & $0.244$ & $0.080$ & $0.258$ & $84953$ \\

620:     $0.275$ & $0.8515$ & $2728214$ & $0.350$ & $0.090$ & $0.243$ & $0.078$ & $0.239$ & $86151$ \\

621:     $0.250$ & $0.8660$ & $3071192$ & $0.374$ & $0.083$ & $0.240$ & $0.076$ & $0.226$ & $90357$ \\

622:     $0.225$ & $0.8803$ & $3420697$ & $0.396$ & $0.078$ & $0.239$ & $0.074$ & $0.213$ & $94210$ \\

623:     $0.200$ & $0.8944$ & $3807556$ & $0.416$ & $0.076$ & $0.237$ & $0.073$ & $0.199$ & $101358$ \\

624:     $0.175$ & $0.9083$ & $4210208$ & $0.432$ & $0.074$ & $0.234$ & $0.072$ & $0.188$ & $102774$ \\

625:     $0.150$ & $0.9220$ & $4651704$ & $0.446$ & $0.072$ & $0.233$ & $0.073$ & $0.177$ & $103831$ \\

626:     $0.125$ & $0.9354$ & $5049016$ & $0.455$ & $0.069$ & $0.235$ & $0.073$ & $0.167$ & $104227$ \\ \hline

627:

628: \end{tabular}

629: \caption{\label{tab4} \small For each fixed value of $\bar w$, we

630:   computed the fraction of proteins, among those belonging to the

631:   largest component, that come from each of the five kingdoms.}

632:  \end{center}

633: \end{table*}

634:

635: Starting from approximately $\bar{w}\sim 0.65$ the largest component

636: begins to grow, capturing many smaller components.

637: Furthermore the components which are disconnected at $\bar{w} \sim

638: 0.675$ and which merge into the giant component at $\bar{w}\sim 0.65$

639: span many different sizes, from small components to very large

640: components. This phenomenon becomes more and more evident for lower

641: values of $\bar{w}$, when the coordination degree distribution of the

642: giant component follows a power law scaling.  This is evident also

643: from Figure~\ref{fig6b}, where we plot the distribution of the

644: coordination degree for the whole set of proteins. The exponent

645: $\alpha(\bar{w})$ of the power law behavior $f_{\bar w}(z) \sim

646: z^{-\alpha(\bar w)}$ varies slightly between the regions corresponding

647: to small values of the coordination degree $z$ and to large values of

648: $z$. Clearly when a giant component exists, the region with large $z$

649: is largely determined by the giant component itself. In

650: Table~\ref{tab3} we report the fitting values of the exponent

651: $\alpha(\bar{w})$ computed in two regions with small and large values

652: of $z$. As we decrease the value of $\bar w$, the two fitting values

653: of $\alpha(\bar w)$ become more and more divergent. In fact, since the

654: largest component is growing, the tail of the distribution $f_{\bar

655: w}(z)$ becomes more and more important and assumes a power law

656: behavior characterized by a different exponent.
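
To make the two-region fit concrete, here is a hedged Python sketch that estimates $\alpha$ in a chosen range of $z$ by ordinary least squares on $\log f$ versus $\log z$. The fitting procedure used for Table~\ref{tab3} is not specified in this section, so the least-squares method, the function name and the choice of the two ranges are assumptions made only for illustration.

```python
import math


def fit_power_law_exponent(degree_counts, z_min, z_max):
    """Least-squares estimate of alpha in f(z) ~ z**(-alpha).

    degree_counts: dict mapping degree z (> 0) to f(z) (> 0).
    Only points with z_min <= z <= z_max are used.
    """
    pts = [(math.log(z), math.log(f))
           for z, f in degree_counts.items()
           if z_min <= z <= z_max and z > 0 and f > 0]
    if len(pts) < 2:
        raise ValueError("need at least two points in the fitting range")

    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx  # slope of the log-log line is -alpha
```

Fitting once on a range of small $z$ and once on a range of large $z$ gives the two values whose difference is discussed above.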

657:

658: A significant fact goes with the rapid size increase of the largest

659: component. In Table~\ref{tab4} we show, for each $\bar{w}$, the fraction of

660: different kingdoms and the number of different species which appear in

661: the largest connected component.  Down to around $\bar{w} = 0.675 $

662: only proteins coming from viruses belong to the largest component;

663: moreover, this largest cluster has not yet become giant with respect to

664: smaller clusters. For $\bar{w} \lesssim 0.675$ the formation of a

665: giant component begins, and simultaneously all kingdoms enter

666: the composition of the giant cluster. This is also evident

667: from Figure~\ref{fig9}b, where we plot the fraction of species

668: present in the largest component.  This fraction increases rapidly

669: around the same value of $\bar w$. These processes continue for lower

670: values of $\bar{w}$, with the giant component including more and more

671: proteins belonging to many different species, and the ratio for each

672: kingdom tends to become the same as that of the whole database.

673: Furthermore around $\bar w \simeq 0.475$ there is a very sharp

674: increase both in the dimension of the giant component and especially

675: in the number of species present in it, as is evident from

676: Figures~\ref{fig9}a and~\ref{fig9}b.

677:

678: The processes just described may indicate the presence of a phase

679: transition: we have two different phases, one for large values of

680: $\bar w$, characterized by the presence of clusters with similar

681: dimensions and with the largest one composed especially of viruses,

682: and the second phase characterized by the presence of a giant

683: component composed of different species alongside other small

684: clusters. We note however that the phase transition is not sharp, but

685: the changes in the dimension and composition of the largest component

686: are spread in a range $ 0.475 < \bar w <0.675$. We also note that the

687: plot in Figure~\ref{fig9}b shows a very rapid increase around $\bar w \sim 0.475$.

688:

689: \begin{table*}[!htb]

690:   \begin{center}

691:   \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}  \hline

692:

693:     $\bar{w}$ & $0.95$ & $0.90$ & $0.85$ & $0.80$ &

694:     $0.75$ & $0.70$ & $0.65$ & $0.60$ & $0.55$ &

695:     $0.50$ & $0.45$ & $0.35$ & $0.25$ & $0.15$ \\ \hline

696:

697:     bacteria & $9.6$ & $12.2$ & $14.2$ & $17.2$ & $21.9$ & $22.6$

698:     & $23.6$ & $23.8$ & $23.9$ & $25.1$ & $25.8$ & $29.0$ & $35.6$

699:     & $57.0$ \\

700:

701:     viruses & $32.7$ & $31.4$ & $24.3$ & $17.6$ & $11.4$ & $7.4$ &

702:     $5.2$ & $3.8$ & $2.9$ & $2.7$ & $2.4$ & $2.7$ & $4.2$ & $7.5$

703:     \\

704:

705:     plants & $9.3$ & $10.8$ & $11.4$ & $9.4$ & $8.3$ & $7.3$ &

706:     $7.6$ & $7.8$ & $7.7$ & $7.5$ & $7.5$ & $6.2$ & $4.0$ & $0.0$

707:     \\

708:

709:     invertebrates & $11.6$ & $8.9$ & $7.4$ & $5.8$ & $3.6$ & $3.2$

710:     & $2.5$ & $2.0$ & $1.6$ & $1.5$ & $1.2$ & $1.4$ & $1.3$ &

711:     $1.1$ \\

712:

713:     vertebrates & $22.9$ & $23.0$ & $25.4$ & $25.7$ & $25.6$ &

714:     $25.9$ & $23.6$ & $20.0$ & $17.1$ & $13.0$ & $10.2$ & $5.2$ &

715:     $2.8$ & $1.1$ \\

716:

717:     bac-vir & $2.7$ & $2.2$ & $2.1$ & $2.1$ & $1.6$ & $1.6$ &

718:     $1.4$ & $1.0$ & $1.0$ & $1.1$ & $1.0$ & $1.7$ & $2.4$ & $3.2$

719:     \\

720:

721:     bac-pla & $1.6$ & $1.8$ & $2.8$ & $2.9$ & $3.5$ & $4.5$ &

722:     $5.9$ & $7.0$ & $8.5$ & $8.9$ & $9.1$ & $10.8$ & $11.3$ &

723:     $18.3$ \\

724:

725:     bac-inv & $0.5$ & $0.4$ & $0.7$ & $0.7$ & $0.8$ & $0.9$ &

726:     $1.3$ & $1.7$ & $2.1$ & $2.1$ & $2.0$ & $2.6$ & $3.0$ & $1.1$

727:     \\

728:

729:     bac-ver & $1.8$ & $2.0$ & $2.4$ & $2.3$ & $1.9$ & $1.9$ &

730:     $1.8$ & $1.6$ & $1.5$ & $1.5$ & $1.3$ & $1.1$ & $1.1$ & $1.1$

731:     \\

732:

733:     vir-pla & $0.2$ & $0.1$ & $0.2$ & $0.4$ & $0.3$ & $0.4$ &

734:     $0.3$ & $0.3$ & $0.2$ & $0.2$ & $0.2$ & $0.2$ & $0.5$ & $0.0$

735:     \\

736:

737:     vir-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &

738:     $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$

739:     \\

740:

741:     vir-ver & $0.2$ & $0.5$ & $0.7$ & $0.8$ & $0.9$ & $0.7$ &

742:     $0.6$ & $0.4$ & $0.3$ & $0.2$ & $0.1$ & $0.2$ & $0.1$ & $0.0$

743:     \\

744:

745:     pla-inv & $0.9$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.2$ &

746:     $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.3$ & $0.2$ & $0.5$ & $0.0$

747:     \\

748:

749:     pla-ver & $0.5$ & $0.9$ & $0.8$ & $1.1$ & $1.3$ & $1.0$ &

750:     $1.1$ & $1.2$ & $1.2$ & $1.0$ & $0.9$ & $1.3$ & $1.7$ & $1.1$

751:     \\

752:

753:     inv-ver & $0.5$ & $1.1$ & $2.6$ & $4.5$ & $7.0$ & $8.4$ &

754:     $9.2$ & $10.3$ & $10.9$ & $11.2$ & $11.0$ & $9.0$ & $5.5$ &

755:     $0.0$ \\

756:

757:     bac-vir-pla & $0.0$ & $0.4$ & $0.3$ & $0.5$ & $0.3$ & $0.3$ &

758:     $0.4$ & $0.2$ & $0.2$ & $0.2$ & $0.4$ & $0.4$ & $0.7$ & $1.1$

759:     \\

760:

761:     bac-vir-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &

762:     $0.0$ & $0.0$ & $0.1$ & $0.0$ & $0.1$ & $0.1$ & $0.3$ & $1.1$

763:     \\

764:

765:     bac-vir-ver & $0.2$ & $0.1$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ &

766:     $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.2$ & $0.2$ & $0.0$

767:     \\

768:

769:     bac-pla-inv & $0.0$ & $0.0$ & $0.1$ & $0.2$ & $0.5$ & $0.6$ &

770:     $0.8$ & $0.9$ & $1.3$ & $2.0$ & $2.3$ & $2.4$ & $3.1$ & $1.1$

771:     \\

772:

773:     bac-pla-ver & $0.0$ & $0.1$ & $0.0$ & $0.0$ & $0.1$ & $0.3$ &

774:     $0.6$ & $0.6$ & $0.9$ & $1.0$ & $1.3$ & $1.7$ & $1.4$ & $0.0$

775:     \\

776:

777:     bac-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.3$ & $0.4$ & $0.4$ &

778:     $0.4$ & $0.9$ & $0.8$ & $0.9$ & $0.9$ & $1.0$ & $0.8$ & $1.1$

779:     \\

780:

781:     vir-pla-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &

782:     $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$

783:     \\

784:

785:     vir-pla-ver & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ & $0.1$ &

786:     $0.1$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$

787:     \\

788:

789:     vir-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.2$ & $0.1$ & $0.2$ &

790:     $0.3$ & $0.2$ & $0.2$ & $0.2$ & $0.2$ & $0.1$ & $0.1$ & $0.0$

791:     \\

792:

793:     pla-inv-ver & $0.9$ & $1.4$ & $1.8$ & $5.5$ & $7.3$ & $8.4$ &

794:     $9.4$ & $11.0$ & $11.3$ & $12.0$ & $12.4$ & $13.4$ & $11.7$ &

795:     $0.0$ \\

796:

797:     bac-vir-pla-inv & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &

798:     $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.2$ & $0.0$

799:     & $0.0$ \\

800:

801:     bac-vir-pla-ver & $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ &

802:     $0.1$ & $0.2$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.1$ & $0.1$

803:     & $0.0$ \\

804:

805:     bac-vir-inv-ver & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ &

806:     $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.0$ & $0.1$ & $0.1$

807:     & $0.0$ \\

808:

809:     bac-pla-inv-ver & $0.2$ & $0.1$ & $0.4$ & $0.7$ & $1.0$ &

810:     $2.1$ & $2.5$ & $3.8$ & $5.1$ & $6.4$ & $8.0$ & $7.6$ & $6.7$

811:     & $0.0$ \\

812:

813:     vir-pla-inv-ver & $0.0$ & $0.1$ & $0.0$ & $0.0$ & $0.1$ &

814:     $0.1$ & $0.2$ & $0.3$ & $0.3$ & $0.3$ & $0.2$ & $0.2$ & $0.1$

815:     & $1.1$ \\

816:

817:     bac-vir-pla-inv-ver & $0.0$ & $0.0$ & $0.1$ & $0.1$ & $0.1$ &

818:     $0.2$ & $0.2$ & $0.1$ & $0.2$ & $0.3$ & $0.5$ & $0.7$ & $0.4$

819:     & $1.1$ \\ \hline

820:

821: \end{tabular}

822: \caption{\label{tab5} \small Spread of species in connected

823:   components. Each value indicates the percentage of clusters,

824:   calculated on clusters having size greater than $90$, composed of

825:   proteins coming from only one kingdom, only from a pair of kingdoms,

826:   etc., up to the percentage of clusters composed of proteins of all

827:   kingdoms.}

828:  \end{center}

829: \end{table*}

830:

831: In Table~\ref{tab5}, for each $\bar{w}$, it can be seen how different

832: kingdoms are distributed in connected components. In particular we

833: consider the components whose size is greater than $90$ and

834: record the percentage of clusters whose proteins come from species of

835: only one kingdom, only from a pair of kingdoms, etc., up to the

836: percentage of connected components which contain proteins of all

837: kingdoms. For high values of $\bar{w}$ the majority of clusters are

838: made up of proteins belonging to only one kingdom, in particular the

839: kingdom of viruses; clusters with proteins of different kingdoms are

840: very scarce. As expected, as $\bar{w}$ decreases, the percentage of

841: clusters belonging to only one kingdom decreases in favor of clusters

842: of mixed kingdom composition.
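
The counting behind Table~\ref{tab5} can be sketched as below. This is a minimal Python example, assuming each cluster is given as a list with one kingdom label per protein; the labels, the function name and the input format are hypothetical. Clusters are grouped by the set of kingdoms they contain, and only clusters with size greater than the cutoff ($90$ in the text) contribute.

```python
from collections import Counter


def kingdom_composition(clusters, min_size=90):
    """Percentage of clusters, among those with more than min_size
    proteins, for each set of kingdoms found in a cluster.

    clusters: iterable of lists of kingdom labels, one label per protein.
    Returns a dict mapping frozenset(kingdoms) to a percentage.
    """
    counts = Counter(frozenset(labels) for labels in clusters
                     if len(labels) > min_size)
    total = sum(counts.values())
    if total == 0:
        return {}  # no cluster exceeds the size cutoff
    return {kingdoms: 100.0 * c / total for kingdoms, c in counts.items()}
```

Each key of the result corresponds to one row of the table, for example the set containing only bacteria, or the set containing bacteria and plants.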

843:

844: It is interesting to note that the virus kingdom has a very low

845: tendency to cluster with the other kingdoms, in particular with plants

846: and animalia. Furthermore, for no values of $\bar{w}$ do we see the

847: formation of components (of size greater than $90$) with proteins

848: coming from viruses and invertebrates, and from viruses, plants and

849: invertebrates. Virus proteins cluster mainly with bacterial proteins.

850: In addition we observe that bacterial proteins cluster mainly with

851: plant proteins and vice versa. Moreover, although plant proteins

852: cluster infrequently with invertebrates and with vertebrates, there

853: are many more clusters consisting simultaneously of plant,

854: invertebrate and vertebrate proteins. Finally we note that at the

855: lowest value of $\bar{w}$, the majority of components which are not

856: included in the giant component are clusters consisting of bacterial

857: proteins, of bacterial and plant proteins and of virus proteins.

858:

859: \section{Analysis of the proteins that connect clusters}

860:

861: \begin{figure}[!htb]

862:   \begin{center}

863:     \vspace{-0.4cm}

864:     \subfigure[]{\label{fig10a}

865:       \includegraphics[width=0.48\textwidth]{f10a.eps}

866:     } \vspace{-0.4cm}

867:     \subfigure[]{\label{fig10b}

868:       \includegraphics[width=0.48\textwidth]{f10b.eps}

869:     }

870:     \caption{{\small Length representation of (a) proteins joining

871:         generic clusters and of (b) proteins joining the largest

872:         cluster. The red color encodes overrepresented lengths; the

873:         blue color indicates underrepresented lengths.}}

874:   \end{center}

875: \end{figure}

876:

877:

878: \begin{figure}[!htb]

879:   \begin{center}

880:     \vspace{-0.4cm}

881:     \subfigure[\label{fig11a}]{

882:      \includegraphics[width=0.48\textwidth]{f11a.eps}

883:     }\vspace{-0.4cm}

884:     \subfigure[\label{fig11b}]{

885:       \includegraphics[width=0.48\textwidth]{f11b.eps}

886:     }

887:   \caption{{\small Representation of the low complexity

888:       content of (a) proteins joining generic clusters and of (b)

889:       proteins joining the largest cluster. The red color encodes

890:       overrepresented values; the blue color indicates

891:       underrepresented values.  }}

892:   \end{center}

893: \end{figure}

894:

895: \begin{figure}[!htb]

896:   \begin{center}

897:     \vspace{-0.4cm}

898:     \subfigure[\label{fig12a}]{

899:      \includegraphics[width=0.48\textwidth]{f12a.eps}

900:     }\vspace{-0.4cm}

901:     \subfigure[\label{fig12b}]{

902:       \includegraphics[width=0.48\textwidth]{f12b.eps}

903:     }

904: \caption{{\small Representation of the isoelectric

905:       points of (a) proteins joining generic clusters and of (b)

906:       proteins joining the largest cluster.  The red color encodes

907:       overrepresented values; the blue color indicates

908:       underrepresented values.}}

909:   \end{center}

910: \end{figure}

911:

912: \begin{figure}[!htb]

913:   \begin{center}

914:     \vspace{-0.4cm}

915:     \subfigure[\label{fig13a}]{

916:       \includegraphics[width=0.48\textwidth]{f13a.eps}

917:     }\vspace{-0.4cm}

918:     \subfigure[\label{fig13b}]{

919:       \includegraphics[width=0.48\textwidth]{f13b.eps}

920:     }

921:     \caption{{\small Representation of the predicted number

922: 	of transmembrane helices of (a) proteins joining generic

923: 	clusters and of (b) proteins joining the largest cluster. The

924: 	red color encodes overrepresented values; the blue color

925: 	indicates underrepresented values.}}

926:   \end{center}

927: \end{figure}

928:

929:

930: \begin{figure}[!htb]

931:   \begin{center}

932:     \vspace{-0.4cm}

933:     \subfigure[\label{fig14a}]{

934:      \includegraphics[width=0.48\textwidth]{f14a.eps}

935:     }\vspace{-0.4cm}

936:     \subfigure[\label{fig14b}]{

937:       \includegraphics[width=0.48\textwidth]{f14b.eps}

938:     }

939: \caption{{\small Representation of the predicted signal peptides and

940:       protein localization signals of (a) proteins joining generic

941:       clusters and of (b) proteins joining the largest cluster. The

942:       red color encodes overrepresented values; the blue color

943:       indicates underrepresented values. }}

944:   \end{center}

945: \end{figure}

946:

947: \begin{figure}[!htb]

948:   \begin{center}

949:     \vspace{-0.4cm}

950:     \subfigure[\label{fig15a}]{

951:      \includegraphics[width=0.48\textwidth]{f15a.eps}

952:     }\vspace{-0.4cm}

953:     \subfigure[\label{fig15b}]{

954:       \includegraphics[width=0.48\textwidth]{f15b.eps}

955:     }

956: \caption{{\small Representation of the predicted protein domains of

957:       (a) proteins joining generic clusters and of (b) joining the

958:       largest cluster.  Each line in the graph denotes a certain

959:       domain. The red color encodes overrepresented values; the blue

960:       color indicates underrepresented values.}}

961:   \end{center}

962: \end{figure}

963:

964:

965: Protein pairs that connect clusters in the different weight intervals

966: are of special interest as they harbor the most conserved sequence

967: regions that are shared by the interconnected clusters. We want to

968: know if certain sequence features and protein domains are enriched in

969: these proteins compared to the complete proteome.  Therefore we have

970: calculated several sequence features for all proteins contained in SIMAP:

971: \e{length}, \e{isoelectric point} (using the EMBOSS sequence analysis

972: package \cite{emboss}), \e{low complexity content} (using the program

973: seg \cite{segprog}) and the number of \e{predicted transmembrane

974: segments} (using the program TMHMM \cite{tmhmmprog}).  Additionally,

975: in order to derive functional information, we have

976: predicted \e{signal peptides} (using SignalP 3.0 \cite{signalPprog}),

977: \e{localization signals} (using TargetP 1.1 \cite{targetPprog}) and

978: \e{protein domains} (using the databases PFAM, TIGRFAM, PANTHER,

979: SUPERFAMILY, SMART and PIRSF from InterPro 12.1 \cite{pddb}) for all

980: SIMAP proteins.

981:

982: For all weight intervals we have counted the feature occurrence in the

983: proteins that connect clusters; these proteins are all pairs of

984: sequences which belong to different clusters in the graph built at

985: $\bar{w}_1$ and belonging to the same cluster in the graph built at

986: $\bar{w}_2$, where $\bar{w}_2<\bar{w}_1$ are two consecutive values of

987: the weight $\bar{w}$. We have also distinguished between two disjoint

988: sets of these proteins: proteins linking the clusters that will form

989: the largest cluster in the graph built at $\bar{w}_2$ and proteins

990: linking the other generic clusters.
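
The selection described above can be sketched in Python as follows. The sketch assumes that the connecting pairs are the edges with weight $\bar{w}_2 \leq w < \bar{w}_1$ whose endpoints lie in different clusters at $\bar{w}_1$ (such endpoints necessarily share a cluster at $\bar{w}_2$), and that the component labels at both thresholds have already been computed. This reading of the definition, the function name and the input format are assumptions made for illustration.

```python
from collections import Counter


def connecting_pairs(weighted_edges, label_w1, label_w2, w1, w2):
    """Split the protein pairs that connect clusters between w1 and w2.

    weighted_edges: iterable of (u, v, w).
    label_w1, label_w2: dicts mapping node to component id at thresholds
    w1 and w2, with w2 < w1.
    Returns (pairs joining the largest cluster at w2, other pairs).
    """
    sizes = Counter(label_w2.values())
    largest = max(sizes, key=sizes.get)  # id of the largest cluster at w2

    into_largest, generic = [], []
    for u, v, w in weighted_edges:
        if not (w2 <= w < w1):
            continue  # edge does not appear in this weight interval
        if label_w1[u] == label_w1[v]:
            continue  # endpoints were already in the same cluster at w1
        if label_w2[u] == largest:
            into_largest.append((u, v))
        else:
            generic.append((u, v))
    return into_largest, generic
```

The two returned lists correspond to the two disjoint sets of proteins distinguished in the text.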

991:

992: The enrichment ($e$) of features was calculated as the ratio of the number

993: of features found ($k$) to the number of features expected ($k_E$):

994: $e = k/k_E$. The number of features expected was calculated as $k_E =

995: K\,n/V$, where $n$ is the number of proteins of interest (e.g.

996: connecting clusters in a given weight interval), $K$ denotes the

997: number of proteins used for clustering having the given feature and

998: $V$ corresponds to the number of proteins used for clustering.
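
As a small worked example of this definition, the Python sketch below evaluates $k_E = K\,n/V$ and $e = k/k_E$. The function name and the numbers in the usage note are illustrative assumptions, not values from the SIMAP analysis.

```python
def enrichment(k, n, K, V):
    """Enrichment e = k / k_E, with k_E = K * n / V.

    k: number of proteins of interest having the feature
    n: number of proteins of interest
    K: number of proteins used for clustering having the feature
    V: number of proteins used for clustering
    """
    k_expected = K * n / V
    if k_expected == 0:
        raise ValueError("expected count is zero; enrichment is undefined")
    return k / k_expected
```

For instance, with $V = 1000$, $K = 50$ and $n = 100$ one expects $k_E = 5$ proteins with the feature; observing $k = 20$ gives $e = 4$, an over-representation, whereas $e < 1$ indicates under-representation.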

999:

1000: \subsection{Results}

1001:

1002: Proteins joining clusters outside the largest cluster show an

1003: over-representation of lengths around 400~aa (Figure~\ref{fig10a}),

1004: are enriched in sequences with small low complexity content

1005: (Figure~\ref{fig11a}), are often neutral or weakly acidic

1006: (Figure~\ref{fig12a}) and contain more transmembrane proteins than

1007: expected (Figure~\ref{fig13a}).  Proteins joining clusters in the

1008: giant component are characterized by short and very long lengths

1009: (Figure~\ref{fig10b}), reduced low complexity content

1010: (Figure~\ref{fig11b}), acidic and alkaline proteins, dependent on the

1011: weight interval (Figure~\ref{fig12b}) and a high number of

1012: transmembrane domains in the lower weight intervals

1013: (Figure~\ref{fig13b}).  Signal peptides were found overrepresented in

1014: proteins joining clusters outside the largest component at the lower

1015: weight intervals; at higher weight intervals and in proteins joining

1016: clusters in the largest component they were found underrepresented, as

1017: were localization signals in all proteins joining clusters

1018: (Figure~\ref{fig14a} and Figure~\ref{fig14b}).  For all considered

1019: weight intervals we could find interval-specific overrepresented and

1020: underrepresented protein domains (Figure~\ref{fig15a} and

1021: \ref{fig15b}). Remarkably these domains are not only specific for a

1022: certain weight interval, but also different for proteins joining

1023: clusters outside the largest component and proteins joining clusters

1024: in the largest component (see Table~\ref{tab6}).

1025:

1026: \subsection{Discussion}

1027:

1028: All of the analyzed sequence features indicate that proteins that join

1029: clusters at a certain weight interval are not distributed equally over

1030: the complete protein space. For all of the features we could find

1031: specific under- and over-representation. Proteins joining clusters

1032: outside the largest component and proteins joining clusters in the

1033: largest component are different with respect to almost all considered

1034: features, which indicates that the largest component contains proteins

1035: that are different from those contained in other large clusters. These

1036: findings are complemented by the observation of specific over- and

1037: underrepresented functional domains in the proteins connecting

1038: clusters at certain weight intervals. Thus we conclude that for each

1039: weight interval a small number of protein families is responsible for

1040: cluster interconnections.

1041:

1042: %%%%%%%%%%%%%%%%%%%%%%

1043: \section{Conclusions}

1044:

1045: We investigated the local and global properties of the sequence

1046: similarity space formed by all proteins in the SIMAP database, which

1047: contains more than $5.5$ million amino acid sequences. We represented

1048: this space as a graph whose vertices are proteins and whose edges are

1049: weighted to reflect the similarity between the corresponding pairs of

1050: sequences (high weight, high similarity). The choice of this weight

1051: formula (\ref{eq:weight}) came from the necessity to compare the

1052: similarity score between pairs of sequences that could have different

1053: lengths. The SW score was therefore modified by means of the

1054: self-score geometric mean which contains the length information of the

1055: two aligned sequences.

1056:

1057: Then, keeping only edges with $w \geq \bar w$, we built a collection

1058: of graphs by varying $\bar w$. From the analysis of the connected

1059: components we found that these graphs do not belong to the class of

1060: random graphs; rather, they are characterized by a power law behaviour

1061: both in the cluster size distribution and in the coordination degree

1062: distribution, and for each fixed $\bar w$ these two distributions are

1063: strongly related to each other.

1064:

1065: With the variation of $\bar w$, we found interesting changes in the

1066: global organization of the protein homology networks: we observed two

1067: different phases, one for large values of $\bar w $, characterized by

1068: the presence of clusters with similar dimensions, each composed

1069: essentially of proteins belonging to only one kingdom and with the

1070: largest one composed mainly of viruses, and the second phase, for

1071: lower values of $\bar w$, characterized by the presence of a giant

1072: component composed of different species and other very small

1073: clusters.

1074:

1075: Finally we investigated sequence features and functional

1076: information of protein pairs that are responsible for the connection

1077: of clusters in the different intervals of $\bar w$, since they harbor

1078: the most conserved sequence regions that are shared by the

1079: interconnected clusters. We found that proteins joining clusters

1080: outside the largest component and proteins joining clusters in the

1081: largest component are different with respect to almost all considered

1082: features, which indicates that the largest component contains proteins

1083: that are different from those contained in other large

1084: clusters. Indeed we found an overrepresentation of a small set of

1085: domains which shows that a small number of protein families is

1086: responsible for cluster interconnections.

1087:

1088: The analysis we performed gives a first view of the global

1089: organization of the largest protein homology network built

1090: so far. It is a first step and a starting point for answering other

1091: interesting global or local questions, which could confirm that the

1092: protein homology network is structured with respect to functional and

1093: evolutionary properties.

1094:

1095:

1096: %%%%%%%%%%%%%%%%%%%%%%%%%%%

1097: \section{Acknowledgements}

1098: The authors thank Claudio Destri, Roland Arnold and Mattia Pelizzola

1099: for useful discussions, Michele Caselle for encouraging our

1100: collaboration and Patrick Tischler, Jan Krumsiek and Benedikt

1101: Wachinger for providing the software for protein feature calculation.

1102:

1103: \newpage

1104: \begin{table*}[!htb]

1105:   \begin{center}

1106:       \begin{tabular}{|c|cc|cc|}  \hline

1107:

1108:     $\bar{w}_1 \to \bar{w}_2$ & $e$ & \hspace{-0pt} Proteins joining

1109:     generic clusters & $e$ & \hspace{-0pt} Proteins joining the largest cluster \\

1110:     \hline

1111:

1112:     & $0.02$	& PF00598 Flu\_M1          & $0.93$	& PF00078 RVT\_1 \\

1113:     & $0.03$	& PF00522 VPR              & $1.08$	& PF00075 RnaseH \\

1114:     & $0.03$	& PF00540 Gag\_p17         & $1.44$	& PF06815 RVT\_connect \\

1115:     & $0.03$	& PF00951 Arteri\_Gl       & $1.46$	& PF07075 DUF1343 \\

1116:     & $0.03$	& PF00971 EIAV\_GP90       & $2.19$	& PF00665 rve \\

1117:     $0.750$ $\to$ $0.725$ & & & & \\

1118:     & $9.40$	& PF02916 DNA\_PPF         & $15.41$	& PF00607 Gag\_p24 \\

1119:     & $11.09$	& PF07095 IgaA             & $18.79$	   & PF00517 GP41 \\

1120:     & $11.25$	& PF08272 Topo\_Zn\_Ribbon & $18.91$	& PF02022 Integrase\_Zn \\

1121:     & $11.83$	& PF06899 WzyE             & $27.07$	& PF00540 Gag\_p17 \\

1122:     & $12.46$	& PF06788 UPF0257          & $137.49$	& PF00516 GP120 \\ \hline

1123:

1124:     & &                                & $0.88$	& PF00078 RVT\_1 \\

1125:     & &                                & $1.16$	& PF00077 RVP \\

1126:     & &                                & $1.91$	& PF06817 RVT\_thumb \\

1127:     & &                                & $3.68$	& PF00075 RnaseH \\

1128:     & &                                & $3.77$	& PF00665 rve \\

1129:     $0.725$ $\to$ $0.700$ & & & & \\

1130:     & &                                & $37.19$	& PF00186 DHFR\_1 \\

1131:     & &                                & $80.26$	& PF00098 zf-CCHC \\

1132:     & &                                & $129.77$	& PF00516 GP120 \\

1133:     & &                                & $139.92$	& PF00607 Gag\_p24 \\

1134:     & &                                & $145.50$	& PF00540 Gag\_p17 \\ \hline

1135:

1136:

1137:     & $0.01$	& PF00516 GP120            & $0.12$	& PF00098 zf-CCHC \\

1138:     & $0.01$	& PF00522 VPR              & $0.15$	& PF00271 Helicase\_C \\

1139:     & $0.01$	& PF00602 Flu\_PB1         & $0.22$	& PF00078 RVT\_1 \\

1140:     & $0.01$	& PF00603 Flu\_PA          & $1.02$	& PF01560 HCV\_NS1 \\

1141:     & $0.01$	& PF01539 HCV\_env         & $1.16$	& PF06817 RVT\_thumb \\

1142:     $0.700$ $\to$ $0.675$ & & & & \\

1143:     & $10.14$	& PF08435 Calici\_coat\_C  & $15.62$	& PF02907 Peptidase\_S29 \\

1144:     & $10.22$	& PF03296 Pox\_polyA\_pol  & $19.47$	& PF00517 GP41 \\

1145:     & $12.94$	& PF05733 Tenui\_N         & $57.66$	& PF00516 GP120 \\

1146:     & $12.98$	& PF03805 CLAG             & $74.03$	& PF00077 RVP \\

1147:     & $13.68$	& PF00897 Orbi\_VP7        & $98.38$	& PF02348 CTP\_transf\_3 \\ \hline

1148:

1149:     & $0.01$	& PF00064 Neur             & $0.10$	   & PF00078 RVT\_1 \\

1150:     & $0.01$	& PF00469 F-protein        & $0.13$	& PF00077 RVP \\

1151:     & $0.01$	& PF00506 Flu\_NP          & $0.18$	& PF00560 LRR\_1 \\

1152:     & $0.01$	& PF00516 GP120            & $0.18$	& PF00607 Gag\_p24 \\

1153:     & $0.01$	& PF00540 Gag\_p17         & $0.30$	& PF00665 rve \\

1154:     $0.675$ $\to$ $0.650$ & & & & \\

1155:     & $11.63$	& PF04310 MukB             & $151.92$	& PF02959 Tax \\

1156:     & $12.71$	& PF07108 PipA             & $168.64$	& PF00758 EPO\_TPO \\

1157:     & $13.48$	& PF07429 Fuc4NAc\_transf  & $431.37$	& PF08300 HCV\_NS5a\_1 \\

1158:     & $15.20$	& PF03506 Flu\_C\_NS1      & $441.03$	& PF08301 HCV\_NS5a\_1b \\

1159:     & $15.26$	& PF06593 RBDV\_coat       & $483.96$	& PF01506 HCV\_NS5a \\ \hline

1160:

1161:     & $0.01$	& PF00506 Flu\_NP    & $0.03$	& PF00096 zf-C2H2 \\

1162:     & $0.01$	& PF00516 GP120      & $0.04$	& PF00078 RVT\_1 \\

1163:     & $0.01$	& PF00540 Gag\_p17   & $0.17$	& PF00023 Ank \\

1164:     & $0.01$	& PF00603 Flu\_PA    & $0.17$	& PF00589 Phage\_integrase \\

1165:     & $0.01$	& PF00695 vMSA       & $0.19$	& PF00903 Glyoxalase \\

1166:     $0.650$ $\to$ $0.625$ & & & & \\

1167:     & $12.57$	& PF06952 PsiA       & $202.08$	& PF01002 Flavi\_NS2B \\

1168:     & $13.73$	& PF06788 UPF0257    & $221.93$	& PF01349 Flavi\_NS4B \\

1169:     & $14.79$	& PF05788 Orbi\_VP1  & $222.59$	& PF01353 GFP \\

1170:     & $15.42$	& PF00901 Orbi\_VP5  & $229.23$	& PF01350 Flavi\_NS4A \\

1171:     & $16.02$	& PF03753 HHV6-IE    & $243.38$	& PF00948 Flavi\_NS1 \\ \hline

1172:

1173:  \end{tabular}

1174:   \end{center}

1175: \end{table*}

1176:

1177: \begin{table*}[!htb]

1178:   \begin{center}

1179:     \begin{tabular}{|c|cc|cc|}  \hline

1180:

1181:     & $0.01$	& PF00124 Photo\_RC         & $0.09$  & PF00009 GTP\_EFTU \\

1182:     & $0.01$	& PF00603 Flu\_PA           & $0.13$  & PF07974 EGF\_2 \\

1183:     & $0.01$	& PF00695 vMSA              & $0.20$  & PF00096 zf-C2H2 \\

1184:     & $0.01$	& PF01560 HCV\_NS1          & $0.22$  & PF00560 LRR\_1 \\

1185:     & $0.02$	& PF00223 PsaA\_PsaB        & $0.23$  & PF01546 Peptidase\_M20 \\

1186:     $0.625$ $\to$ $0.600$ & & & & \\

1187:     & $11.95$	& PF06517 Orthopox\_A43R    & $376.41$ & PF01002 Flavi\_NS2B \\

1188:     & $12.09$	& PF00843 Arena\_nucleocap  & $403.70$ & PF00948 Flavi\_NS1 \\

1189:     & $13.08$	& PF06802 DUF1231           & $411.72$ & PF01349 Flavi\_NS4B \\

1190:     & $14.72$	& PF05273 Pox\_RNA\_Pol\_22 & $425.27$ & PF01350 Flavi\_NS4A \\

1191:     & $16.90$	& PF03021 CM2               & $538.21$ & PF05408 Peptidase\_C28 \\ \hline

1192:

1193:     & $0.01$	& PF00517 GP41             & $0.06$	& PF00096 zf-C2H2 \\

1194:     & $0.01$	& PF00559 Vif              & $0.06$	& PF00097 zf-C3HC4 \\

1195:     & $0.01$	& PF00600 Flu\_NS1         & $0.09$	& PF00009 GTP\_EFTU \\

1196:     & $0.01$	& PF00969 MHC\_II\_beta    & $0.09$	& PF01266 DAO \\

1197:     & $0.01$	& PF06815 RVT\_connect     & $0.11$	& PF01926 MMR\_HSR1 \\

1198:     $0.600$ $\to$ $0.575$ & & & & \\

1199:     & $10.54$	& PF02477 Nairo\_nucleo    & $133.87$ & PF05790 C2-set \\

1200:     & $11.95$	& PF07982 Herpes\_UL74     & $139.12$ & PF01353 GFP \\

1201:     & $12.30$	& PF06871 TraH\_2          & $150.11$ & PF00518 E6 \\

1202:     & $14.14$	& PF02509 Rota\_NS35       & $195.29$ & PF02929 Bgal\_small\_N \\

1203:     & $16.04$	& PF06929 Rotavirus\_VP3   & $231.71$ & PF01382 Avidin \\ \hline

1204:

1205:     & $0.01$	& PF00016 RuBisCO\_large   & $0.02$	& PF00115 COX1 \\

1206:     & $0.01$	& PF00113 Enolase\_C       & $0.07$	& PF07690 MFS\_1 \\

1207:     & $0.01$	& PF00123 Hormone\_2       & $0.08$	& PF07993 NAD\_binding\_4 \\

1208:     & $0.01$	& PF00506 Flu\_NP          & $0.09$	& PF00517 GP41 \\

1209:     & $0.01$	& PF01010 Oxidored\_q1\_C  & $0.10$	& PF00583 Acetyltransf\_1 \\

1210:     $0.575$ $\to$ $0.550$ & & & & \\

1211:     & $10.60$	& PF06134 RhaA             & $161.43$ & PF01140 Gag\_MA \\

1212:     & $10.95$	& PF07095 IgaA             & $168.19$ & PF04528 Adeno\_E4\_34 \\

1213:     & $11.75$	& PF00897 Orbi\_VP7        & $173.44$ & PF08377 MAP2\_projctn \\

1214:     & $12.13$	& PF03294 Pox\_Rap94       & $184.23$ & PF02093 Gag\_p30 \\

1215:     & $13.75$	& PF01295 Adenylate\_cycl  & $311.32$ & PF01141 Gag\_p12 \\ \hline

1216:

1217:     & $0.01$	& PF00016 RuBisCO\_large  & $0.06$	& PF00067 p450 \\

1218:     & $0.01$	& PF00516 GP120           & $0.07$	& PF00023 Ank \\

1219:     & $0.01$	& PF00522 VPR             & $0.08$	& PF00097 zf-C3HC4 \\

1220:     & $0.01$	& PF00540 Gag\_p17        & $0.11$	& PF01381 HTH\_3 \\

1221:     & $0.01$	& PF01539 HCV\_env        & $0.11$	& PF04851 ResIII \\

1222:     $0.550$ $\to$ $0.525$             &       &                         &             &  \\

1223:     & $11.29$	& PF05928 Zea\_mays\_MuDR & $101.41$	& PF01537 Herpes\_glycop\_D \\

1224:     & $11.62$	& PF06829 DUF1238         & $121.18$	& PF02929 Bgal\_small\_N \\

1225:     & $11.63$	& PF03277 Herpes\_UL4     & $123.25$	& PF01376 Enterotoxin\_b \\

1226:     & $11.64$	& PF03395 Pox\_P4A        & $128.24$	& PF06466 PCAF\_N \\

1227:     & $12.73$	& PF08405 Calici\_PP\_N   & $147.36$	& PF05806 Noggin \\ \hline

1228:

1229:

1230:     & $0.01$	& PF00600 Flu\_NS1           & $0.02$	& PF00106 adh\_short \\

1231:     & $0.01$	& PF00869 Flavi\_glycoprot   & $0.04$	& PF00270 DEAD \\

1232:     & $0.01$	& PF01539 HCV\_env           & $0.05$	& PF00037 Fer4 \\

1233:     & $0.01$	& PF02461 AMO                & $0.06$	& PF02518 HATPase\_c \\

1234:     & $0.01$	& PF02788 RuBisCO\_large\_N  & $0.08$	& PF00249 Myb\_DNA-binding \\

1235:     $0.525$ $\to$ $0.500$	&     &                           &        & \\

1236:     & $11.36$	& PF07434 CblD              & $68.92$	& PF03939 Ribosomal\_L23eN \\

1237:     & $11.80$	& PF04913 Baculo\_Y142      & $72.11$	& PF06267 DUF1028 \\

1238:     & $11.98$	& PF05880 Fiji\_64\_capsid  & $96.66$	& PF02022 Integrase\_Zn \\

1239:     & $13.48$	& PF06306 CgtA              & $120.34$	& PF00552 Integrase \\

1240:     & $13.98$	& PF03317 ELF               & $129.98$	& PF02929 Bgal\_small\_N \\ \hline

1241:   \end{tabular}

1242: \caption{\label{tab6} \small For proteins joining clusters outside the

1243:   largest component or joining the giant component, the five most

1244:   underrepresented and the five most overrepresented PFAM domains are

1245:   given for each interval of the weight $w$.}

1246:  \end{center}

1247: \end{table*}

1248:

1249:

1250: \begin{thebibliography}{99}

1251:

1252: \bibitem{revEvol} E.V.~Koonin, {\it Orthologs, Paralogs, and

1253:     Evolutionary Genomics.}, {\tt Annu. Rev. Genet. 2005 39:309-38}

1254:

1255: \bibitem{simap} R.~Arnold, T.~Rattei, P.~Tischler, M.~Truong,

1256: V.~St\"{u}mpflen, W.~Mewes, {\it SIMAP - The similarity matrix of

1257: proteins}, {\tt Bioinformatics {\bf 21}, ii42-ii46 (2005)}

1258:

1259: \bibitem{phn} D.~Medini, A.~Covacci, C.~Donati, {\it Protein homology

1260:     network families reveal step-wise diversification of type III and

1261:     type IV secretion systems.}, {\tt PLOS Computational Biology 2

1262:     1543-1551 (2006)}

1263:

1264: \bibitem{erdos_renyi} P.~Erd\"os, A.~R\'enyi, {\it On random graphs I},

1265: {\tt Publ. Math. Debrecen {\bf 6}, 290-297 (1959)}

1266:

1267: \bibitem{burda} L.~Bogacz, Z.~Burda, W.~Janke, B.~Waclaw, {\it A

1268: program generating homogeneous random graphs with given weights},

1269: [{\tt cond-mat/0506330}].

1270:

1271: \bibitem{emboss} P.~Rice, I.~Longden, et al., {\it EMBOSS: the

1272:     European Molecular Biology Open Software Suite}, {\tt Trends Genet

1273:     16(6): 276-7 (2000)}

1274:

1275: \bibitem{segprog} J.C.~Wootton, {\it Sequences with `unusual' amino

1276:     acid compositions.}, {\tt Curr. Opin. Struct. Biol 4: 413-421

1277:     (1994)}

1278:

1279: \bibitem{tmhmmprog} A.~Krogh, B.~Larsson, et al., {\it Predicting

1280:     transmembrane protein topology with a hidden Markov model:

1281:     application to complete genomes.}, {\tt J. Mol. Biol 305(3):

1282:     567-580 (2001)}

1283:

1284: \bibitem{signalPprog} J.D.~Bendtsen, H.~Nielsen, et al., {\it Improved

1285:     prediction of signal peptides: SignalP 3.0.}, {\tt Journal of

1286:     Molecular Biology 340(4): 783-795 (2004)}

1287:

1288: \bibitem{targetPprog} O.~Emanuelsson, H.~Nielsen, et al., {\it

1289:     Predicting subcellular localization of proteins based on their

1290:     N-terminal amino acid sequence.}, {\tt Journal of Molecular

1291:     Biology 300(4): 1005-1016 (2000)}

1292:

1293: \bibitem{pddb} N.J.~Mulder, R.~Apweiler, et al., {\it InterPro,

1294:     progress and status in 2005.}, {\tt Nucleic Acids Research 33

1295:     (Database issue): D201-5 (2005)}

1296:

1297: \end{thebibliography}

1298:

1299: \end{document}
