
1: \documentclass[12pt,a4paper,fleqn,twoside]{phart}

2: \usepackage{amssymb,amsmath,color}

3: \usepackage{graphicx,overcite,pxfonts}

4:

5: \definecolor{gry}{rgb}{0.3,0.3,0.3}

6: \renewcommand\leq\leqslant

7: \renewcommand\geq\geqslant

8: \renewcommand{\thefootnote}{\roman{footnote}}

9:

10: \begin{document}

11:

12: \thispagestyle{empty}

13:

14: \noindent\hspace{0.08\linewidth}

15: \begin{minipage}[l]{0.92 \linewidth}

16:

17:   \vspace{3cm}

18:

19:   \noindent\LARGE\textsf{\textbf{\textcolor{gry}{Discovery and analysis of\\

20:       biochemical subnetwork hierarchies}}}

21:

22:   \vspace{1cm}

23:

24:   \noindent

25:   \large\sffamily\textbf{Petter Holme}\newline Department of Physics, Ume{\aa}

26:     University\newline 901$\,$87 Ume{\aa}, Sweden\vspace{0.35cm}

27:

28:     \textbf{Mikael Huss}\newline SANS, NADA, Royal Institute of

29:   Technology\newline100$\,$44 Stockholm, Sweden

30:

31:   \vspace{1cm}

32:

33:   \normalsize\noindent\textbf{Abstract}

34:

35:   \rmfamily\noindent

36:   The representation of a biochemical network as a graph is the

37:   coarsest level of description in cellular biochemistry. By studying

38:   the network structure one can draw conclusions on the large scale

39:   organisation of the biochemical processes. We describe methods for

40:   extracting hierarchies of subnetworks, and show how these can be

41:   interpreted and further deconstructed to find autonomous

42:   subnetworks. The large-scale organisation we find is characterised

43:   by a tightly connected core surrounded by increasingly loosely

44:   connected substrates.

45: \end{minipage}

46:

47:

48: \pagestyle{myheadings}

49: \markboth{Holme \& Huss\hspace{3.5mm}

50:   \textit{Biochemical subnetwork

51:   hierarchies}}{\textit{Biochemical subnetwork

52:   hierarchies}\hspace{3.5mm} Holme \& Huss}

53:

54: \section{Introduction}

55:

56: At the coarsest level of description, cellular biochemistry can be

57: represented as a network of vertices (substrates) linked by chemical

58: reactions. For both conceptual and analytical purposes, the vastness

59: and complexity of these biochemical networks call for a division into

60: smaller subunits. This is nothing new---traditionally biochemists have

61: talked about functional subnetworks, the citric acid cycle being one

62: example, composed of biochemical pathways. As modern-day genomics

63: gives an increasingly comprehensive picture of the biochemical network

64: one would like to complement the traditional way of mapping out

65: subnetworks by objective graph theoretical methods. By such methods

66: we can address not only the question of which relevant subnetworks there

67: are, but also the hierarchical organisation of subnetworks (can

68: subnetworks be said to consist of smaller subnetworks, and so on), and

69: also more fundamental questions about the contexts in which the subnetwork

70: concept is relevant and when the biochemical circuitry is to be

71: considered as a functional whole.

72:

73: The graph-theoretical signature for a subnetwork is that it is

74: internally densely connected but has relatively few links to the rest

75: of the graph. Other methods for detecting

76: subnetworks~\cite{schus:dec,patra,john:hcs,bara:modhie} have been

77: based on local properties such as the number of reactions a substrate

78: takes part in, or the similarity of the neighbourhood. Since non-local

79: features can heavily affect network dynamics~\cite{holme:traffic},

80: one would prefer methods that take these into account. Here, we

81: discuss global algorithms for subnetwork detection, in particular

82: methods based on the betweenness centrality measure.

83:

84: \section{Preliminaries}

85:

86: \subsection{Biochemical networks as bipartite graphs}

87:

88: A \textit{bipartite} graph\footnote{Or, to be precise, a

89:     \textit{two-mode representation} of a bipartite graph. The formal

90:     definition of bipartiteness is just that a graph contains no odd

91:     circuits.} consists of two types of vertices and links

92: that only go between vertices of different type.

93: We represent the biochemical networks as directed bipartite graphs

94: $G=(S,R,L)$ where $S$ is a set of vertices representing substrates, $R$

95: is a set of vertices representing chemical reactions, and $L$ is the

96: set of directed links---ordered pairs of one vertex in $S$ and one vertex

97: in $R$. The links are such that if the substrates $s_1,\cdots,s_n$ are

98: involved in a reaction $r\in R$ with products

99: ${s'}_1,\cdots,{s'}_{n'}\in S$, then we have

100: $(s_1,r),\cdots,(s_n,r)\in L$ and $(r,{s'}_1),\cdots, (r,{s'}_{n'})\in

101: L$. The number of links leading to a vertex is called

102: \textit{in-degree} and denoted $k_\mathrm{in}$.
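
As a minimal illustration of this notation (the reaction and the labels $a$, $b$, $c$ and $r_1$ are hypothetical, chosen only for this sketch), consider a single reaction $r_1$ that consumes $a$ and $b$ and produces $c$. Then
\begin{equation*}
  S=\{a,b,c\},\quad R=\{r_1\},\quad L=\{(a,r_1),\,(b,r_1),\,(r_1,c)\}~,
\end{equation*}
so that $k_\mathrm{in}(r_1)=2$, $k_\mathrm{in}(c)=1$ and $k_\mathrm{in}(a)=k_\mathrm{in}(b)=0$.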

103:

104: \subsection{Betweenness centrality}

105:

106: Roughly speaking, the betweenness centrality~\cite{antonis} $C_B$ of a

107: vertex $v$ in an undirected graph is the number of shortest paths

108: between pairs of vertices that pass through $v$. For the

109: purposes of this work we are interested in reaction vertices that are

110: central for paths between metabolites or other molecules; thus we

111: restrict our definition of betweenness to the reaction vertices

112: only. The precise definition then becomes:

113: \begin{equation}

114:   \label{eq:CB}

115:   C_B(r)=\sum_{s\in S}\sum_{s'\in S\setminus\{s\}}\frac{\sigma_{ss'}(r)}

116:   {\sigma_{ss'}}~,

117: \end{equation}

118: where $\sigma_{ss'}(r)$ is the number of shortest paths between $s$ and

119: $s'$ that pass through $r$, and $\sigma_{ss'}$ is the total number of

120: shortest paths between $s$ and $s'$. Since all substrates need to be

121: present for a reaction to occur it is meaningful to rescale the

122: betweenness by the in-degree:

123: \begin{equation}

124:   \label{eq:CBeff}

125:   c_B(r)=C_B(r)/k_\mathrm{in}(r)~.

126: \end{equation}

127: We call $c_B$ the \textit{effective betweenness} of $r$.
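
To see how the rescaling in Eq.~\ref{eq:CBeff} can change the ranking of reactions, consider a hypothetical toy network (not taken from the data) with substrates $a,b,c,d$ and reactions $r_1$: $a+b\to c$ and $r_2$: $c\to d$. For this sketch we assume that paths follow the link directions and that ordered pairs $(s,s')$ with no connecting path contribute zero to Eq.~\ref{eq:CB}. Every connected pair then has exactly one shortest path, so each term of the sum is either $0$ or $1$:
\begin{align*}
  C_B(r_1) &= \bigl|\{(a,c),(a,d),(b,c),(b,d)\}\bigr| = 4~, &
  C_B(r_2) &= \bigl|\{(a,d),(b,d),(c,d)\}\bigr| = 3~.
\end{align*}
With $k_\mathrm{in}(r_1)=2$ and $k_\mathrm{in}(r_2)=1$ this gives
\begin{equation*}
  c_B(r_1)=4/2=2~,\qquad c_B(r_2)=3/1=3~,
\end{equation*}
so $r_1$ has the larger betweenness, whereas $r_2$ has the larger effective betweenness.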

128:

129: \subsection{Girvan and Newman's algorithm}

130:

131: The algorithm for tracing subnetworks we use is due to Girvan and

132: Newman (GN)~\cite{gir:alg}, but in a form adapted to bipartite

133: representations of biochemical networks as presented in

134: Ref.~\citen{hhj:sub}. The idea of the algorithm is based on the fact

135: that vertices that lie between densely connected areas have high

136: betweenness, and vice versa. Thus by successively removing reaction

137: vertices with high betweenness one will see the network disintegrate into

138: subnetworks of decreasing size. Furthermore, the smaller subnetworks

139: remaining after many iterations will be perfectly contained in

140: subnetworks found earlier in the execution of the algorithm; thus the method

141: produces a full hierarchy of subnetworks.

142:

143: The precise definition of the algorithm is to repeat the following

144: steps until no reaction vertices remain:

145: \begin{enumerate}

146: \item Calculate the effective betweenness $c_B(r)$ for all reaction

147:   vertices.

148: \item\label{step2} Remove the reaction vertex with highest effective

149:   betweenness and all its in- and out-going links.

150: \item\label{step3} Save information about the current state of the

151:   network.

152: \end{enumerate}

153: If several reaction vertices share the highest $c_B$ in step~\ref{step2},

154: we remove all of them at once. A C implementation of this algorithm

155: along with test data sets can be found at

156: \texttt{www.tp.umu.se/forskning/networks/meta/}.
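
As a sketch of how the iteration proceeds, take a hypothetical toy network (not part of the data) with reactions $r_1$: $a+b\to c$ and $r_2$: $c\to d$. We assume here that shortest paths follow the link directions, that unconnected pairs contribute zero to the betweenness, and that clusters are the connected subgraphs obtained when link directions are ignored. In the first iteration $C_B(r_1)=4$ and $C_B(r_2)=3$, so that
\begin{equation*}
  c_B(r_1)=4/2=2~,\qquad c_B(r_2)=3/1=3~,
\end{equation*}
and $r_2$ is removed, which splits off $d$. In the second iteration only $r_1$ remains, with $C_B(r_1)=2$ (from the pairs $(a,c)$ and $(b,c)$) and $c_B(r_1)=1$; removing it isolates all substrates. The saved states then give the hierarchy
\begin{equation*}
  \{a,b,c,d\}\;\longrightarrow\;\{a,b,c\},\{d\}\;\longrightarrow\;\{a\},\{b\},\{c\},\{d\}~.
\end{equation*}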

157:

158: \section{A case-study: \textit{T.\ pallidum}}

159:

160: \begin{figure}

161:   \centering{\resizebox*{\textwidth}{!}{\includegraphics{tr2.eps}}}

162: \caption{The hierarchical clustering tree for the metabolic network of

163:   \textit{T.\ pallidum}. The inset shows substrate names for a blow-up

164:   of the tree (indicated in black).}\label{fig:tree}

165: \end{figure}

166:

167:

168: To illustrate the output of the algorithm, and how it can be

169: post-processed, we choose the metabolic network of \textit{T.\

170:   pallidum}---the causative agent of syphilis---as obtained from

171: the WIT database~\cite{wit}\footnote{This is the same data as used in

172:   Refs.~\citen{jeong:meta,bara:modhie}, and thus slightly outdated, but

173:   it should work well for illustrating the method.}.

174:

175: \subsection{The large scale shape of the hierarchy trees}

176:

177: The subnetwork hierarchy of \textit{T.\ pallidum}'s metabolic network

178: is presented as a tree (a so-called \textit{dendrogram}) in

179: Fig.~\ref{fig:tree}. The end-points at the base of the

180: dendrogram represent the substrate vertices of the metabolic

181: network. The vertical dimension represents the hierarchical level---if

182: a horizontal line is drawn across the dendrogram, the vertices

183: connected below the line belong to the same cluster (connected

184: subgraph) at that particular level of the hierarchical

185: organisation. The further down the tree two vertices are connected,

186: the more tightly connected they are in the biochemical network. If one

187: substrate is to be converted to another that is separated from the

188: first one high up in the dendrogram, then a long chain of

189: reactions is needed. If, on the other hand, the two vertices are

190: connected near the bottom of the dendrogram, then they probably are

191: both present in one or more reactions.

192:

193: The most striking feature of Fig.~\ref{fig:tree}, and indeed of the trees of all

194: 43 organisms of Ref.~\citen{jeong:meta}, is that the network has one

195: dominating cluster at most levels of the hierarchy. As the algorithm

196: proceeds (one goes from top to bottom of the dendrogram) a few

197: vertices at a time peel off from the largest connected cluster. The

198: emerging picture is that the large-scale structure of the metabolic

199: network has a tightly connected core and increasingly loosely

200: connected outer `shells.' A few rather well-defined subnetworks are

201: identified, however, for example the subnetworks of Fig.~\ref{fig:tree}

202: containing reactions associated with purine metabolism and

203: pyruvate/acetyl-CoA conversion.

204:

205: \subsection{Criteria for identifying subnetworks}

206:

207: We can identify subnetworks by looking at the hierarchy tree: if a

208: subnetwork is isolated at some level (like the

209: \textit{N}-acetyl-D-glu\-cos\-amine 1-phos\-phate, D-glu\-cos\-amine

210: 1-phos\-phate, dihydrolipoamide,

211: \textit{S}-ace\-tyl\-di\-hydro\-lipo\-amide, CoA, and acetyl-CoA

212: network of Fig.~\ref{fig:tree} at level $h$) then it is comparatively

213: well connected within itself relative to its surroundings. If the cluster is

214: isolated close to the top of the dendrogram, then it is not very

215: entangled in the wirings of metabolic pathways, and is likely to be a

216: reasonably autonomously functioning module. Can we establish objective

217: criteria for subnetworks to be regarded as meaningful modules?

218: For example, Ref.~\citen{bara:modhie} detects modules in an indirect

219: way using a very weak criterion: roughly speaking, that substrates

220: are likely to belong to the same module if they appear in reactions

221: involving the same set of other substrates. To identify groups in

222: social networks, Radicchi \textit{et al.}\cite{castel:comm}\ suggested two

223: criteria that, adapted to biochemical networks, become as

224: follows: If, during the iterations of the GN algorithm, an

225: isolated vertex set $S'\subset S$ fulfils the following criterion it

226: is said to be a \textit{weak community}:

227: \begin{equation}\label{eq:weak}

228:   \sum_{s\in S'}K_\mathrm{in}(s) >  \sum_{s\in S'}K_\mathrm{out}(s)~,

229: \end{equation}

230: and a \textit{strong community} if:

231: \begin{equation}\label{eq:strong}

232:   K_\mathrm{in}(s) >  K_\mathrm{out}(s) \mbox{~for all~} s\in S'~,

233: \end{equation}

234: where $K_\mathrm{in}(s)$ is the number of $s'\in S'$ that are

235: products of a reaction involving the substrate $s$, and

236: $K_\mathrm{out}(s)$ is the number of $s'\in S\setminus S'$ that are

237: products of a reaction involving the substrate $s$. Loosely

238: speaking, Eq.~\ref{eq:weak} means that there are, on average, more

239: feedback pathways back into $S'$ than pathways leading out to the

240: rest of the network. If the strong condition (Eq.~\ref{eq:strong})

241: holds, then products of all reactions involving substrates $s\in S'$

242: are more likely to belong to $S'$ than not. It turns out that

243: Eq.~\ref{eq:strong} is fulfilled for almost no cluster at any

244: but the lowest level of the hierarchy (closest to the bottom of the

245: dendrogram).  Eq.~\ref{eq:weak}, on the other hand, is fulfilled for

246: the largest cluster throughout all iterations of the algorithm. (This

247: picture persists for all 43 WIT organisms studied in

248: Ref.~\citen{jeong:meta}.) That the subnetworks of cellular biochemistry

249: almost completely lack the community structure of social networks, or the

250: component structure of electronic devices, does not necessarily mean

251: that it is futile to talk of biochemical modules. For a subnetwork to

252: have some degree of autonomy it has to have some self-regulatory

253: function, and thus a feedback loop. To implement this idea, consider

254: the subnetworks with substrate vertex set $S'$ that fulfil:

255: \begin{equation}

256:   L(S') \geq  \Lambda|S'|~,\label{eq:l}

257: \end{equation}

258: where $L(S')$ is the number of vertices in $S'$ that lie on an

259: elementary cycle (a closed non-self-intersecting path) of only vertices

260: in $S'$ and length larger than three, $|S'|$ is the number of

261: vertices in $S'$, and the parameter $\Lambda\in[0,1]$ is the required

262: fraction of feedback loop vertices. We test the three cases where

263: $\Lambda$ equals $0$, $1/2$ and $1$, corresponding to the subnetwork

264: having at least one feedback loop, more than half of the substrates,

265: or every substrate participating in a feedback loop,

266: respectively. The largest cluster close to the top of the dendrogram

267: quite naturally fulfils Eq.~\ref{eq:l} when $\Lambda$ is small (in our

268: case $0$ or $1/2$); therefore we detect subnetworks starting from the

269: bottom of the dendrogram and going upwards.  With each one of these

270: criteria we find non-trivial subnetworks. Of the subnetworks of

271: Fig.~\ref{fig:tree} the hardest requirement, $\Lambda = 1$, detects two

272: relevant subnetworks---the one containing CoA and the innermost one

273: containing orthophosphate: $\alpha$-D-ribose 1-phosphate,

274: $\alpha$-D-ribose 1-pyrophosphate, adenine, adenosine, hypoxanthine,

275: inosine, and orthophosphate. The extended orthophosphate subnetwork

276: still connected at level $h$ (also containing e.g.\ guanine) is

277: regarded as a valid subnetwork with $\Lambda = 1/2$, but not with

278: $\Lambda = 1$. To assign an appropriate $\Lambda$ requires a careful

279: look at the problem in question, but as a rule of thumb $\Lambda$

280: close to one seems sensible for most applications.
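
The criteria above can be compared on a hypothetical example (the vertices and reactions are illustrative only and are not taken from the \textit{T.\ pallidum} data). Let the full network contain the reactions $r_1$: $a\to b$, $r_2$: $b\to a$ and $r_3$: $b\to c$, and consider $S'=\{a,b\}$. For this sketch we read `a reaction involving $s$' as a reaction that consumes $s$, count the products in the full network, and measure cycle length in links of the bipartite graph; all three readings are assumptions. Then
\begin{equation*}
  K_\mathrm{in}(a)=1,\quad K_\mathrm{out}(a)=0,\quad
  K_\mathrm{in}(b)=1,\quad K_\mathrm{out}(b)=1~.
\end{equation*}
The sums over $S'$ are $2>1$, so Eq.~\ref{eq:weak} holds, whereas $K_\mathrm{in}(b)=K_\mathrm{out}(b)$ violates Eq.~\ref{eq:strong}: $S'$ is a weak but not a strong community. The cycle $a\to r_1\to b\to r_2\to a$ has length four and passes through both substrates of $S'$, so $L(S')=2=|S'|$, i.e.\ every substrate of $S'$ lies on a feedback loop, which is the case $\Lambda=1$ in the text.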

281:

282: \section{Conclusions}

283:

284: Finding subnetworks of cellular biochemistry is an important task for

285: modern bioinformatics, for both conceptual and analytical

286: purposes. There are two general ways to proceed, either one searches

287: for small building blocks (cf.\ Ref.~\citen{alon}) or one tries to

288: deconstruct the whole network. Our approach falls into the second

289: category. By adapting an algorithm~\cite{gir:alg} for subnetwork

290: detection to biochemical networks we construct hierarchy trees,

291: dendrograms,  representing the whole hierarchical organisation of

292: subnetworks of biochemical pathways. We find that biochemical networks

293: cannot be divided into subnetworks as easily as e.g.\ acquaintance

294: networks and electronic circuits~\cite{hhj:sub}. Against this

295: backdrop it is not surprising that some recent criteria

296: (Eqs.~\ref{eq:weak} and \ref{eq:strong}) for extracting meaningful

297: social subnetworks fail to give non-trivial results. As a remedy, we

298: propose conditions based on the presence of feedback loops within a

299: subnetwork. The above methods are illustrated by an application to the

300: metabolic network of \textit{T.\ pallidum}; we have also tested them on

301: the metabolic and whole-cellular networks (containing e.g.\

302: transmembrane transport and signal transduction) of 42 other organisms

303: of the WIT database~\cite{wit}, and obtained sensible output.

304:

305: \subsection*{Acknowledgements}

306: Thanks are due to Claudio Castellano, Hawoong Jeong and Petter

307: Minn{\-}hagen. P.H.\ was partially supported by the Swedish Research

308: Council through contract no.\ 2002-4135.

309:

310: \begin{thebibliography}{10}

311:

312: \bibitem{antonis}

313: J.~M. Anthonisse, {\em The rush in a directed graph}, Tech. Rep. BN 9/71,

314:   Stichting Mathematisch Centrum, 1971.

315:

316: \bibitem{gir:alg}

317: M.~Girvan and M.~E.~J. Newman, {\em Community structure in social and

318:   biological networks}, Proc. Natl. Acad. Sci. USA, {\bf 99} (2002),

319:   pp.~7821--7826.

320:

321: \bibitem{holme:traffic}

322: P.~Holme, {\em Congestion and centrality in traffic flow on complex networks},

323:   Adv. Complex Syst., {\bf 6} (2003), pp.~163--176.

324:

325: \bibitem{hhj:sub}

326: P.~Holme, M.~Huss, and H.~Jeong, {\em Subnetwork hierarchies of biochemical

327:   pathways}, Bioinformatics, {\bf 19} (2003), pp.~532--538.

328:

329: \bibitem{jeong:meta}

330: H.~Jeong, B.~Tombor, Z.~N. Oltvai, and A.-L. Barab\'{a}si, {\em The large-scale

331:   organization of metabolic networks}, Nature, {\bf 407} (2000), pp.~651--654.

332:

333: \bibitem{john:hcs}

334: S.~C. Johnson, {\em Hierarchical clustering schemes}, Psychometrika, {\bf 32}

335:   (1967), pp.~241--254.

336:

337: \bibitem{wit}

338: R.~{Overbeek {\it et al.}}, {\em {WIT}: Integrated system for high-throughput

339:   genome sequence analysis and metabolic reconstruction}, Nucleic Acids Res.,

340:   {\bf 28} (2000), pp.~123--125.

341:

342: \bibitem{patra}

343: S.~M. Patra and S.~Vishveshwara, {\em Backbone cluster identification in

344:   proteins by a graph theoretical method}, Biophys. Chem., {\bf 84} (2000),

345:   pp.~13--25.

346:

347: \bibitem{castel:comm}

348: F.~Radicchi, C.~Castellano, F.~Cecconi, V.~Loreto, and D.~Parisi, {\em Defining

349:   and identifying communities in networks}.

350: \newblock e-print cond-mat/0309488.

351:

352: \bibitem{bara:modhie}

353: E.~Ravasz, A.~L. Somera, D.~A. Mongru, Z.~N. Oltvai, and A.-L. Barab\'{a}si,

354:   {\em Hierarchical organization of modularity in metabolic networks}, Science,

355:   {\bf 297} (2002), pp.~1551--1555.

356:

357: \bibitem{schus:dec}

358: S.~Schuster, T.~Pfeiffer, F.~Moldenhauer, I.~Koch, and T.~Dandekar, {\em

359:   Exploring the pathway structure of metabolism: Decomposition into subnetworks

360:   and application to {Mycoplasma} pneumoniae}, Bioinformatics, {\bf 18} (2002),

361:   pp.~351--361.

362:

363: \bibitem{alon}

364: S.~Shen-Orr, R.~Milo, S.~Mangan, and U.~Alon, {\em Network motifs in the

365:   transcriptional regulation network of {E}scherichia coli}, Nature Genetics,

366:   {\bf 31} (2002), pp.~64--68.

367:

368: \end{thebibliography}

369:

370: \end{document}