1: \documentclass[12pt,a4paper,fleqn,twoside]{phart}
2: \usepackage{amssymb,amsmath,color}
3: \usepackage{graphicx,overcite,pxfonts}
4:
5: \definecolor{gry}{rgb}{0.3,0.3,0.3}
6: \renewcommand\leq\leqslant
7: \renewcommand\geq\geqslant
8: \renewcommand{\thefootnote}{\roman{footnote}}
9:
10: \begin{document}
11:
12: \thispagestyle{empty}
13:
14: \noindent\hspace{0.08\linewidth}
15: \begin{minipage}[l]{0.92 \linewidth}
16:
17: \vspace{3cm}
18:
19: \noindent\LARGE\textsf{\textbf{\textcolor{gry}{Discovery and analysis of\\
20: biochemical subnetwork hierarchies}}}
21:
22: \vspace{1cm}
23:
24: \noindent
25: \large\sffamily\textbf{Petter Holme}\newline Department of Physics, Ume{\aa}
26: University\newline 901$\,$87 Ume{\aa}, Sweden\vspace{0.35cm}
27:
28: \textbf{Mikael Huss}\newline SANS, NADA, Royal Institute of
29: Technology\newline100$\,$44 Stockholm, Sweden
30:
31: \vspace{1cm}
32:
33: \normalsize\noindent\textbf{Abstract}
34:
35: \rmfamily\noindent
36: The representation of a biochemical network as a graph is the
37: coarsest level of description in cellular biochemistry. By studying
38: the network structure one can draw conclusions on the large scale
organisation of the biochemical processes. We describe methods for
extracting hierarchies of subnetworks, and show how these can be
interpreted and further deconstructed to find autonomous
42: subnetworks. The large-scale organisation we find is characterised
43: by a tightly connected core surrounded by increasingly loosely
44: connected substrates.
45: \end{minipage}
46:
47:
48: \pagestyle{myheadings}
49: \markboth{Holme \& Huss\hspace{3.5mm}
50: \textit{Biochemical subnetwork
51: hierarchies}}{\textit{Biochemical subnetwork
52: hierarchies}\hspace{3.5mm} Holme \& Huss}
53:
54: \section{Introduction}
55:
56: At the coarsest level of description, cellular biochemistry can be
57: represented as a network of vertices (substrates) linked by chemical
reactions. For both conceptual and analytical purposes, the vastness
and complexity of these biochemical networks call for a division into
60: smaller subunits. This is nothing new---traditionally biochemists have
61: talked about functional subnetworks, the citric acid cycle being one
example, composed of biochemical pathways. As modern-day genomics
63: gives an increasingly comprehensive picture of the biochemical network
64: one would like to complement the traditional way of mapping out
65: subnetworks by objective graph theoretical methods. By such methods
we can address not only the question of what relevant subnetworks there
67: are, but also the hierarchical organisation of subnetworks (can
68: subnetworks be said to consist of smaller subnetworks, and so on), and
also more fundamental questions about the contexts in which the subnetwork
70: concept is relevant and when the biochemical circuitry is to be
71: considered as a functional whole.
72:
73: The graph-theoretical signature for a subnetwork is that it is
74: internally densely connected but has relatively few links to the rest
75: of the graph. Other methods for detecting
76: subnetworks~\cite{schus:dec,patra,john:hcs,bara:modhie} have been
77: based on local properties such as the number of reactions a substrate
78: takes part in, or the similarity of the neighbourhood. Since non-local
79: features can heavily affect network dynamics~\cite{holme:traffic},
80: one would prefer methods that take these into account. Here, we
81: discuss global algorithms for subnetwork detection, in particular
82: methods based on the betweenness centrality measure.
83:
84: \section{Preliminaries}
85:
86: \subsection{Biochemical networks as bipartite graphs}
87:
88: A \textit{bipartite} graph\footnote{Or, to be precise, a
89: \textit{two-mode representation} of a bipartite graph. The formal
90: definition of bipartiteness is just that a graph contains no odd
circuits.} consists of two types of vertices and links
that only go between vertices of different types.
93: We represent the biochemical networks as directed bipartite graphs
94: $G=(S,R,L)$ where $S$ is a set of vertices representing substrates, $R$
95: is a set of vertices representing chemical reactions, and $L$ is the
96: set of directed links---ordered pairs of one vertex in $S$ and one vertex
97: in $R$. The links are such that if the substrates $s_1,\cdots,s_n$ are
98: involved in a reaction $r\in R$ with products
99: ${s'}_1,\cdots,{s'}_{n'}\in S$, then we have
100: $(s_1,r),\cdots,(s_n,r)\in L$ and $(r,{s'}_1),\cdots, (r,{s'}_{n'})\in
101: L$. The number of links leading to a vertex is called
102: \textit{in-degree} and denoted $k_\mathrm{in}$.
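
As a minimal illustration (the reaction and the labels $s_1,\ldots,s_4$ and
$r_1$ are hypothetical, chosen only for this example), consider a single
reaction $r_1$ that converts $s_1$ and $s_2$ into $s_3$ and $s_4$. In the
notation above this would read
\begin{align*}
S &= \{s_1,s_2,s_3,s_4\}, \qquad R = \{r_1\},\\
L &= \{(s_1,r_1),\,(s_2,r_1),\,(r_1,s_3),\,(r_1,s_4)\},
\end{align*}
so that $k_\mathrm{in}(r_1)=2$, $k_\mathrm{in}(s_3)=k_\mathrm{in}(s_4)=1$ and
$k_\mathrm{in}(s_1)=k_\mathrm{in}(s_2)=0$.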
103:
104: \subsection{Betweenness centrality}
105:
106: Roughly speaking, the betweenness centrality~\cite{antonis} $C_B$ of a
107: vertex $v$ in an undirected graph is the number of shortest paths
between pairs of vertices that pass through $v$. For the
109: purposes of this work we are interested in reaction vertices that are
110: central for paths between metabolites or other molecules; thus we
111: restrict our definition of betweenness to the reaction vertices
112: only. The precise definition then becomes:
113: \begin{equation}
114: \label{eq:CB}
115: C_B(r)=\sum_{s\in S}\sum_{s'\in S\setminus\{s\}}\frac{\sigma_{ss'}(r)}
116: {\sigma_{ss'}}~,
117: \end{equation}
where $\sigma_{ss'}(r)$ is the number of shortest paths between $s$ and
$s'$ that pass through $r$, and $\sigma_{ss'}$ is the total number of
shortest paths between $s$ and $s'$. Since all substrates need to be
present for a reaction to occur, it is meaningful to rescale the
betweenness by the in-degree:
123: \begin{equation}
124: \label{eq:CBeff}
125: c_B(r)=C_B(r)/k_\mathrm{in}(r)~.
126: \end{equation}
We call $c_B$ the \textit{effective betweenness} of $r$.
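
As a worked sketch of Eqs.~\ref{eq:CB} and \ref{eq:CBeff}, consider a
hypothetical network with two reactions, $r_1$: $s_1+s_2\to s_3$ and
$r_2$: $s_3\to s_4$. We assume here that shortest paths follow the
direction of the links and that ordered pairs $(s,s')$ with no connecting
path contribute zero to the sum; other conventions are possible. The
connected ordered pairs are then $(s_1,s_3)$, $(s_2,s_3)$, $(s_1,s_4)$,
$(s_2,s_4)$ and $(s_3,s_4)$, each joined by a unique shortest path, which
gives
\begin{align*}
C_B(r_1) &= 4, & k_\mathrm{in}(r_1) &= 2, & c_B(r_1) &= 4/2 = 2,\\
C_B(r_2) &= 3, & k_\mathrm{in}(r_2) &= 1, & c_B(r_2) &= 3/1 = 3.
\end{align*}
In this example the rescaling changes the ranking: $r_1$ has the larger
$C_B$, but $r_2$ has the larger effective betweenness.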
128:
129: \subsection{Girvan and Newman's algorithm}
130:
131: The algorithm for tracing subnetworks we use is due to Girvan and
132: Newman (GN)~\cite{gir:alg}, but in a form adapted to bipartite
133: representations of biochemical networks as presented in
134: Ref.~\citen{hhj:sub}. The idea of the algorithm is based on the fact
135: that vertices that lie between densely connected areas have high
betweenness, and vice versa. Thus by successively removing reaction
vertices with high effective betweenness one will see the network
disintegrate into subnetworks of decreasing size. Furthermore, the smaller
subnetworks remaining after many iterations will be perfectly contained in
subnetworks found earlier in the execution of the algorithm; thus the method
produces a full hierarchy of subnetworks.
142:
143: The precise definition of the algorithm is to repeat the following
144: steps until no reaction vertices remain:
145: \begin{enumerate}
146: \item Calculate the effective betweenness $c_B(r)$ for all reaction
147: vertices.
148: \item\label{step2} Remove the reaction vertex with highest effective
149: betweenness and all its in- and out-going links.
150: \item\label{step3} Save information about the current state of the
151: network.
152: \end{enumerate}
If several reaction vertices share the highest $c_B$ in step~\ref{step2},
we remove all of them at once. A C implementation of this algorithm
155: along with test data sets can be found at
156: \texttt{www.tp.umu.se/forskning/networks/meta/}.
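
The steps above can be traced by hand on a small hypothetical network with
three reactions, $r_1$: $s_1+s_2\to s_3$, $r_2$: $s_3\to s_4$ and
$r_3$: $s_4\to s_5$. The sketch assumes that shortest paths follow the link
directions and that unconnected pairs contribute zero to $C_B$. In the first
iteration
\begin{align*}
c_B(r_1) &= 6/2 = 3, & c_B(r_2) &= 6/1 = 6, & c_B(r_3) &= 4/1 = 4,
\end{align*}
so $r_2$ is removed and the network splits into the clusters
$\{s_1,s_2,s_3\}$ and $\{s_4,s_5\}$. Recalculating on the remaining network
gives
\begin{align*}
c_B(r_1) &= 2/2 = 1, & c_B(r_3) &= 1/1 = 1,
\end{align*}
a tie, so both reactions are removed at once and all substrates become
isolated. The saved states form the hierarchy
$\{s_1,\ldots,s_5\}\to\{s_1,s_2,s_3\},\{s_4,s_5\}\to$ five single substrates.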
157:
158: \section{A case-study: \textit{T.\ pallidum}}
159:
\begin{figure}
\centering{\resizebox*{\textwidth}{!}{\includegraphics{tr2.eps}}}
\caption{The hierarchical clustering tree for the metabolic network of
\textit{T.\ pallidum}. The inset shows substrate names for a blow-up
of the tree (indicated by black).}
\label{fig:tree}
\end{figure}
166:
167:
168: To illustrate the output of the algorithm, and how it can be
169: post-processed, we choose the metabolic network of \textit{T.\
pallidum}, the causative agent of syphilis, as obtained from
171: the WIT database~\cite{wit}\footnote{This is the same data as used in
172: Refs.~\citen{jeong:meta,bara:modhie}, and thus slightly outdated, but
173: it should work well for illustrating the method.}.
174:
175: \subsection{The large scale shape of the hierarchy trees}
176:
177: The subnetwork hierarchy of \textit{T.\ pallidum}'s metabolic network
178: is presented as a tree (a so-called \textit{dendrogram}) in
179: Fig.~\ref{fig:tree}. The end-points at the base of the
180: dendrogram represent the substrate vertices of the metabolic
181: network. The vertical dimension represents the hierarchical level---if
a horizontal line is drawn across the dendrogram, the vertices
connected below the line belong to the same cluster (connected
subgraph) at that particular level of the hierarchical
organisation. The further down the tree two vertices are connected,
the more tightly connected they are in the biochemical network. If one
substrate is to be converted to another that is separated from the
first one high up in the dendrogram, then a long chain of
reactions is needed. If, on the other hand, the two vertices are
connected near the bottom of the dendrogram, then they probably are
both present in one or more reactions.
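
As a toy illustration of how such a tree is read (a hypothetical
five-substrate hierarchy, not taken from Fig.~\ref{fig:tree}), suppose the
clusters at three successive levels are
\begin{align*}
\text{top:}\quad & \{s_1,s_2,s_3,s_4,s_5\},\\
\text{middle:}\quad & \{s_1,s_2,s_3\},\ \{s_4,s_5\},\\
\text{bottom:}\quad & \{s_1\},\ \{s_2\},\ \{s_3\},\ \{s_4\},\ \{s_5\}.
\end{align*}
A horizontal line drawn at the middle level yields two clusters. The pair
$s_4$, $s_5$ is still connected at that level, whereas $s_1$ and $s_5$ are
connected only at the top, so $s_4$ and $s_5$ would be read as the more
tightly connected pair.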
192:
The most striking feature of Fig.~\ref{fig:tree}, and indeed of the trees for all
43 organisms of Ref.~\citen{jeong:meta}, is that the network has one
dominating cluster at most levels of the hierarchy. As the algorithm
196: proceeds (one goes from top to bottom of the dendrogram) a few
197: vertices at a time peel off from the largest connected cluster. The
emerging picture is that the large scale structure of the metabolic
network has a tightly connected core and increasingly loosely
connected outer `shells.' A few rather well-defined subnetworks are
identified, however, for example the subnetworks of Fig.~\ref{fig:tree}
202: containing reactions associated with purine metabolism and
203: pyruvate/acetyl-CoA conversion.
204:
205: \subsection{Criteria for identifying subnetworks}
206:
We can identify subnetworks by looking at the hierarchy tree: if a
208: subnetwork is isolated at some level (like the
209: \textit{N}-acetyl-D-glu\-cos\-amine 1-phos\-phate, D-glu\-cos\-amine
210: 1-phos\-phate, dihydrolipoamide,
211: \textit{S}-ace\-tyl\-di\-hydro\-lipo\-amide, CoA, and acetyl-CoA
212: network of Fig.~\ref{fig:tree} at level $h$) then it is comparatively
well connected within itself relative to its surroundings. If the cluster is
214: isolated close to the top of the dendrogram, then it is not very
215: entangled in the wirings of metabolic pathways, and likely to be a
216: reasonably autonomously functioning module. Can we establish objective
217: criteria for subnetworks to be regarded as meaningful modules?
For example, Ref.~\citen{bara:modhie} detects modules in an indirect
way using a very weak criterion: roughly speaking, that substrates
are likely to belong to the same module if they appear in reactions
involving the same set of other substrates. To identify groups in
social networks, Radicchi \textit{et al.}\cite{castel:comm}\ suggested two
criteria that, adapted to biochemical networks, become as
follows. If, during the iterations of the GN algorithm, an
isolated vertex set $S'\subset S$ fulfils the following criterion, it
is said to be a \textit{weak community}:
227: \begin{equation}\label{eq:weak}
228: \sum_{s\in S'}K_\mathrm{in}(s) > \sum_{s\in S'}K_\mathrm{out}(s)~,
229: \end{equation}
230: and a \textit{strong community} if:
231: \begin{equation}\label{eq:strong}
232: K_\mathrm{in}(s) > K_\mathrm{out}(s) \mbox{~for all~} s\in S'~,
233: \end{equation}
where $K_\mathrm{in}(s)$ is the number of $s'\in S'$ that are
products of a reaction involving the substrate $s$, and
$K_\mathrm{out}(s)$ is the number of $s'\in S\setminus S'$ that are
products of a reaction involving the substrate $s$. Loosely
238: speaking Eq.~\ref{eq:weak} means that there are, on average, more
239: feedback pathways back into $S'$ than pathways leading out to the
240: rest of the network. If the strong condition (Eq.~\ref{eq:strong})
241: holds, then products of all reactions involving substrates $s\in S'$
are more likely to belong to $S'$ than not. It turns out that
Eq.~\ref{eq:strong} is fulfilled for almost no cluster at any
but the lowest levels of the hierarchy (closest to the bottom of the
dendrogram). Eq.~\ref{eq:weak} is, on the other hand, fulfilled for
246: the largest cluster throughout all iterations of the algorithm. (This
247: picture persists for all 43 WIT organisms studied in
Ref.~\citen{jeong:meta}.) That the subnetworks of cellular biochemistry
almost completely lack the community structure of social networks, or the
component structure of electronic devices, does not necessarily mean
251: that it is futile to talk of biochemical modules. For a subnetwork to
252: have some degree of autonomy it has to have some self-regulatory
253: function, and thus a feedback loop. To implement this idea, consider
the subnetworks whose substrate vertex set $S'$ fulfils:
255: \begin{equation}
L(S') \geq \Lambda|S'|~,\label{eq:l}
257: \end{equation}
where $L(S')$ is the number of vertices in $S'$ that lie on an
elementary cycle (a closed non-self-intersecting path) consisting only of vertices
in $S'$ and of length larger than three, $|S'|$ is the number of
261: vertices in $S'$, and the parameter $\Lambda\in[0,1]$ is the required
262: fraction of feedback loop vertices. We test the three cases where
263: $\Lambda$ equals $0$, $1/2$ and $1$, corresponding to the subnetwork
264: having at least one feedback loop, more than half of the substrates,
265: or every substrate participating in a feedback loop,
respectively. The largest cluster close to the top of the dendrogram
quite naturally fulfils Eq.~\ref{eq:l} when $\Lambda$ is small (in our
case $0$ or $1/2$); we therefore detect subnetworks starting from the
bottom of the dendrogram and going upwards. With each one of these
criteria we find non-trivial subnetworks. Of the subnetworks of
Fig.~\ref{fig:tree}, the hardest requirement, $\Lambda = 1$, detects two
relevant subnetworks: the one containing CoA and the innermost one
containing orthophosphate, which consists of $\alpha$-D-ribose 1-phosphate,
$\alpha$-D-ribose 1-pyrophosphate, adenine, adenosine, hypoxanthine,
inosine, and orthophosphate. The extended orthophosphate subnetwork
still connected at level $h$ (also containing e.g.\ guanine) is
277: regarded as a valid subnetwork with $\Lambda = 1/2$, but not with
278: $\Lambda = 1$. To assign an appropriate $\Lambda$ requires a careful
279: look at the problem in question, but as a rule of thumb $\Lambda$
280: close to one seems sensible for most applications.
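
To make the criteria concrete, consider a hypothetical cluster
$S'=\{s_1,s_2,s_3,s_4\}$ with one outside substrate $s_5$ and the reactions
$r_1$: $s_1\to s_2$, $r_2$: $s_2\to s_3$, $r_3$: $s_3\to s_4$ and
$r_4$: $s_4\to s_1+s_5$. The sketch assumes that $K_\mathrm{in}(s)$ and
$K_\mathrm{out}(s)$ count the products of reactions in which $s$ is consumed,
evaluated in the full network. Then
\begin{align*}
K_\mathrm{in}(s_i) &= 1,\quad K_\mathrm{out}(s_i) = 0 \qquad (i=1,2,3),\\
K_\mathrm{in}(s_4) &= 1,\quad K_\mathrm{out}(s_4) = 1,
\end{align*}
so $\sum_{s\in S'}K_\mathrm{in}(s) = 4 > 1 = \sum_{s\in S'}K_\mathrm{out}(s)$
and Eq.~\ref{eq:weak} holds, whereas Eq.~\ref{eq:strong} fails because
$K_\mathrm{in}(s_4) = K_\mathrm{out}(s_4)$. All four substrates lie on the
cycle $s_1\to r_1\to s_2\to r_2\to s_3\to r_3\to s_4\to r_4\to s_1$, so
$L(S') = 4 = |S'|$: every substrate participates in a feedback loop, which
is the case $\Lambda = 1$ of Eq.~\ref{eq:l}.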
281:
282: \section{Conclusions}
283:
284: Finding subnetworks of cellular biochemistry is an important task for
285: modern bioinformatics, for both conceptual and analytical
286: purposes. There are two general ways to proceed, either one searches
287: for small building blocks (cf.\ Ref.~\citen{alon}) or one tries to
288: deconstruct the whole network. Our approach falls into the second
289: category. By adapting an algorithm~\cite{gir:alg} for subnetwork
290: detection to biochemical networks we construct hierarchy trees,
291: dendrograms, representing the whole hierarchical organisation of
292: subnetworks of biochemical pathways. We find that biochemical networks
293: cannot be divided into subnetworks as easily as e.g.\ acquaintance
networks and electronic circuits~\cite{hhj:sub}. Against this
295: backdrop it is not surprising that some recent criteria
296: (Eqs.~\ref{eq:weak} and \ref{eq:strong}) for extracting meaningful
social subnetworks fail to give non-trivial results. As a remedy we
propose conditions based on the presence of feedback loops within a
subnetwork. The above methods are illustrated by an application to the
metabolic network of \textit{T.\ pallidum}; we have also tested them on
the metabolic and whole-cellular networks (containing e.g.\
transmembrane transport and signal transduction) of 42 other organisms
of the WIT database~\cite{wit}, and obtained sensible output.
304:
305: \subsection*{Acknowledgements}
306: Thanks are due to Claudio Castellano, Hawoong Jeong and Petter
Minn{\-}hagen. P.H.\ was partially supported by the Swedish Research
308: Council through contract no.\ 2002-4135.
309:
310: \begin{thebibliography}{10}
311:
312: \bibitem{antonis}
313: J.~M. Anthonisse, {\em The rush in a directed graph}, Tech. Rep. BN 9/71,
314: Stichting Mathematisch Centrum, 1971.
315:
316: \bibitem{gir:alg}
317: M.~Girvan and M.~E.~J. Newman, {\em Community structure in social and
318: biological networks}, Proc. Natl. Acad. Sci. USA, {\bf 99} (2002),
319: pp.~7821--7826.
320:
321: \bibitem{holme:traffic}
322: P.~Holme, {\em Congestion and centrality in traffic flow on complex networks},
323: Adv. Complex Syst., {\bf 6} (2003), pp.~163--176.
324:
325: \bibitem{hhj:sub}
326: P.~Holme, M.~Huss, and H.~Jeong, {\em Subnetwork hierarchies of biochemical
327: pathways}, Bioinformatics, {\bf 19} (2003), pp.~532--538.
328:
329: \bibitem{jeong:meta}
330: H.~Jeong, B.~Tombor, Z.~N. Oltvai, and A.-L. Barab\'{a}si, {\em The large-scale
331: organization of metabolic networks}, Nature, {\bf 407} (2000), pp.~651--654.
332:
333: \bibitem{john:hcs}
334: S.~C. Johnson, {\em Hierarchical clustering schemes}, Psychometrika, {\bf 32}
(1967), pp.~241--254.
336:
337: \bibitem{wit}
338: R.~{Overbeek {\it et al.}}, {\em {WIT}: Integrated system for high-throughput
339: genome sequence analysis and metabolic reconstruction}, Nucleic Acids Res.,
340: {\bf 28} (2000), pp.~123--125.
341:
342: \bibitem{patra}
343: S.~M. Patra and S.~Vishveshwara, {\em Backbone cluster identification in
344: proteins by a graph theoretical method}, Biophys. Chem., {\bf 84} (2000),
345: pp.~13--25.
346:
347: \bibitem{castel:comm}
348: F.~Radicchi, C.~Castellano, F.~Cecconi, V.~Loreto, and D.~Parisi, {\em Defining
349: and identifying communities in networks}.
350: \newblock e-print cond-mat/0309488.
351:
352: \bibitem{bara:modhie}
353: E.~Ravasz, A.~L. Somera, D.~A. Mongru, Z.~N. Oltvai, and A.-L. Barab\'{a}si,
354: {\em Hierarchical organization of modularity in metabolic networks}, Science,
{\bf 297} (2002), pp.~1551--1555.
356:
357: \bibitem{schus:dec}
358: S.~Schuster, T.~Pfeiffer, F.~Moldenhauer, I.~Koch, and T.~Dandekar, {\em
359: Exploring the pathway structure of metabolism: Decomposition into subnetworks
360: and application to {Mycoplasma} pneumoniae}, Bioinformatics, {\bf 18} (2002),
361: pp.~351--361.
362:
363: \bibitem{alon}
364: S.~Shen-Orr, R.~Milo, S.~Mangan, and U.~Alon, {\em Network motifs in the
365: transcriptional regulation network of {E}scherichia coli}, Nature Genetics,
366: {\bf 31} (2002), pp.~64--68.
367:
368: \end{thebibliography}
369:
370: \end{document}