1: \documentclass{bioinfo}
2: \copyrightyear{2007}
3: \pubyear{2007}
4:
5: \begin{document}
6: \firstpage{1}
7: \title[Short title]{Detection of the dominant direction of
8: information flow in densely interconnected regulatory networks}
9: %Making network acyclic by removing a minimal number of links}
10: \author[Ispolatov, I. and Maslov, S.]{I. Ispolatov\,$^{\rm a}$\footnote{Permanent address:
11: Departamento de Fisica, Universidad de Santiago de Chile,
12: Casilla 302, Correo 2, Santiago, Chile}, S. Maslov\,$^{\rm b}$
13: %\footnote{to whom
14: %correspondence should be addressed}
15: %, Yuryev\,$^{\rm c}$}
16: }
17: \address{$^{\rm a}$
18: %,$^{\rm c}$
19: Ariadne Genomics Inc., 9430 Key West Ave. Suite 113
20: Rockville, MD 20850, USA, $^{\rm b}$ Department of Condensed Matter Physics
21: and Materials Science,
22: Brookhaven National Laboratory, Upton, New York 11973, USA}
23: \maketitle
24:
25: \begin{abstract}
26:
27: \section{Motivation:}
28: Finding the dominant direction of flow of
29: information in densely interconnected regulatory or
30: signaling networks is required in many applications in
31: computational biology and neuroscience.
32: This is achieved by first identifying and removing links which close up
33: feedback loops in the original network and hierarchically
34: arranging nodes in the remaining network. In mathematical language
35: this corresponds to a problem of making a graph acyclic
36: by removing as few links as possible and thus
altering the original graph in the least possible way. In practically all
applications the exact solution of this problem requires an enumeration of all
combinations of removed links, which is computationally intractable.
40: \section{Results:}
We introduce and compare two algorithms: a deterministic
``greedy'' algorithm that preferentially cuts the links that participate in the
largest number of
feedback cycles, and a probabilistic one based on simulated
45: annealing of a hierarchical layout of the network which minimizes
46: the number of ``backward'' links going from lower to higher hierarchical levels.
47: We find that the annealing algorithm outperforms the deterministic one in terms of
48: speed, memory requirement, and the actual number of removed links.
Implications for systems biology and directions for further research are
50: discussed.
\section{Availability:} Source code of the Fortran 90 and Matlab implementations of these
two algorithms is available from the authors upon request.
\section{Contact:} \href{mailto:slava@ariadnegenomics.com}{slava@ariadnegenomics.com},
\href{mailto:maslov@bnl.gov}{maslov@bnl.gov}
55: \end{abstract}
56:
57: \section{Introduction}
58: During the last several years, a substantial amount of
59: information on large-scale structure of intracellular regulatory
60: networks has been accumulated.
However, our understanding of how these networks manage to
function in a robust and specific manner has lagged
behind the sheer rate of data acquisition. The fact that these
networks are frequently visualized as a giant ``hairball'' (Fig. \ref{fig:01})
consisting of a multitude of edges linking most constituent
protein nodes to each other serves as a striking illustration of
the complexity of the issue at hand.
68: \begin{figure}[!tpb]
69: \centerline{\includegraphics[width=3in,angle=0]{fig1.eps}}
70: \centerline{\includegraphics[width=3in,angle=0]{fig1a.eps}}
\caption{A part of the post-translational regulatory network in
72: human shown here includes 1671 automatically and manually curated
73: protein modification interactions (phosphorylation, proteolytic cleavage,
74: etc.) between 732 proteins from our ResNet database
75: \citealp{Resnet}.
76: Panel A contains the ``hairball'' visualization of the
77: network structure emphasizing interconnections between
78: individual pathways. Red edges lie within the strongly connected
79: component of this network consisting of 107 proteins that could
all be linked to each other by a path in both directions, so that
any two of these proteins are simultaneously upstream and downstream
of each other. In Panel B we optimally distribute
proteins over a number of hierarchical levels.
Red arrows represent 208 putative feedback links going from lower
levels of the hierarchy to higher ones, while yellow
arrows represent 512 feed-forward links jumping over one or more
hierarchical levels.
88: Only proteins and links reachable from one of the
89: 71 receptors placed at the top hierarchical level
90: were included.
91: }\label{fig:01}
92: \end{figure}
93:
94: To understand the functioning or even to efficiently
95: visualize a densely interconnected directed network it is
96: desirable to determine the dominant direction of information flow
97: and to identify links that go against this flow and thus close feedback loops.
98: %Regulatory edges in such networks might of rather different nature
99: %such correspond to protein modifications
100: %and transcriptional regulations for intracellular bio-molecular
101: %networks, neuron-to-neuron connections in neuronal networks, etc.
102: Ordering a network with respect to the dominant direction of information flow
103: can help to determine its previously unknown inputs and
104: outputs, to
105: %such as receptors and transcription factors in protein graphs,
106: track back hidden sources of perturbations based
107: on their observable downstream effects, etc.
A simple-minded hierarchical layout of a densely interconnected
network is often impossible due to the ubiquitous presence of feedback loops.
Indeed, all nodes in a strongly connected component of a network
are by definition simultaneously upstream and downstream of each
other. However, if most feedback loops are closed by relatively
few feedback signaling links, the dominant direction of information flow could
still be reconstructed from the network topology alone.
Identification and removal of these relatively infrequent feedback
links would enable one to perform a hierarchical layout of the remaining
acyclic network, which still sufficiently resembles the original one.
118:
In this work we consider the problem of identifying the minimum set of
links whose removal would render a graph acyclic. In the next section we
introduce two rather different algorithms that approximately
accomplish this goal, a deterministic ``greedy''
algorithm and a probabilistic Metropolis annealing,
and compare their performance. We find that the probabilistic algorithm
outperforms the deterministic one: it removes fewer
links, requires less memory, and runs faster.
A simple visual example is provided for
a situation in which the deterministic algorithm is non-optimal.
129: Following that, we discuss biological implications and applications of our
130: findings as well as how additional constraints such as {\it a priori}
131: knowledge of the function and therefore hierarchical position of certain nodes
132: may affect the algorithm performance.
133:
134: \section{Approach}
135:
136: Consider a graph of $N$ vertices labeled as $1, 2, 3, \ldots, N$ and
137: $L$ directed links labeled by pairs of vertices they connect,
$l_i \equiv (n_i, m_i)$. The goal is to make the graph acyclic, or
feedback-free, by removing as few links as possible.
140:
An exact
way to solve this problem is to sample all possible combinations
of links to be removed, starting with individual links,
then pairs of links, etc., until the first acyclic graph is obtained.
Evidently, if a removal of $l$ links finally yields an acyclic graph,
such sampling would require checking up to
$\sum_{i=1}^l \binom{L}{i}$ networks for cycles.
For the biologically relevant values
of $L\sim 10^3 - 10^4$ and $l\sim 10 - 10^2$ this approach is clearly
infeasible.\footnote{From the identity $\sum_{i=0}^{L}\binom{L}{i}=2^{L}$
and the symmetry $\binom{L}{i}=\binom{L}{L-i}$ it follows that
$\sum_{i=1}^{L/2}\binom{L}{i}\approx 2^{L-1}$, so that even for a fairly modest $L=10^2$ and
$l=L/2$ the number of such attempts is $\sim 10^{30}$.}
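
The exhaustive search described above can be sketched in a few lines of Python. This is only an illustration and not the F90 or Matlab code mentioned in the abstract: the function names, the acyclicity test (Kahn's topological sort) and the 5-vertex test graph are all assumptions introduced here. Links are assumed to be distinct pairs.

\begin{verbatim}
```python
from itertools import combinations
from math import comb


def is_acyclic(n_vertices, links):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    indegree = {v: 0 for v in range(1, n_vertices + 1)}
    successors = {v: [] for v in range(1, n_vertices + 1)}
    for n, m in links:
        successors[n].append(m)
        indegree[m] += 1
    queue = [v for v, d in indegree.items() if d == 0]
    visited = 0
    while queue:
        v = queue.pop()
        visited += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    # Vertices on a cycle never reach in-degree zero.
    return visited == n_vertices


def exact_feedback_links(n_vertices, links):
    """Remove 0, 1, 2, ... links until the first acyclic graph appears."""
    for k in range(len(links) + 1):
        for removed in combinations(links, k):
            kept = [link for link in links if link not in removed]
            if is_acyclic(n_vertices, kept):
                return list(removed)


def number_of_checks(L, l):
    """Upper bound on graphs tested: sum_{i=1}^{l} binom(L, i)."""
    return sum(comb(L, i) for i in range(1, l + 1))


# Hypothetical 5-vertex test graph (an assumption, not the paper's data).
links = [(1, 2), (2, 3), (3, 2), (2, 4), (4, 2), (2, 5), (5, 2),
         (3, 1), (4, 1), (5, 1)]
print(len(exact_feedback_links(5, links)))   # -> 3
print(number_of_checks(100, 50))
```
\end{verbatim}

Even on this toy graph the search already tests every single link and every pair of links before it reaches the first acyclic triple, which is why the approach does not scale.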
153:
154: \subsection{Greedy algorithm}
155:
156: A natural reduction of such exact enumeration approach is a ``greedy''
157: algorithm which performs the ``steepest descent'' in the number of cycles.
158: We implemented the following realization of such link removal algorithm:
159: \begin{itemize}
160: \item By enumerating all cycles in a graph, each link is assigned a
161: score equal to the number of cycles it is a member of.
\item The link with the
highest score is removed. When more than one link has the same
highest score, the link to be removed
is selected at random among the highest-scored ones.
167: \item This procedure of cycle enumeration and link removal is repeated until
168: no cycles are found.
\end{itemize}
170:
171: The cycle enumeration can be implemented by following
172: all paths that originate from a given
173: vertex and recording only the cycles that
174: come back to this vertex. The procedure is repeated for each of the $N$ graph
175: vertices: evidently, each cycle of length $C$ is counted $C$ times and a
176: proper normalization is performed.
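
A minimal Python sketch of the three steps above is given below, under stated assumptions. To keep it short, each cycle is recorded only once, from its smallest vertex, so no division by the cycle length $C$ is needed; this is a shortcut introduced here in place of the count-and-normalize procedure described above. The function names and the two small test graphs are hypothetical, and vertices are assumed to be labeled by integers.

\begin{verbatim}
```python
import random


def enumerate_cycles(vertices, links):
    """Return every simple directed cycle once, as a tuple of its links."""
    successors = {v: [] for v in vertices}
    for n, m in links:
        successors[n].append(m)
    cycles = []

    def extend(start, path):
        for w in successors[path[-1]]:
            if w == start:
                closed = path + [start]
                cycles.append(tuple(zip(closed, closed[1:])))
            elif w > start and w not in path:
                # Visit only vertices larger than `start`, so each cycle
                # is found once, from its smallest vertex.
                extend(start, path + [w])

    for v in vertices:
        extend(v, [v])
    return cycles


def greedy_feedback_links(vertices, links, rng=random):
    """Repeatedly cut the link that lies on the largest number of cycles."""
    remaining = list(links)
    removed = []
    while True:
        cycles = enumerate_cycles(vertices, remaining)
        if not cycles:
            return removed
        score = {}
        for cycle in cycles:
            for link in cycle:
                score[link] = score.get(link, 0) + 1
        best = max(score.values())
        candidates = [link for link, s in score.items() if s == best]
        cut = rng.choice(candidates)   # ties are broken at random
        remaining.remove(cut)
        removed.append(cut)


# Two hypothetical test graphs (assumed here for illustration only).
small = [(1, 2), (2, 3), (3, 1), (1, 3)]
hub = [(1, 2), (2, 3), (3, 2), (2, 4), (4, 2), (2, 5), (5, 2),
       (3, 1), (4, 1), (5, 1)]
print(greedy_feedback_links([1, 2, 3], small))        # -> [(3, 1)]
print(len(greedy_feedback_links(range(1, 6), hub)))   # -> 4
```
\end{verbatim}

In the second graph the link $(1,2)$ lies on three cycles and is cut first, after which three 2-node cycles still have to be broken one by one, although cutting $(2,3)$, $(2,4)$ and $(2,5)$ alone would have sufficed.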
177:
An example of a network on which the greedy algorithm performs flawlessly
is shown in Figure \ref{fig:02}.
180:
181: \begin{figure}[!tpb]
182: \centerline{\includegraphics[width=2in,angle=0]{fig2.eps}}
183: \caption{
Removal of the single link $(3,1)$ makes this 3-vertex graph
185: acyclic.
186: }\label{fig:02}
187: \end{figure}
188:
Here the link $(3,1)$ carries the maximum score of 2. Removal of
this link indeed makes the graph acyclic, while removal of any link other than
$(3,1)$ would require a subsequent removal of a second link to achieve
the same goal. However, one would
suspect that, like any ``steepest descent'' method, the proposed greedy
algorithm, which performs a sometimes near-sighted local one-step optimization,
may miss the globally optimal solution. This is indeed often the case for
bigger
and more complex graphs; a fairly simple example is given in
Fig. \ref{fig:03}.
199:
200: \begin{figure}[!tpb]
201: \centerline{\includegraphics[width=3in,angle=0]{fig3.eps}}
\caption{An example of a network where the greedy algorithm fails to
determine the optimal solution. The link $(1,2)$ carries the highest score of 3
and thus is cut first. However, three 2-node cycles
$\{2,3\}$, $\{2,4\}$, and $\{2,5\}$ remain to be eliminated, after which
the number of removed links becomes 4. The optimal solution would
be to cut only three links, $(2,3)$, $(2,4)$, and $(2,5)$, each carrying a
score of 2. This optimal solution has almost always been found by the annealing
algorithm.
210: }\label{fig:03}
211: \end{figure}
212:
213:
214: \subsection{Simulated annealing network ordering}
215:
The task of finding the minimum number of links whose removal makes the
graph acyclic can be interpreted as an optimization problem and
tackled by probabilistic methods such as simulated annealing.
Evidently, there exists more than one way to define the optimization function;
after exploring several possibilities we settled on the following one:
221: \begin{itemize}
222: \item For a given network,
a set of $M$ levels is introduced ($M\leq N$; in practice $M\ll N$ and is of
the order of the graph diameter).
Initially, all nodes are distributed over the levels randomly.
226: \item For a particular distribution of nodes on levels, the
227: number of links
228: that go opposite to the hierarchy, that is, from a lower level
229: to the same or a higher one, is declared to be the energy $E$ of the
230: distribution, or the
231: optimization function.
232: \item A node and its new level are selected at random. A difference in energy
233: $\Delta E$
234: that would occur if the node were moved to the new level is calculated. The
node is moved to this new level with the probability
$\min\{1, \exp(-\Delta E/T)\}$, where $T$ is the temperature.
237: \item After the network has been sampled a sufficient number of times (of the
238: order of $N \times M$), the temperature is reduced by some factor, usually
239: 0.9. Initially, the temperature is set sufficiently high, usually of the
240: order of the average node degree $L/N$, to allow un-obstructed level
241: changes.
242: \item When the temperature becomes low enough to inhibit any level changes,
243: the remaining ascending and in-level links are declared feedbacks and
244: removed.
\item The whole procedure can be repeated several times to check for
consistency in the assignment of feedback links and to find the
solution with the smallest number of removed links.
248: \end{itemize}
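
The steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the stopping temperature \texttt{t\_min}, the fixed random seed, and the convention that level 0 is the top of the hierarchy are all choices made here, and the test graph is hypothetical.

\begin{verbatim}
```python
import math
import random


def anneal_levels(n_vertices, links, n_levels,
                  cooling=0.9, t_min=0.01, seed=0):
    """One annealing run; returns (level of each vertex, feedback links)."""
    rng = random.Random(seed)
    # Level 0 is the top of the hierarchy; a link (n, m) is hierarchical
    # only if level[n] < level[m] (an indexing convention assumed here).
    level = {v: rng.randrange(n_levels) for v in range(1, n_vertices + 1)}
    incident = {v: [] for v in level}
    for n, m in links:
        incident[n].append((n, m))
        if m != n:
            incident[m].append((n, m))

    def local_energy(v):
        # Counter-hierarchical links touching v; only these can change
        # when v is moved, so Delta E needs no sum over the whole graph.
        return sum(1 for n, m in incident[v] if level[n] >= level[m])

    temperature = len(links) / n_vertices        # of the order of L/N
    while temperature > t_min:
        for _ in range(n_vertices * n_levels):   # of the order of N x M
            v = rng.randint(1, n_vertices)
            old = level[v]
            before = local_energy(v)
            level[v] = rng.randrange(n_levels)
            delta = local_energy(v) - before
            # Metropolis rule: accept with probability min{1, exp(-dE/T)}.
            if delta > 0 and rng.random() >= math.exp(-delta / temperature):
                level[v] = old                   # move rejected
        temperature *= cooling
    feedback = [(n, m) for n, m in links if level[n] >= level[m]]
    return level, feedback


# Hypothetical 5-vertex test graph (an assumption, not the paper's data).
links = [(1, 2), (2, 3), (3, 2), (2, 4), (4, 2), (2, 5), (5, 2),
         (3, 1), (4, 1), (5, 1)]
level, feedback = anneal_levels(5, links, n_levels=5)
print(len(feedback), "links declared feedback")
```
\end{verbatim}

Whatever the outcome of a particular run, every link that is kept goes strictly down the hierarchy, so the remaining graph is acyclic by construction; repeated runs with different seeds differ only in how many links are removed.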
A level-change event and the associated energy difference
are illustrated in Fig. \ref{fig:03d}.
251: \begin{figure}[!tpb]
252: \centerline{\includegraphics[width=3in,angle=0]{fig3d.eps}}
\caption{Node 1, with two incoming links and one outgoing link, is
selected to move from its current position on
level $j$ to a new position on level $j+2$. The associated energy difference is
$\Delta E = -1 -1 +1 = -1$, where the two $-1$ contributions come from making
the $(2,1)$ and $(3,1)$ links hierarchical and the single $+1$ contribution comes
from turning the link $(1,4)$ from hierarchical to non-hierarchical.
259: }\label{fig:03d}
260: \end{figure}
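
In symbols, the energy may be written as in the following sketch. It assumes, as the move in Fig. \ref{fig:03d} suggests, that the level index $h_v$ of a node $v$ grows down the hierarchy; the symbols $h_v$ and $\theta$ are notation introduced here for illustration only.
\[
E=\sum_{i=1}^{L}\theta\left(h_{n_i}-h_{m_i}\right),\qquad
\theta(x)=\left\{\begin{array}{ll}1, & x\ge 0\\ 0, & x<0\end{array}\right.
\]
Moving a node changes only the terms of the links attached to it. In Fig. \ref{fig:03d} the terms of $(2,1)$ and $(3,1)$ drop from 1 to 0 while the term of $(1,4)$ rises from 0 to 1, so that
\[
\Delta E=(0-1)+(0-1)+(1-0)=-1,
\]
and the move is accepted with probability $\min\{1,\exp(1/T)\}=1$.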
261:
262: A useful property of this algorithm is that in addition to making a network
263: acyclic, it also produces a hierarchical layout. The
264: number of levels $M$ could be fixed by the requirements for such layout.
Otherwise, $M$ could be determined self-consistently, by observing when the
number of counter-hierarchical links stops decreasing as
the number of levels increases. This is illustrated in Fig. \ref{fig:04},
268: where a plot of the
269: number of non-hierarchical links vs number of levels is presented for the
270: human protein phosphorylation network.
271:
272: \begin{figure}[!tpb]
273: \centerline{\includegraphics[width=3in,angle=0]{fig4.eps}}
\caption{The number of non-hierarchical links vs the number of levels
$M$ in the annealing layout of the combined (a union of the \citealp{Peri2003}
and \citealp{Resnet} datasets) protein phosphorylation network in human cells.
The network consists of $L=2880$ links and $N=1297$ nodes (proteins).
The nodes with zero in-degree and zero out-degree are always put on the top
and bottom levels, respectively. The leftmost data point corresponds to a
single intermediate level (3 levels total); the number of non-hierarchical
links clearly reaches its minimum of 59 links for $M\ge 18$.
282: }\label{fig:04}
283: \end{figure}
284:
285: \section{Discussion}
In the previous section we introduced two algorithms intended to make a
network acyclic by removing the smallest number of links. The stochastic
simulated annealing level-ordering algorithm outperforms the deterministic
greedy algorithm in all respects. Indeed, the greedy algorithm requires
tracking all paths originating from a given vertex, which uses a lot of
memory and slows the performance significantly. We found it impractical to
apply the greedy algorithm to networks with more than 100 -- 200
vertices. This rules out its use for whole-organism network ordering and limits
its utility to analyzing isolated systems and pathways. In addition, as we
also showed above, it often fails to find the optimum solution, while
properly executed simulated annealing always has a certain probability of
converging to it. That said,
298: there is a grain of biological utility in the ability to
299: determine how many cycles pass through a given link. Indeed, the demands for
300: robustness
301: in evolution of bio-molecular networks may have resulted in a
302: vast redundancy of pathways sending signals along the dominant
303: direction of information flow and thus in a relative scarcity of
304: links going in the opposite direction. Many of these
305: ``backwards'' links simultaneously close up multiple feedback loops.
306: The identification of such highly universal feedback links is
307: facilitated by the first, cycle counting stage of the greedy
308: algorithm.
309:
Often there exists some {\it a priori} knowledge of the hierarchical
positions of certain network nodes. For example, many of the
receptor proteins localized in the membrane pass signals, upon activation, down
signaling cascades made of proteins localized in the cytoplasm and ultimately in the cell's
nucleus. Thus receptor proteins might have to be forcibly placed on the upper levels of the
hierarchical layout of such a signaling network. In contrast
to receptors, many transcription factors serve as effectors of
signaling pathways and thus must occupy the lowest levels of the hierarchy.
Fixing the initial, or possibly permanent, position of such nodes on the hierarchical
levels often helps the algorithm converge to a solution that has fewer feedback links
or is more biologically relevant.
321:
In a similar way, the orientation of certain links (or equivalently, pairs of
nodes) could be quenched, i.e. held fixed, if they are known to be of a feed-forward or
feedback nature. Based on prior knowledge of network functioning,
325: it is also possible to assign a certain weight to a link, so that the energy
326: $E$ of a particular assignment of nodes to layers is a sum of weights of the
327: counter-hierarchical links. Thus the {\it a priori} known plausibility of a
328: link to be (or not to be) a feedback can be introduced into the layering
329: algorithm.
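
As an illustration, a weighted energy of this kind could take the form
\[
E_w=\sum_{i=1}^{L} w_i\,\theta\left(h_{n_i}-h_{m_i}\right),
\]
where $h_v$ is the level index of node $v$, assumed here to grow down the hierarchy, $\theta(x)=1$ for $x\ge 0$ and $0$ otherwise, and $w_i\ge 0$ is the weight of the link $l_i=(n_i,m_i)$. The symbols $h_v$, $\theta$ and $w_i$ are notation introduced for this sketch only. Setting all $w_i=1$ recovers the plain count of counter-hierarchical links; a link believed to be feed-forward could be given $w_i\gg 1$, which makes it costly to declare it a feedback, while a suspected feedback link could be given $w_i<1$.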
330:
It is also possible to improve the visual perception of the layout by
shortening the hierarchical links. In the present version of the algorithm, a ``good'' or
hierarchical link may be arbitrarily long, i.e. go down many levels, without
carrying any energetic penalty. This interferes with identifying the
hierarchical levels as certain stages of network flow. Introducing a small
energetic penalty for particularly long links may alleviate this shortcoming.
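
One possible form of such a penalty, given here only as an illustration with a hypothetical small parameter $\lambda>0$, is
\[
E_\lambda=E+\lambda\sum_{i=1}^{L}\max\left(0,\;h_{m_i}-h_{n_i}-1\right),
\]
where $E$ is the number of counter-hierarchical links and $h_v$ is the level index of node $v$, assumed to grow down the hierarchy. A hierarchical link between adjacent levels costs nothing, and each level it skips costs $\lambda$. Since no link can skip more than $M-2$ levels, the total penalty never exceeds $\lambda L(M-2)$. Choosing $\lambda<1/[L(M-2)]$ therefore ensures that a shorter layout is never bought at the price of an additional counter-hierarchical link, because $E$ changes only in integer steps.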
337:
We leave these questions, as well as the application of these ordering
algorithms to catalytic signaling and transcriptional regulation networks
in the cell, for future studies and publications.
341:
360: \section*{Acknowledgement}
This work was supported by grant 1 R01 GM068954-01 from the NIGMS.
Work at Brookhaven National Laboratory was carried out under
Contract No. DE-AC02-98CH10886, Division of Material Science, U.S.
Department of Energy.
I.I. thanks the Theory Institute for Strongly Correlated and
Complex Systems at BNL for financial support during his
visits.
368:
369: \begin{thebibliography}{}
370:
371: \bibitem[Nikitin {\it et~al}., 2003]{Resnet}
Nikitin, A. et al. (2003) Pathway studio - the analysis and navigation of
molecular networks. {\it Bioinformatics} {\bf 19}, 1-3.
374:
375: \bibitem[Peri {\it et~al}., 2003]{Peri2003} Peri, S. et al. (2003) Development
376: of human protein reference database as an initial platform for approaching
377: systems biology in humans. {\it Genome Research} {\bf 13}, 2363-2371.
378:
379: \end{thebibliography}
380:
381: \end{document}
382: