
1: \documentclass{bioinfo}

2: \copyrightyear{2007}

3: \pubyear{2007}

4:

5: \begin{document}

6: \firstpage{1}

7: \title[Short title]{Detection of the dominant direction of

8: information flow in densely interconnected regulatory networks}

9: %Making network acyclic by removing a minimal number of links}

10: \author[Ispolatov, I. and Maslov, S.]{I. Ispolatov\,$^{\rm a}$\footnote{Permanent address:

11: Departamento de Fisica, Universidad de Santiago de Chile,

12: Casilla 302, Correo 2, Santiago, Chile}, S. Maslov\,$^{\rm b}$

13: %\footnote{to whom

14: %correspondence should be addressed}

15: %, Yuryev\,$^{\rm c}$}

16: }

17: \address{$^{\rm a}$

18: %,$^{\rm c}$

19: Ariadne Genomics Inc., 9430 Key West Ave., Suite 113,

20: Rockville, MD 20850, USA, $^{\rm b}$ Department of Condensed Matter Physics

21: and Materials Science,

22: Brookhaven National Laboratory, Upton, New York 11973, USA}

23: \maketitle

24:

25: \begin{abstract}

26:

27: \section{Motivation:}

28: Finding the dominant direction of flow of

29: information in densely interconnected regulatory or

30: signaling networks is required in many applications in

31: computational biology and neuroscience.

32: This is achieved by first identifying and removing links which close up

33: feedback loops in the original network and then hierarchically

34: arranging nodes in the remaining network. In mathematical language

35: this corresponds to a problem of making a graph acyclic

36: by removing as few links as possible and thus

37: altering the original graph in the least possible way. In practically all

38: applications the exact solution of this problem requires an enumeration of all

39: combinations of removed links, which is computationally intractable.

40: \section{Results:}

41: We introduce and compare two algorithms: the deterministic,

42: ``greedy'' algorithm that preferentially cuts the links that participate in the

43: largest number of

44: feedback cycles, and the probabilistic one based on a simulated

45: annealing of a hierarchical layout of the network which minimizes

46: the number of ``backward'' links going from lower to higher hierarchical levels.

47: We find that the annealing algorithm outperforms the deterministic one in terms of

48: speed, memory requirement, and the actual number of removed links.

49: Implications for systems biology and directions for further research are

50: discussed.

51: \section{Availability:} Source code of the Fortran 90 (F90) and Matlab implementations of these

52: two algorithms are available from the authors upon request.

53: \section{Contact:} \href{mailto:slava@ariadnegenomics.com}{slava@ariadnegenomics.com},

54: \href{mailto:maslov@bnl.gov}{maslov@bnl.gov}

55: \end{abstract}

56:

57: \section{Introduction}

58: During the last several years, a substantial amount of

59: information on large-scale structure of intracellular regulatory

60: networks has been accumulated.

61: However, the growth in our understanding of how these networks manage to

62: function in a robust and specific manner has lagged

63: behind the sheer rate of data acquisition. The fact that these

64: networks are frequently visualized as a giant ``hairball'' (Fig. \ref{fig:01})

65: consisting of a multitude of edges, linking most constituent

66: protein-nodes to each other serves as a striking illustration of

67: the complexity of the issue at hand.

68: \begin{figure}[!tpb]

69: \centerline{\includegraphics[width=3in,angle=0]{fig1.eps}}

70: \centerline{\includegraphics[width=3in,angle=0]{fig1a.eps}}

71: \caption{A part of the post-translational regulatory network in

72: human shown here includes 1671 automatically and manually curated

73: protein modification interactions (phosphorylation, proteolytic cleavage,

74: etc.) between 732 proteins from our ResNet database

75: (\citealp{Resnet}).

76: Panel A contains the ``hairball'' visualization of the

77: network structure emphasizing interconnections between

78: individual pathways. Red edges lie within the strongly connected

79: component of this network consisting of 107 proteins that could

80: all be linked to each other by a path in both directions. This makes

81: any two of these proteins simultaneously upstream and downstream

82: of each other. In Panel B we optimally distribute

83: proteins over a number of hierarchical levels.

84: Red arrows represent 208 putative feedback links going from lower

85: levels of the hierarchy to higher ones, while yellow

86: arrows represent 512 feed-forward links jumping over one or more

87: hierarchical levels.

88: Only proteins and links reachable from one of the

89: 71 receptors placed at the top hierarchical level

90: were included.

91: }\label{fig:01}

92: \end{figure}

93:

94: To understand the functioning or even to efficiently

95: visualize a densely interconnected directed  network it is

96: desirable to determine the dominant direction of information flow

97: and to identify links that go against this flow and thus close feedback loops.

98: %Regulatory edges in such networks might of rather different nature

99: %such correspond to protein modifications

100: %and transcriptional regulations for intracellular bio-molecular

101: %networks, neuron-to-neuron connections in neuronal networks, etc.

102: Ordering a network with respect to the dominant direction of information flow

103: can help to determine its previously unknown inputs and

104: outputs, to

105: %such as receptors and transcription factors in protein graphs,

106: track back hidden sources of perturbations based

107: on their observable downstream effects, etc.

108: A simple-minded hierarchical layout of a densely interconnected

109: network is often impossible due to the ubiquitous presence of feedback loops.

110: Indeed, all nodes in a strongly connected component of a network

111: by definition are simultaneously upstream and downstream of each

112: other. However, if most feedback loops are closed by relatively

113: few feedback signaling links, the dominant direction of information flow could

114: still be reconstructed based on network topology alone.

115: Identification and removal of these relatively infrequent feedback

116: links would enable one to perform a hierarchical layout of the remaining

117: acyclic network which still sufficiently resembles the original one.

118:

119: In this work we consider the problem of identifying the minimum set of

120: links, removal of which would render a graph acyclic. In the next section we

121: introduce two rather different algorithms allowing one to approximately

122: accomplish this goal, a deterministic ``greedy''

123: algorithm and a probabilistic Metropolis annealing algorithm,

124: and compare their performance. We find that the probabilistic algorithm

125: outperforms the deterministic one: it removes fewer

126: links, requires less memory, and runs faster.

127: A simple visual example is provided for

128: the situation when the deterministic algorithm is non-optimal.

129: Following that, we discuss biological implications and applications of our

130: findings as well as how additional constraints such as {\it a priori}

131: knowledge of the function and therefore hierarchical position of certain nodes

132: may affect the algorithm performance.

133:

134: \section{Approach}

135:

136: Consider a graph of $N$ vertices labeled as $1, 2, 3, \ldots, N$ and

137: $L$ directed links labeled by pairs of vertices they connect,

138: $l_i \equiv (n_i, m_i)$. The goal is to remove as few as possible of the

139: links to make the graph acyclic, or feedback-free.

140:

141: An exact

142: way to solve this problem is to sample all possible combinations

143: of links to be removed, starting with enumerating individual links,

144: then pairs of links, etc., until the first acyclic graph is obtained.

145: Evidently, if the removal of $l$ links finally yields an acyclic graph,

146: such sampling would require checking up to

147: $\sum_{i=1}^l \binom{L}{i}$ networks for cycles.

148: For the biologically relevant values

149: of $L\sim 10^3 - 10^4$ and $l\sim 10 - 10^2$ this approach is clearly

150: infeasible.\footnote{From the identity $\sum_{i=0}^{L}\binom{L}{i}=2^{L}$ and the symmetry $\binom{L}{i}=\binom{L}{L-i}$, it follows that $\sum_{i=1}^{L/2}

151: \binom{L}{i}\approx 2^{L-1}$, so that even for a fairly modest $L=10^2$ and

152:   $l=L/2$ the number of such attempts is $\sim 10^{30}$.}
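For very small graphs the exact enumeration described above can still be carried out directly, which is convenient for checking heuristic solutions. The Python sketch below is only an illustration under our own assumptions and is not the implementation used in this work: the function names \texttt{is\_acyclic}, \texttt{exact\_min\_removal}, and \texttt{n\_candidates} are hypothetical, and the acyclicity test uses a standard topological sort (Kahn's algorithm) that the text does not prescribe.

```python
from itertools import combinations
from math import comb


def is_acyclic(links):
    """Return True if the directed graph given as (source, target) pairs has no cycle."""
    nodes = {v for link in links for v in link}
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for n, m in links:
        successors[n].append(m)
        indegree[m] += 1
    # Kahn's algorithm: repeatedly strip vertices with no incoming links.
    ready = [v for v in nodes if indegree[v] == 0]
    stripped = 0
    while ready:
        v = ready.pop()
        stripped += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return stripped == len(nodes)


def exact_min_removal(links):
    """Try removing 0, 1, 2, ... links until the first acyclic graph is found."""
    links = list(links)
    for k in range(len(links) + 1):
        for removed in combinations(range(len(links)), k):
            dropped = set(removed)
            kept = [l for i, l in enumerate(links) if i not in dropped]
            if is_acyclic(kept):
                return [links[i] for i in removed]


def n_candidates(L, l):
    """Number of link subsets of size 1..l, i.e. sum_{i=1}^{l} binom(L, i)."""
    return sum(comb(L, i) for i in range(1, l + 1))


# Hypothetical 3-vertex graph with two cycles sharing the link (3, 1).
toy = [(1, 2), (2, 3), (1, 3), (3, 1)]
print(exact_min_removal(toy))   # [(3, 1)]
print(n_candidates(4, 1))       # 4
```

The number of subsets returned by \texttt{n\_candidates} grows combinatorially with $L$, so this brute-force search is useful only as a reference for graphs with a few tens of links at most.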

153:

154: \subsection{Greedy algorithm}

155:

156: A natural reduction of this exact enumeration approach is a ``greedy''

157: algorithm which performs the ``steepest descent'' in the number of cycles.

158: We implemented the following realization of such a link removal algorithm:

159: \begin{itemize}

160: \item By enumerating all cycles in a graph, each link is assigned a

161: score equal to the number of cycles it is a member of.

162: \item The link with the

163: highest score is removed. When more than one link has the same

164: highest score, the link to be removed

165: is selected at random from among the highest-scored

166: ones.

167: \item This procedure of cycle enumeration and link removal is repeated until

168: no cycles are found.

169: \end{itemize}

170:

171: The cycle enumeration can be implemented by following

172: all paths that originate from a given

173: vertex and recording only the cycles that

174: come back to this vertex. The procedure is repeated for each of the $N$ graph

175: vertices. Each cycle of length $C$ is thus counted $C$ times, once per starting

176: vertex, and the counts are normalized accordingly.
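The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the F90 or Matlab code mentioned in the abstract: the names \texttt{cycle\_scores} and \texttt{greedy\_feedback\_links} are hypothetical, the recursive path following is suitable only for small graphs, and the test graph is a link set chosen to be consistent with the description in the caption of Fig. \ref{fig:03} (the figure itself may differ in detail). Each discovery of a cycle of length $C$ contributes $1/C$ to its links, which is one way to implement the normalization.

```python
import random
from collections import defaultdict
from fractions import Fraction


def cycle_scores(links):
    """Map each link to the number of simple cycles that pass through it."""
    successors = defaultdict(list)
    for n, m in links:
        successors[n].append(m)
    scores = defaultdict(Fraction)

    def follow(start, v, path, visited):
        for w in successors.get(v, ()):
            if w == start:
                cycle = path + [(v, w)]
                # The same cycle is found once from each of its C vertices,
                # so each discovery contributes 1/C to every link on it.
                for link in cycle:
                    scores[link] += Fraction(1, len(cycle))
            elif w not in visited:
                follow(start, w, path + [(v, w)], visited | {w})

    for start in list(successors):
        follow(start, start, [], {start})
    return scores


def greedy_feedback_links(links, seed=0):
    """Repeatedly cut a highest-scoring link until no cycles remain."""
    rng = random.Random(seed)
    links = list(links)
    removed = []
    while True:
        scores = cycle_scores(links)
        if not scores:
            return removed
        top = max(scores.values())
        candidates = [l for l in links if scores.get(l, 0) == top]
        cut = rng.choice(candidates)   # ties are broken at random
        links.remove(cut)
        removed.append(cut)


# Link set chosen to match the description in the caption of Fig. 3
# (an assumption; the figure itself may differ in detail).
fig3_like = [(1, 2), (3, 1), (4, 1), (5, 1),
             (2, 3), (3, 2), (2, 4), (4, 2), (2, 5), (5, 2)]
cuts = greedy_feedback_links(fig3_like)
print(cuts[0], len(cuts))   # (1, 2) 4
```

On this graph the sketch first cuts the link $(1,2)$ and then needs three more cuts, one per remaining 2-node cycle, whereas three cuts suffice if the links $(2,3)$, $(2,4)$, and $(2,5)$ are removed instead.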

177:

178: An example of a network where the greedy algorithm performs flawlessly

179: is shown in Figure \ref{fig:02}.

180:

181: \begin{figure}[!tpb]

182: \centerline{\includegraphics[width=2in,angle=0]{fig2.eps}}

183: \caption{

184: Removal of a single $(3,1)$ link makes this 3-vertex graph

185: acyclic.

186: }\label{fig:02}

187: \end{figure}

188:

189: Here the link $(3,1)$ carries the maximum score 2. Removal of

190: this link indeed makes the graph acyclic, while removal of any link other than

191: $(3,1)$ would require a subsequent removal of a second link to achieve

192: the same goal. However, one would

193: suspect that, like any ``steepest descent'' method, the proposed greedy

194: algorithm, performing a sometimes near-sighted local one-step optimization,

195: may miss the globally optimal solution. This is indeed often the case for

196: bigger

197: and more complex graphs; a fairly simple example is given in

198: Fig. \ref{fig:03}.

199:

200: \begin{figure}[!tpb]

201: \centerline{\includegraphics[width=3in,angle=0]{fig3.eps}}

202: \caption{An example of a network where the greedy algorithm fails to

203:   determine the optimal solution. The link $(1,2)$ carries the highest score 3

204:   and thus is cut first. However, three 2-node cycles

205:   $\{2,3\}$,  $\{2,4\}$, and   $\{2,5\}$ remain to be eliminated, after which

206:   the number of removed links becomes 4. The optimal solution would

207:   be to cut only three links $(2,3)$, $(2,4)$, and $(2,5)$, each carrying the

208:   score 2. This optimal solution has almost always been found by the annealing

209:   algorithm.

210: }\label{fig:03}

211: \end{figure}

212:

213:

214: \subsection{Simulated annealing network ordering}

215:

216: The task of finding the minimum number of links, cutting which makes the

217: graph acyclic, can be interpreted as an optimization problem and

218: tackled by probabilistic methods such as simulated annealing.

219: Evidently, there exists more than one way to define the optimization function,

220: and after exploring several possibilities we converged on the following one:

221: \begin{itemize}

222: \item For a given network,

223: a set of $M$ levels is introduced ($M\leq N$, in reality, $M\ll N$ and is of

224: the order of the graph diameter).

225: Initially, all nodes are distributed on the levels randomly.

226: \item For a particular distribution of nodes on levels, the

227: number of links

228: that go opposite to the hierarchy, that is, from a given level

229: to the same or a higher one, is declared to be the energy $E$ of the

230: distribution,  or the

231: optimization function.

232: \item A node and its new level are selected at random. A difference in energy

233: $\Delta E$

234: that would occur if the node were moved to the new level is calculated. The

235: node is moved to this new level with the probability $\min\{1, \exp (-\Delta

236: E/T)\}$, where $T$ is the temperature.

237: \item After the network has been sampled a sufficient number of times (of the

238:   order of $N \times M$), the temperature is reduced by some factor, usually

239:   0.9. Initially, the temperature is set sufficiently high, usually of the

240:   order of the average node degree $L/N$, to allow un-obstructed level

241:   changes.

242: \item When the temperature becomes low enough to inhibit any level changes,

243:   the remaining ascending and in-level links are declared feedbacks and

244:   removed.

245: \item The whole procedure can be repeated several times to check for

246:   consistency in the assignment of feedback links and to determine the

247:   solution with the fewest removed links.

248: \end{itemize}

249: A change-of-level event and the associated energy difference

250: are illustrated in Fig. \ref{fig:03d}.

251: \begin{figure}[!tpb]

252: \centerline{\includegraphics[width=3in,angle=0]{fig3d.eps}}

253: \caption{Node 1 with two incoming links and one outgoing link is

254: selected to move from its current position on

255: level $j$ to a new position on level $j+2$. The associated energy difference is

256: $\Delta E = -1 -1 +1 = -1$, where the two $-1$ contributions come from making

257: the $(2,1)$ and $(3,1)$ links hierarchical and the single $+1$ contribution comes

258: from turning the link $(1,4)$ from hierarchical to non-hierarchical.

259: }\label{fig:03d}

260: \end{figure}
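The annealing procedure described above can be sketched as follows. This Python fragment is an illustration under stated assumptions rather than the implementation used in this work: the names \texttt{energy} and \texttt{anneal\_levels} are hypothetical, levels are indexed from 0 at the top of the hierarchy so that a hierarchical link goes from a smaller to a larger index (the convention suggested by Fig. \ref{fig:03d}), and the stopping temperature \texttt{t\_min} and the random seed are arbitrary choices.

```python
import math
import random


def energy(links, level):
    """Number of links that do not go strictly down the hierarchy."""
    return sum(1 for n, m in links if level[n] >= level[m])


def anneal_levels(links, n_levels, t_start=None, cooling=0.9, t_min=0.05, seed=0):
    """Assign nodes to levels by Metropolis annealing; return (level, feedback links)."""
    rng = random.Random(seed)
    nodes = sorted({v for link in links for v in link})
    if not nodes:
        return {}, []
    # Links touching each node, so that Delta E is computed locally.
    touching = {v: [] for v in nodes}
    for n, m in links:
        touching[n].append((n, m))
        if m != n:
            touching[m].append((n, m))
    level = {v: rng.randrange(n_levels) for v in nodes}
    # Start at a temperature of the order of L/N, as suggested in the text.
    temp = t_start if t_start is not None else max(1.0, len(links) / len(nodes))
    moves_per_temperature = len(nodes) * n_levels
    while temp > t_min:
        for _ in range(moves_per_temperature):
            v = rng.choice(nodes)
            old, new = level[v], rng.randrange(n_levels)
            before = sum(1 for n, m in touching[v] if level[n] >= level[m])
            level[v] = new
            after = sum(1 for n, m in touching[v] if level[n] >= level[m])
            delta = after - before
            # Metropolis rule: accept with probability min(1, exp(-delta / T)).
            if delta > 0 and rng.random() >= math.exp(-delta / temp):
                level[v] = old   # move rejected
        temp *= cooling
    feedback = [(n, m) for n, m in links if level[n] >= level[m]]
    return level, feedback


# Link set consistent with the caption of Fig. 3 (an assumption).
links = [(1, 2), (3, 1), (4, 1), (5, 1),
         (2, 3), (3, 2), (2, 4), (4, 2), (2, 5), (5, 2)]
level, feedback = anneal_levels(links, n_levels=5)
print(len(feedback), sorted(feedback))
```

Whatever the outcome of a particular run, every link that is kept goes from a smaller to a larger level index, so the remaining graph is acyclic by construction. Repeating the run with different seeds and keeping the shortest \texttt{feedback} list corresponds to the last step of the procedure.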

261:

262: A useful property of this algorithm is that in addition to making a network

263: acyclic, it also produces a hierarchical layout. The

264: number of levels $M$ could be fixed by the requirements for such layout.

265: Otherwise, $M$ could be determined self-consistently, by observing when the

266: number of counter-hierarchical links stops decreasing as

267: the number of levels increases. This is illustrated in Fig. \ref{fig:04},

268: where  a plot of the

269: number of non-hierarchical links vs number of levels is presented for the

270: human protein phosphorylation network.

271:

272: \begin{figure}[!tpb]

273: \centerline{\includegraphics[width=3in,angle=0]{fig4.eps}}

274: \caption{The number of non-hierarchical links vs the number of levels

275: $M$  in the annealing layout of the combined (a union of \citealp{Peri2003}

276: and \citealp{Resnet} datasets) protein phosphorylation network in human cells.

277: The network consists of $L=2880$ links and $N=1297$ nodes (proteins).

278: The nodes with zero in-degree and zero out-degree are always put on the top

279: and bottom levels, respectively. The leftmost data point corresponds to a

280: single intermediate level (3 levels total); the number of non-hierarchical

281: links clearly reaches its minimum of 59 links for $M\ge 18$.

282: }\label{fig:04}

283: \end{figure}

284:

285: \section{Discussion}

286:  In the previous section we introduced two algorithms intended to make a

287:  network acyclic by removing the smallest number of links.  The stochastic

288:  simulated annealing level-ordering algorithm outperforms the deterministic

289:  greedy algorithm in all respects. Indeed, the greedy algorithm requires

290:  tracking along all paths originating from a given vertex, which uses a lot of

291:  memory  and slows the performance significantly. We found it impractical to

292:  apply the greedy algorithm to networks with more than 100 -- 200

293:  vertices. This rules out its use for all-organism network ordering and limits

294:  its utility to analyzing isolated systems and pathways. In addition, as we

295:  also showed above, it often fails to find the optimum solution, while the

296:  properly executed simulated annealing always has a certain probability of

297:  converging to it. That said,

298:  there is a grain of biological utility in the ability to

299:  determine how many cycles pass through a given link. Indeed, the demands for

300:  robustness

301:  in evolution of bio-molecular networks may have resulted in a

302:  vast redundancy of pathways sending signals along the dominant

303:  direction of information flow and thus in a relative scarcity of

304:  links going in the opposite direction. Many of these

305:  ``backwards'' links simultaneously close up multiple feedback loops.

306:  The identification of such highly universal feedback links is

307:  facilitated  by the first, cycle counting stage of the greedy

308:  algorithm.

309:

310: Often there exists some {\it a priori} knowledge of the hierarchical

311: positions of certain network nodes. For example, upon activation many of the

312: receptor proteins localized in the membrane pass signals down

313: signaling cascades made of proteins localized in the cytoplasm and ultimately in the cell's

314: nucleus. Thus receptor proteins might have to be placed by force on the upper levels of the

315: hierarchical layout of such a signaling network. Contrary

316: to receptors, many transcription factors serve the role of effectors of

317: signaling pathways and thus must occupy the lowest levels of the hierarchy.

318: Fixing the initial, or possibly permanent, position of such nodes on the hierarchical

319: levels often helps the algorithm converge to a better solution, one with fewer feedback links

320: or one that is more biologically relevant.

321:

322: In a similar way, the orientation of certain links (or equivalently, pairs of

323: nodes) could be quenched if they are known to be of the feed-forward or

324: feedback nature. Based on the initial knowledge of network functioning,

325: it is also possible to assign a certain weight to a link, so that the energy

326: $E$ of a particular assignment of nodes to layers is a sum of weights of the

327: counter-hierarchical links. Thus the {\it a priori} known plausibility of a

328: link to be (or not to be) a feedback can be introduced into the layering

329: algorithm.
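As a sketch of how such weights could enter, let $h(n)$ denote the level index of node $n$, increasing down the hierarchy (a notation introduced here for illustration only), and let $w_i$ be the weight assigned to the link $l_i \equiv (n_i, m_i)$. The energy of a particular assignment then reads
\begin{equation}
E = \sum_{i=1}^{L} w_i \, \theta\bigl(h(n_i) - h(m_i)\bigr),
\end{equation}
where $\theta(x)=1$ for $x \ge 0$ and $\theta(x)=0$ otherwise, so that in-level links are also counted. Setting all $w_i = 1$ recovers the plain number of counter-hierarchical links.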

330:

331: It is also possible to improve the visual perception of the layout by

332: shortening the hierarchical links. In its present version, a ``good'' or

333: hierarchical link may be arbitrarily long, i.e. go down many levels, without

334: carrying any energetic penalty. This interferes with identifying the

335: hierarchical levels as certain stages of network flow. Introduction of a small

336: energetic penalty for particularly long links may alleviate this shortcoming.
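One possible form of such a penalty, given here only as an illustration, charges each hierarchical link for the levels it skips. With $h(n)$ denoting the level index of node $n$, increasing down the hierarchy, and $\epsilon$ a small hypothetical parameter, the modified energy would be
\begin{equation}
E' = E + \epsilon \sum_{i:\, h(m_i) > h(n_i)} \bigl[ h(m_i) - h(n_i) - 1 \bigr],
\end{equation}
where the sum runs over the hierarchical links $(n_i, m_i)$. Since each term is at most $M-2$, choosing $\epsilon < 1/[L(M-2)]$ for $M>2$ keeps the total penalty below the cost of a single counter-hierarchical link, so the minimal number of feedback links is not affected.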

337:

338: We leave these questions as well as those of particular  application of ordering

339: algorithms to catalytic signaling and transcription regulation cellular

340: networks for future studies and publications.

341:

342: %\vadjust{\vfill\pagebreak}

343:

344:

345:

346: %% \section{Conclusion}

347:

348:

349: %% \begin{enumerate}

350:

351: %% \item this is item, use enumerate

352:

353: %% \item this is item, use enumerate

354:

355: %% \item this is item, use enumerate

356:

357: %% \end{enumerate}

358:

359:

360: \section*{Acknowledgement}

361: This work was supported by 1 R01 GM068954-01 grant from the NIGMS.

362: Work at Brookhaven National Laboratory was carried out under

363: Contract No. DE-AC02-98CH10886, Division of Material Science, U.S.

364: Department of Energy.

365: I.I. thanks the Theory Institute for Strongly Correlated and

366: Complex Systems at BNL for financial support during his

367: visits.

368:

369: \begin{thebibliography}{}

370:

371: \bibitem[Nikitin {\it et~al}., 2003]{Resnet}

372: Nikitin, A. et al. (2003) Pathway studio - the analysis and navigation of

373: molecular networks. {\it Bioinformatics} {\bf 19}, 1-3.

374:

375: \bibitem[Peri {\it et~al}., 2003]{Peri2003} Peri, S. et al. (2003) Development

376:   of human protein reference database as an initial platform for approaching

377:   systems biology in humans. {\it Genome Research} {\bf 13}, 2363-2371.

378:

379: \end{thebibliography}

380:

381: \end{document}

382: