0302:cond-mat0302296/bi.tex

1: \documentclass[twocolumn,pre,aps,showpacs]{revtex4}

2:

3: \usepackage{dcolumn,graphicx,amsmath,amssymb,pxfonts}

4:

5: \newcommand{\mfr}{M_\mathrm{fr}}

6:

7: \begin{document}

8:

9: \title{On network bipartivity}

10:

11: \author{Petter \surname{Holme}}

12: \email{holme@tp.umu.se}

13: \affiliation{Department of Physics, Ume{\aa} University,

14:   901~87 Ume{\aa}, Sweden}

15:

16: \author{Fredrik \surname{Liljeros}}

17: \affiliation{Department of Epidemiology, Swedish Institute for

18:   Infectious Disease Control, 171~82 Solna, Sweden}

19: \affiliation{Department of Sociology, Stockholm University, 106~91

20:   Stockholm, Sweden}

21:

22: \author{Christofer R.\ \surname{Edling}}

23: \affiliation{Department of Sociology, Stockholm University, 106~91

24:   Stockholm, Sweden}

25:

26: \author{Beom Jun \surname{Kim}}

27: \affiliation{Department of Molecular Science

28:   and Technology, Ajou University, Suwon 442-749, Korea}

29:

30: \begin{abstract}

31:   Systems with two types of agents with a preference for heterophilous

32:   interaction produce networks that are more or less close to

33:   bipartite. We propose two measures quantifying the notion of

34:   bipartivity. The two measures---one well-known and natural, but

35:   computationally intractable; one computationally less complex, but

36:   also less intuitive---are examined on model networks that

37:   continuously interpolate between bipartite graphs and graphs with

38:   many odd circuits. We find that the bipartivity measures increase

39:   as we tune the control parameters of the test networks to

40:   intuitively increase the bipartivity, and thus conclude that the

41:   measures are quite relevant. We also measure and discuss the values

42:   of our bipartivity measures for empirical social networks

43:   (constructed from professional collaborations, Internet communities

44:   and field surveys). Here we find, as expected, that networks arising

45:   from romantic online interaction have high, and professional

46:   collaboration networks have low bipartivity values. In some other

47:   cases, probably due to low average degree of the network, the

48:   bipartivity measures cannot distinguish between romantic and

49:   friendship oriented interaction.

50: \end{abstract}

51:

52: \pacs{89.75.Fb, 89.75.Hc, 05.50.+q}

53: %89.75.Fb Structures and organization in complex systems

54: %89.75.Hc Networks and genealogical trees

55: %05.50.+q Lattice theory and statistics (Ising, Potts, etc.)

56:

57: \maketitle

58:

59: \section{Introduction\label{sec:intro}}

60:

61: Any system, natural or man-made, consisting of entities that interact

62: pairwise can be described in terms of a network. Networks in real

63: life often contain some degree of randomness, and also have some

64: structure arising from the strategies or laws the entities follow to

65: make new contacts. Such networks---that can only be described as having

66: both randomness and structure---are called complex networks and have

67: lately received much attention in the physics

68: community~\cite{review,review2}. Among the most important developments

69: in this recent surge of activity in network research is arguably

70: the categorization and quantification of static network structures

71: such as clustering~\cite{WS}, degree distribution~\cite{sf},

72: assortative mixing coefficient~\cite{assmix}, grid

73: coefficient~\cite{grid}, etc. A network with no

74: circuit of odd length is called \textit{bipartite}. Many systems are

75: naturally modeled as bipartite networks: Biochemical networks can be

76: described by vertices representing chemical substances separated by

77: vertices representing chemical reactions~\cite{jeong}. As another

78: example, we have the so called ``two-mode'' representation of

79: affiliation networks where one kind of vertices represents e.g.\

80: organizations and the other type represents individual actors, and the

81: edges indicate to which organizations an actor belongs. But there are

82: also networks that are not necessarily bipartite, but closer to

83: bipartite than what can be expected from a completely random

84: network. Examples of such networks are those that are formed by two

85: types of agents with a preference for heterophilous interaction (human

86: sexual contacts~\cite{liljeros,lea}, and human romance or partnership

87: networks~\cite{partner} being two cases). In many cases one knows the

88: type of the individual vertices (the gender of the actors in the

89: examples above)~\cite{freeman}, but in other cases such information

90: might be lacking (the data studied in Ref.~\cite{HEL} for a concrete

91: example). Nevertheless, the `bipartivity'---how far away from being

92: bipartite a graph is---is a measurable structure; and therefore, we

93: believe, deserves attention.

94:

95: How can we measure bipartivity? The idea we use in this paper is the

96: following: We suppose that all agents of one type tried their best in

97: forming a connection to an agent of the other type. Then we measure to

98: what extent this assumption fails. We can assign a label

99: $\sigma_v\in\{-1,+1\}$ to each vertex $v$ and check for the maximal

100: fraction of edges between vertices of different sign. This fraction

101: will be equal to or higher than the actual fraction of edges between

102: vertices of different type. But, at least for strong heterophilous

103: preference in the network formation, the difference should be

104: small. For weak heterophilous preference this approach will likely

105: fail to produce a correct classification of the individual vertices.

106: Still, the number of even circuits should be larger than in a

107: network created under the same circumstances but with no heterophilous

108: preference; and this will (as we will see) give a higher value of such

109: a bipartivity measure. So even if we cannot reproduce the correct

110: fraction of vertices of different type, we have a measure that is a

111: monotonic function of the strength of the heterophilous

112: preference. It is convenient (at least for people familiar with

113: statistical mechanics) to phrase a problem like this in terms of the

114: antiferromagnetic Ising model. Our bipartivity measure---the maximal

115: fraction of edges between vertices of different sign---is directly

116: related to the ground state energy of the antiferromagnetic Ising

117: model (the relation is given in Sect.~\ref{sec:b1def}). Throughout the

118: paper we will often use the terminology of such spin systems, such as

119: the antiferromagnetic Ising model. For example we talk of an edge

120: between two vertices of the same tag as a `frustrated' edge.

121:

122: The spin system analogy to combinatorial optimization problems such

123: as the one we are facing---to find the minimal fraction of frustrated

124: edges---is nothing new. With this approach the fraction of frustrated

125: edges defines a cost function corresponding to the energy of the spin

126: system. The two most studied problems in this area are the

127: $p$-coloring problem and the graph bisection problem. In the

128: $p$-coloring problem the question is whether or not  the vertices of a

129: graph can be assigned one of $p$ colors in such a way that no edge

130: goes between two vertices of the same color. This problem is solvable

131: in linear time for $p=2$, but NP-complete (i.e.\ in the general case

132: not calculable in polynomial time~\cite{hope}) for $p>2$. The graph

133: bisection problem (also NP-complete) is to partition the vertex-set

134: into two sets of equal size such that the number of edges between the

135: two sets is minimized~\cite{jerrum,schreiber,FA}. Both these problems

136: can, just as ours, be phrased in terms of spin-models with

137: antiferromagnetic interaction. Our minimization problem is a little

138: bit different from the bisection problem in that the two sections can

139: have arbitrary sizes. However, as in the bisection and $p$-coloring

140: problems we are also faced with an NP-complete optimization

141: problem. (Our aim, to find the ground state energy of the

142: antiferromagnetic Ising model, can be mapped to a max-cut

143: problem~\cite{alava}, which is NP-hard on general

144: networks~\cite{karp}.)

145:

146: As the spin models of statistical physics are familiar to statistical

147: physicists, it is not surprising that topics like the Ising and

148: \textsl{XY} models on various model networks~\cite{bw,spin} have received

149: much attention in physicists' network literature. The motivation for

150: such studies, as models of real-world systems, is that they can capture

151: some features of opinion formation or similar social

152: processes~\cite{socstatmech}. The present work can also be described

153: as a study of a spin model on a complex network, but unlike the above

154: mentioned studies, the spin model is used as a tool to measure a

155: static network structure.

156:

157: \section{The measures}

158: In the following sections we will go through the two bipartivity

159: measures. We state the definitions, dissect the algorithms and give

160: analytic discussions about the limit properties.

161:

162: We represent an undirected network by $G=(V,E)$ and a directed network by

163: $G_\mathrm{dir}=(V,A)$, where $V$ is the set of vertices, $E$ is a set

164: of edges (or undirected pairs of vertices), and $A$ is a set of arcs

165: (or ordered pairs of vertices). A \textit{path} of length $l$ is a

166: sequence of vertices $v_1,\cdots,v_l$ such that $(v_i,v_{i+1})\in E$

167: (or $(v_i,v_{i+1})\in A$ for directed graphs); a \textit{circuit} is a

168: path where the first and last vertex are identical. In an

169: \textit{elementary} path, or circuit, no vertex appears twice (except

170: the first and last in case of circuits). In the present paper we will

171: only talk about elementary paths and circuits---so, for brevity we omit

172: the word `elementary.' Throughout the paper, when necessary, we let

173: sub- or superscript `dir' denote directed versions of quantities. In

174: many cases the generalization from undirected to directed networks is

175: straightforward; in these cases we will pursue the discussion in the

176: framework of undirected networks.

177:

178: \subsection{The measure $b_1$}

179: \subsubsection{Definition\label{sec:b1def}}

180: The first measure we consider is simply the fraction of unfrustrated

181: edges in the ground state of the antiferromagnetic Ising model on the

182: network. In terms of the antiferromagnetic Ising model the quantity

183: can be written as

184: \begin{equation}

185:   b_1 = 1-\frac{\mfr}{M}=\frac{1}{2}-\frac{E_0}{2M}~,\label{eq:b1}

186: \end{equation}

187: where $\mfr$ is the number of frustrated edges in the ground state

188: (the usual cost function in the two-coloring problem).

189: $E_0$ is the ground state energy

190: \begin{equation}

191:   E_0=\min_{\{\sigma_v\}} H~,\label{eq:e0}

192: \end{equation}

193: where $H$ is the Hamiltonian of the antiferromagnetic Ising model:

194: \begin{subequations}

195: \begin{eqnarray}

196: H&=&\sum_{(v,w)\in  E} \sigma_v\sigma_w\\

197: H_\mathrm{dir}&=&\sum_{(v,w)\in  A} \sigma_v\sigma_w

198: \end{eqnarray}

199: \end{subequations}

200: The directed quantity is obtained by substituting $H$ by

201: $H_\mathrm{dir}$ in Eqs.~(\ref{eq:b1}) and (\ref{eq:e0}), and edges by

202: arcs in the above discussion. The topology of the energy landscape is

203: determined by the underlying network, and can in general be very

204: complex~\cite{barahona}.
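
For very small graphs the definition can be checked directly by enumerating all spin assignments. The following is a minimal sketch under stated assumptions, not the method used in this paper: the function name \texttt{b1\_brute\_force} and the edge-list representation (vertices numbered from $0$) are hypothetical choices made for the illustration. It uses $E_0=2\mfr-M$, which follows from the Hamiltonian above since a frustrated edge contributes $+1$ and an unfrustrated edge $-1$.

```python
from itertools import product


def b1_brute_force(n_vertices, edges):
    """Exact b_1 and E_0 by exhaustive enumeration (tiny graphs only).

    edges is a non-empty list of pairs (v, w) with 0 <= v, w < n_vertices.
    """
    m = len(edges)
    best_frustrated = m
    # Fix the spin of vertex 0: flipping all spins leaves the energy unchanged,
    # so only 2^(N-1) assignments need to be visited.
    for rest in product((-1, 1), repeat=n_vertices - 1):
        sigma = (1,) + rest
        # An edge is frustrated when both end points carry the same sign.
        frustrated = sum(1 for v, w in edges if sigma[v] == sigma[w])
        best_frustrated = min(best_frustrated, frustrated)
    e0 = 2 * best_frustrated - m  # ground state energy of H = sum sigma_v sigma_w
    return 1 - best_frustrated / m, e0


triangle = [(0, 1), (1, 2), (2, 0)]          # one odd circuit
square = [(0, 1), (1, 2), (2, 3), (3, 0)]    # bipartite
print(b1_brute_force(3, triangle))
print(b1_brute_force(4, square))
```

The exponential cost of this enumeration is the reason a stochastic minimization is used for networks of realistic size.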

205:

206: \subsubsection{Limit properties}

207:

208: The $b_1$ measure takes values in the interval $[1/2,1]$. The upper

209: bound is attained for bipartite graphs. It is easy to see that $b_1$

210: cannot be lower than $1/2$: Consider a ground state configuration for

211: which the opposite is true. Then there must be at least one vertex

212: with more than half of its edges frustrated. Flipping this spin

213: would reduce the energy, which contradicts the fact that the system is

214: in the ground state. We do not know if this bound is realized for any

215: finite graphs, but $b_1=1/2$ is the limit value for $b_1$ for a fully

216: connected graph as $N\rightarrow\infty$: Partition the fully connected

217: graph $K_N$ of $N$ vertices (and $M=N(N-1)/2$ edges) into one set of

218: $N'$ and one set of $N-N'$ vertices and assign opposite spins to the

219: elements of these sets. The number of frustrated edges is precisely

220: the number of edges within each set which is:

221: \begin{eqnarray}

222:   \mfr(K_N)&=&\frac{N'(N'-1)}{2}+\frac{(N-N')(N-1-N')}{2}

223:   \nonumber\\&=&M-N'(N-N')~.

224: \end{eqnarray}

225: Thus the minimum number of frustrated edges is exactly $N^2/4-N/2$ for

226: $N'=N/2$, and the fraction of unfrustrated edges is

227: \begin{equation}

228:   b_1 = \frac{1}{2-2/N}\rightarrow\frac{1}{2}

229:   \mbox{~as~} N\rightarrow\infty~.

230: \end{equation}

231: The above arguments can be generalized to directed networks

232: straightforwardly.
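
As a concrete check of these expressions, take $K_4$ (so $N=4$ and $M=6$) with $N'=2$. Each of the two sets then contains exactly one edge, and
\[
  \mfr(K_4)=\frac{2\cdot 1}{2}+\frac{2\cdot 1}{2}=2=\frac{N^2}{4}-\frac{N}{2}~,\qquad
  b_1=1-\frac{2}{6}=\frac{2}{3}=\frac{1}{2-2/4}~.
\]
The remaining four edges connect the two sets and are unfrustrated. The alternative split $N'=1$ gives $\mfr=6-1\cdot 3=3$, so $N'=2$ is indeed the minimum.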

233:

234: \subsubsection{Minimization by exchange Monte Carlo}

235: The complexity of the ``energy landscape'' of the antiferromagnetic

236: Ising model on an arbitrary network is difficult to judge \textit{a

237:   priori}. There are indications that no natural network would be

238: too hard for a regular simulated annealing

239: approach~\cite{simann,jerrum}. To be on safer ground, we use a Monte Carlo

240: scheme that is evidently very efficient to sweep even an extremely

241: `rugged' energy landscape without getting stuck in local minima---the

242: so called exchange Monte Carlo (XMC)~\cite{xmc}. The idea of exchange

243: Monte Carlo is to run standard Metropolis Monte Carlo for $N_T$

244: replicas of the system, each at a specific temperature. Then from time

245: to time two replicas at adjacent temperatures are compared, and with a

246: probability\begin{equation}

247:   P_\mathrm{exch.}=\left\{\begin{array}{ll} 1 & \mbox{if $\Delta<0$}\\

248:       e^{-\Delta} & \mbox{otherwise}\end{array}\right.~,

249: \end{equation}

250: where

251: \begin{equation}

252:    \Delta=\left(\frac{1}{T}-\frac{1}{T'}\right)(E'-E)~,

253: \end{equation}

254: and $E$ is the energy of the configuration at temperature $T$

255: (similarly for $T'$ and $E'$), and $T<T'$,

256: the two replicas are swapped between the temperatures. This condition

257: is designed so that the Monte Carlo scheme preserves the Boltzmann

258: distribution. This is not decisive for us, since we are looking for the

259: ground state energy rather than performing a proper sampling of the

260: configuration space, but we keep it in our measurements anyway. Besides just

261: running the XMC scheme we also periodically quench the system,

262: i.e.\ we sweep through all vertices of the network consecutively and flip

263: spins that lower the energy. The sweeps are continued until a sweep

264: with no spin-flips has occurred. For later reference we introduce the

265: notations $t_\mathrm{avg}$ for the total number of MC sweeps---we

266: refer to the number of MC sweeps as `time'---$t_\mathrm{quench}$ for

267: the time between each quench, $t_\mathrm{exch}$ for the time between

268: exchange trials, $t_\mathrm{measure}$ for the time between measurement

269: sweeps (where the energy is sampled).
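
The exchange test and the quench described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the adjacency-list representation \texttt{adj} (a list of neighbor lists) and the in-place spin list \texttt{sigma} are hypothetical. The Metropolis sweeps and the bookkeeping of $t_\mathrm{exch}$, $t_\mathrm{quench}$ and $t_\mathrm{measure}$ are left out.

```python
import math


def exchange_probability(e_low, t_low, e_high, t_high):
    """Swap probability for two replicas with t_low < t_high.

    e_low and e_high are the current energies of the replicas at t_low and
    t_high; Delta = (1/T - 1/T') (E' - E) as in the text.
    """
    delta = (1.0 / t_low - 1.0 / t_high) * (e_high - e_low)
    return 1.0 if delta < 0 else math.exp(-delta)


def quench(adj, sigma):
    """Sweep the vertices in order and flip every spin that lowers the energy.

    Sweeps are repeated until one full sweep produces no flip. The list sigma
    (entries +1 or -1) is modified in place and returned.
    """
    flipped = True
    while flipped:
        flipped = False
        for v in range(len(adj)):
            # Sum of neighbor spins; flipping v changes H by -2*sigma[v]*field.
            field = sum(sigma[w] for w in adj[v])
            if sigma[v] * field > 0:
                sigma[v] = -sigma[v]
                flipped = True
    return sigma


# A path 0-1-2 started with all spins up ends in an alternating state.
print(quench([[1], [0, 2], [1]], [1, 1, 1]))
```

Each accepted flip strictly lowers the energy, so the quench terminates after a finite number of sweeps.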

270:

271: For the exchange Monte Carlo scheme to efficiently sample the

272: configuration space all replicas need to tour the whole range of

273: temperatures in a reasonably short time. At the same time one would

274: not like the exchange trials, at any neighboring temperatures, to be

275: constantly affirmative---then the separation of the two temperatures

276: would be of no use. We follow Ref.~\cite{xmc} and choose the

277: temperature set

278: \begin{equation}

279:   T_i=T_\mathrm{low}\left(

280:   \frac{T_\mathrm{high}}{T_\mathrm{low}}

281:   \right)^{(i-1)/(N_T-1)}~,

282: \end{equation}

283: where $1\leq i\leq N_T$ enumerates the replicas. $T_\mathrm{low}$ is

284: the lowest and $T_\mathrm{high}$ the highest temperature,

285: respectively. To find the actual parameter values (which will be

286: stated in Secs.\ \ref{sec:mod2} and \ref{sec:real_res}) one has to

287: check that the replicas travel throughout the temperature range with

288: reasonable exchange ratios for all temperature gaps.
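
The temperature set is a geometric sequence, so neighboring temperatures have a constant ratio. A short sketch, where the function name and the numerical values in the example are illustrative assumptions and not the parameters used in this paper:

```python
def temperature_ladder(t_low, t_high, n_t):
    """Geometric temperature set T_1..T_{n_t} with T_1 = t_low, T_{n_t} = t_high.

    Requires n_t >= 2 and 0 < t_low < t_high.
    """
    ratio = t_high / t_low
    return [t_low * ratio ** ((i - 1) / (n_t - 1)) for i in range(1, n_t + 1)]


# Five replicas between 0.5 and 8: each temperature is twice the previous one.
print(temperature_ladder(0.5, 8.0, 5))
```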

289:

290: \begin{figure}

291:   \centering{\resizebox*{8cm}{!}{\includegraphics{ex.eps}}}

292:   \caption{Some graphs in the discussion of the $b_2$ quantity. The

293:     coloring of the vertices minimizes $\mfr$. Black edges indicate

294:     frustration. (a) An almost bipartite graph with many

295:     triangles. (b) A graph where all odd-circuits contribute to the

296:     frustration. (c) A graph where only the shortest circuits

297:     contribute to the frustration.}

298:   \label{fig:ex}

299: \end{figure}

300:

301: \subsection{The measure $b_2$}

302: Apart from finding an approximate value of $b_1$, one can also

303: define a quantity that is exactly computable in polynomial time. Our

304: intention is primarily not to make a heuristic algorithm for

305: calculating $b_1$, but rather to define a quantity that captures the same

306: structure, i.e.\ that grows monotonically with $b_1$.

307:

308: That a graph contains no odd circuits is the defining property of

309: bipartiteness~\cite{intro}. It is thus natural that we base a

310: bipartivity measure on an odd-circuit count in some

311: way. Unfortunately, defining a quantity in this way becomes a little

312: bit more complicated than at first expected. One complication is

313: that a graph can be very close to bipartite and still contain many

314: odd-circuits (see Fig.~\ref{fig:ex}(a)). A way of dealing with this

315: problem is to mark as few edges as possible such that each odd circuit

316: contains at least one marked edge. In many cases a marked edge will

317: correspond to a frustrated edge of the ground state of the

318: antiferromagnetic Ising model. In Fig.~\ref{fig:ex}(a) only the upper,

319: horizontal edge needs to be marked. Another problem one faces is

320: how to deal with odd circuits of different length---in a network with

321: very few odd circuits a circuit of, say, length seven would contribute

322: as much to the global frustration of the network as a triangle (a

323: subgraph of three adjacent vertices---see

324: Fig.~\ref{fig:ex}(b)). But in many real networks the total length of

325: the odd circuits is very large (this is true for all networks we

326: measure, see Sect.~\ref{sec:rwn}), much larger than $M$. In these cases

327: the short circuits are in general the most important in determining

328: the ground state configuration. For example, in Fig.~\ref{fig:ex}(c)

329: $M=23$, and while we have 11 triangles, summing the lengths of all odd

330: circuits gives 218 (33 from the 11 triangles, 45 from the nine circuits of

331: length five, and so on). However, only the triangles contribute to the

332: ground state configuration in the sense that each triangle has the

333: same configuration as the ground state of an isolated triangle, while

334: all odd circuits of length larger than four (e.g.\ the periphery) do not have

335: the best coloring for a circulant of that length. To deal with this we

336: need to weight short circuits higher than long. We will do this by

337: assigning a cut-off length and neglect all circuits exceeding this

338: length.

339:

340: \subsubsection{Definition\label{sec:def}}

341:

342: Now, we make an algorithm of the above ideas as follows: Let $C_n$ be

343: the set of odd circuits of length $\leq n$. Let $\Sigma(C_n)$ be the

344: accumulated length of the circuits in $C_n$ (so, for example

345: $\Sigma(C_3)=3$ in Fig.~\ref{fig:ex}(b)). Now we assign the cut-off

346: $M$ to $\Sigma(C_n)$, and let $\hat{n}$ be the smallest $n$

347: such that $\Sigma(C_n)\geq M$. Next we turn to the marking procedure

348: sketched above. Let $\nu(e)$ denote the number of circuits in

349: $C_{\hat{n}}$ passing through the edge $e$. Clearly edges of

350: high $\nu$ are likely to be frustrated in the ground state

351: (viz.\ Fig.~\ref{fig:ex}(a)). We now estimate $\mfr$ roughly

352: as the number of edges that have to be marked so that each odd circuit

353: of length $\leq\hat{n}$ is marked at least once. To be precise we

354: perform the following algorithm:

355: \begin{enumerate}

356: \item Start with $C=C_{\hat{n}}$.

357: \item Sort the edges in order of $\nu$.\label{step:0}

358: \item Repeat the following while $C\neq\varnothing$:\label{step:1}

359: \begin{enumerate}

360: \item \label{step:a} Mark the edge $e$ with highest $\nu$.

361: \item \label{step:b} Remove all circuits in $C$ containing $e$.

362: \item \label{step:c} Recalculate $\nu$ for each edge.

363: \end{enumerate}

364: \end{enumerate}

365: Then the number of iterations $m'$ is the assessment of $\mfr$, and we

366: define our bipartivity measure as

367: \begin{equation}

368:   b_2=1-\frac{m'}{M}~.

369: \end{equation}
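
The marking procedure can be sketched as follows, assuming that the odd circuits in $C_{\hat{n}}$ have already been found and are given as vertex sequences. This is an illustration under stated assumptions, not the authors' code: the names \texttt{edge\_key} and \texttt{b2\_from\_circuits} are hypothetical, ties are broken deterministically by sorted order (the text picks a tied edge at random), and $\nu$ is recomputed from scratch in every iteration for clarity rather than updated incrementally.

```python
from collections import Counter


def edge_key(v, w):
    """Canonical representation of the undirected edge between v and w."""
    return (v, w) if v <= w else (w, v)


def b2_from_circuits(circuits, n_edges):
    """Greedy marking of edges until every circuit contains a marked edge.

    circuits: list of vertex sequences; the edge from the last vertex back to
    the first one is implied. n_edges: total number M of edges in the graph.
    Returns (b_2, list of marked edges).
    """
    remaining = [
        {edge_key(c[i], c[(i + 1) % len(c)]) for i in range(len(c))}
        for c in circuits
    ]
    marked = []
    while remaining:
        # nu(e): number of not yet marked circuits passing through e.
        nu = Counter(e for circ in remaining for e in circ)
        best = max(sorted(nu), key=lambda e: nu[e])
        marked.append(best)
        # Remove every circuit that contains the newly marked edge.
        remaining = [circ for circ in remaining if best not in circ]
    return 1 - len(marked) / n_edges, marked


# Two triangles sharing the edge (0, 1): marking that single edge suffices.
print(b2_from_circuits([(0, 1, 2), (0, 1, 3)], n_edges=5))
```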

370:

371: This algorithm is not an attempt to actually identify the frustrated

372: edges, rather it is supposed to give a high $\mfr$ for a system with

373: high (total) geometric frustration, and vice versa: Firstly, it does

374: not necessarily find the minimal number of edges needed to be marked

375: for all odd circuits of length at most $\hat{n}$ to contain a marked

376: edge. But we expect this steepest descent optimization to come close

377: in most cases. Secondly, an odd circuit can in reality only have an odd

378: number of frustrated edges, but in the algorithm there is no such

379: restriction on the number of marked edges.

380:

381: In case there is more than one edge with the highest $\nu$ (in step

382: \ref{step:a} of the algorithm) we choose the edge to mark at

383: random. The variance between different random seeds turns out to be

384: negligible in most cases. We will run the algorithm for different seeds to

385: choose the highest $b_2$ value, and get an idea about the error in

386: $b_2$ from the selection of edge to mark. An alternative (and more

387: ambitious) approach would be to iterate the whole calculation until

388: the highest $b_2$ has reappeared a fixed number of times (cf.\

389: \cite{ww}).

390:

391: If we assume a sparse network (i.e.\ $N\propto M$) the running time of

392: the algorithm above is $O(M^2)$. To see this we first note that there

393: can be at most $O(M)$ iterations at step~\ref{step:1}. To find the

394: edge with highest $\nu$ (in step~\ref{step:a}) we do not need to sort

395: all edges more than once (as done in step~\ref{step:0}). Instead we

396: can find this out while recalculating $\nu$ (in

397: step~\ref{step:c}). Removing all circuits containing $e$ (as in

398: step~\ref{step:b}) can be done in time bounded by the total length of

399: circuits containing $e$, which cannot be larger than

400: $3M$. Step~\ref{step:c} also needs to go through all circuits passing

401: $e$ and thus needs the same running time as step~\ref{step:b}. To sum

402: this up the running time for this section of the algorithm is of order

403: $N^2$.

404:

405: \subsubsection{Limit properties}

406:

407: In the $N\rightarrow\infty$ limit the $b_2$ measure lies in almost the

408: same interval as $b_1$. The upper limit $b_2=1$ is attained if and

409: only if the graph is bipartite. (If the graph is bipartite

410: $C_{\hat{n}}$ is empty and $\nu(a)=0$ for all $a$, so $m'=0$ and

411: $b_2=1$. If there exist odd circuits, $m'\geq 1$, so $b_2<1$.) $b_2$

412: cannot be as low as 0 (if one marks all edges, every circuit must be

413: marked). Since the $b_2$-definition is inspired by the ground-state

414: configuration of the antiferromagnetic Ising model, we expect a

415: similar lower bound to $b_2$ as to $b_1$. In Appendix~\ref{sec:bound}

416: we argue that the lower bound on the $b_2$, as for the $b_1$ measure,

417: is $1/2$ in the $N \rightarrow \infty$ limit.

418:

419:

420: \subsubsection{The complete algorithm}

421: So far we have overlooked the central part in calculating the $b_2$

422: measure---namely to find odd circuits. To do this we use a modified

423: version of Johnson's algorithm~\cite{johnson}. In principle Johnson's

424: algorithm is a depth first search where, to avoid futile searching,

425: some vertices are blocked while stepping down the search tree. The

426: running time for Johnson's algorithm is $O(M(C+1))$ (if $M>N$) where

427: $C$ is the total number of circuits. Now $C$ can grow fast with

428: $N$ which would make the finding of all odd circuits a quite

429: intractable computation. In many cases the cut-off of the circuit length,

430: that we introduced above to give less priority to long circuits, saves

431: us by setting a limit on the search depth. To implement this we

432: let $\bar{n}$ be the current upper bound on circuit length (or search

433: depth), and $\bar{\Sigma}$ be the current sum of odd circuits

434: $\leq\bar{n}$. As soon as $\bar{\Sigma}\geq M$ we iteratively

435: decrease $\bar{n}$ by $2$ and recalculate $\bar{\Sigma}$ until

436: $\bar{\Sigma}<M$. If $\bar{\Sigma}< M$ when the search is over we

437: rerun the procedure where we use $\bar{n}+2$ as our new (fixed)

438: $\bar{n}$~\cite{note:alt}. When the search is over we assign

439: $\hat{n}$ the value $\bar{n}$. For dense bipartite graphs the

440: algorithm is intractable. In the worst case, the complete bipartite

441: graph, $K_{N/2,N/2}$, there are

442: \begin{equation}

443:   C(K_{N/2,N/2})= \sum_{k=4}^{N}\frac{1}{k}

444:   \left[\frac{(N/2)!}{(N/2-k/2)!}\right]^2

445: \end{equation}

446: circuits (where the sum is over even values of $k$)~\cite{note:bip}

447: giving a running time of $O(N^2C(K_{N/2,N/2}))$. One can of course

448: decide whether or not a graph is bipartite in linear time, but

449: non-bipartite cases of similar complexity are easily constructed (by,

450: e.g., adding an isolated triangle). In practice these worst cases are,

451: probably, very rare---a, relatively speaking, very low density of odd

452: circuits is needed to get a large $\hat{n}$---even in the real-world

453: network with highest bipartivity we have $\hat{n}=3$. In this case

454: ($\hat{n}=3$) all odd circuits are found in $O(M^2)$ time.

455:

456: Now we turn to a more complete description of the algorithm. Johnson's

457: algorithm takes the `least' (smallest in some enumeration) vertex in a

458: strongly connected subgraph as its starting point. To find strongly

459: connected components we use the algorithm in Ref.~\cite{SCC}. To sum

460: up, the algorithm reads:

461: \begin{enumerate}

462: \item Mark all vertices as unchecked.

463: \item While there are unchecked vertices, iterate the

464:   following:\label{step:wh}

465: \begin{enumerate}

466: \item Pick an unchecked vertex $v$.

467: \item Find the largest strongly connected component $\Lambda_v$

468:   containing $v$.

469: \item Set $\Lambda:=\Lambda_v$ and repeat the following steps as long

470:   as $\Lambda\neq\varnothing$:

471: \begin{enumerate}

472: \item Pick the least vertex $u$ of $\Lambda$.

473: \item Call a subroutine implementing the modified Johnson's

474:   algorithm. Recalculate $\bar{n}$ and add $C_{\bar{n}}$ to a list

475:   $\mathcal{C}$. Delete circuits longer than $\bar{n}$ from $\mathcal{C}$.

476: \item Delete $u$ from $\Lambda$.

477: \end{enumerate}

478: \end{enumerate}

479: \item Set $\hat{n}:=\bar{n}$.

480: \item Run the algorithm described above (in Sect.~\ref{sec:def}) to

481:   mark edges and calculate $b_2$.\label{step:calc}

482: \end{enumerate}

483: In all cases, step~\ref{step:wh} sets the limit on running time. As

484: mentioned, in most applications we expect the running time of

485: step~\ref{step:wh} to be $O(M^2)$ (similarly to that of

486: step~\ref{step:calc}).
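
To illustrate how a cut-off on the circuit length bounds the search depth, the following sketch enumerates the elementary odd circuits of length at most $\bar{n}$ in an undirected graph. It is a plain depth-limited depth-first search and deliberately omits the vertex-blocking mechanism of Johnson's algorithm as well as the adaptive update of $\bar{n}$. The function name \texttt{odd\_circuits} and the dictionary-of-sets representation are assumptions made for the example. As in the procedure above, each circuit is generated from its least vertex.

```python
def odd_circuits(adj, n_max):
    """Elementary odd circuits of length <= n_max in an undirected graph.

    adj maps each vertex (an int) to the set of its neighbors. Each circuit is
    returned once, as a tuple of vertices starting at its least vertex.
    """
    found = []

    def extend(path, on_path):
        last = path[-1]
        for w in adj[last]:
            if w == path[0]:
                # Closing edge. Keep circuits of odd length only, and keep one
                # of the two orientations of each undirected circuit.
                if len(path) >= 3 and len(path) % 2 == 1 and path[1] < path[-1]:
                    found.append(tuple(path))
            elif w > path[0] and w not in on_path and len(path) < n_max:
                # Only vertices larger than the start vertex are visited, so a
                # circuit is found only from its least vertex.
                on_path.add(w)
                path.append(w)
                extend(path, on_path)
                path.pop()
                on_path.remove(w)

    for start in sorted(adj):
        extend([start], {start})
    return found


# The complete graph on four vertices contains four triangles.
k4 = {v: {w for w in range(4) if w != v} for v in range(4)}
triangles = odd_circuits(k4, 3)
print(len(triangles), sum(len(c) for c in triangles))
```

The accumulated length printed last corresponds to $\Sigma(C_3)$ for this small example.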

487:

488: \section{The Networks}

489:

490: \begin{figure}

491:   \centering{\resizebox*{8cm}{!}{\includegraphics{mo.eps}}}

492:   \caption{Construction of the test networks. (a) shows the

493:     generalization of the ER model (Model 1). (b) shows interpolation

494:     between quadratic and triangular lattices (Model 2). (c) shows the

495:     model with predominantly longer circuits (Model 3). All models

496:     are bipartite for $r_{1,2,3}=1$. Additional edges create

497:     odd circuits (frustration) for lower $r_{1,2,3}$-values. The black

498:     lines illustrate these additional edges. The white and non-white

499:     vertices symbolize a partition giving $b_1=1$ in the $r_{1,2,3}=1$

500:     case (it is not meant to represent the optimal coloring when

501:     $r_{1,2,3}<1$).

502:   }

503:   \label{fig:mo}

504: \end{figure}

505:

506: \subsection{Test networks with tunable bipartivity\label{sec:mod}}

507: To test and compare the $b_1$ and $b_2$ quantities we construct three

508: types of test networks where the bipartivity can be tuned by model

509: parameters. The principle behind all models is to start from

510: bipartite networks and add a smaller or larger number of edges within a

511: partition to create odd circuits.

512:

513: One type (Model 1) is a quite straightforward generalization  of the

514: Erd\H{o}s-R\'enyi (ER) model~\cite{ER}: We partition the vertices into two

515: disjoint sets of sizes $\tilde{N}$ and $N-\tilde{N}$. Then we add $r_1

516: M$ edges randomly between vertices of the different sets, and

517: $(1-r_1)M$ edges regardless of what set the vertices belong to (see

518: Fig.~\ref{fig:mo}(a)). In this way we interpret $r_1$ as the

519: strength of the heterophilous preference in a model where bipartivity

520: is the only structural bias. The choice of vertex pairs

521: is made at random, the only restriction being that loops and

522: multiple edges are not allowed. If $r_1=0$ the model reduces to the ER

523: model, while for $r_1=1$ the networks are bipartite (cf.\

524: Ref.~\cite{nws}).  This model is probably the most random (i.e.\

525: having least structural biases) model with tunable bipartivity. The

526: disadvantage is that the expectation values of $b_1$ and $b_2$ are

527: hard to calculate (even in the frustrated limit $r_1=0$).
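
A possible construction of Model 1 is sketched below. The function name \texttt{model1}, its argument names and the use of rejection sampling are assumptions made for this illustration; the rounding of $r_1M$ to an integer is also a choice of the sketch. The caller has to make sure that the requested numbers of edges fit into the graph, otherwise the loop does not terminate.

```python
import random


def model1(n, n_tilde, m, r1, seed=None):
    """Random graph with r1*m edges between the two sets and the rest anywhere.

    Vertices 0..n_tilde-1 form one set, the remaining n-n_tilde the other.
    Returns (set of edges (v, w) with v < w, list giving the set of each vertex).
    """
    rng = random.Random(seed)
    side = [0] * n_tilde + [1] * (n - n_tilde)
    edges = set()
    m_between = round(r1 * m)
    while len(edges) < m:
        v, w = rng.sample(range(n), 2)      # two distinct vertices: no loops
        if len(edges) < m_between and side[v] == side[w]:
            continue                        # the first r1*m edges must cross
        edges.add((min(v, w), max(v, w)))   # the set rejects multiple edges
    return edges, side


edges, side = model1(n=20, n_tilde=10, m=30, r1=1.0, seed=1)
# For r1 = 1 every edge connects the two sets, so the graph is bipartite.
print(all(side[v] != side[w] for v, w in edges))
```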

528:

529: Model 2 interpolates between two-dimensional square- and triangular

530: lattices. We start, for $r_2=0$, with a triangular grid with periodic

531: boundary condition. Let $L$, the linear dimension of the system (i.e.\

532: $N=L^2$), be even. For a non-zero parameter value we delete (uniformly

533: at random) $r_2L^2$ of the `diagonal' edges that create frustration, as

534: illustrated in Fig.~\ref{fig:mo}(b). To be more precise, if we index

535: the vertices as $(i_x,i_y)$, $1\leq i_x,i_y\leq L$; then the edges are

536: $[(i_x,i_y),(i_x+1,i_y)]$ and $[(i_x,i_y),(i_x,i_y+1)]$ (giving the

537: square grid) plus $(1-r_2)L^2$ edges of the form

538: $[(i_x,i_y+1),(i_x+1,i_y)]$ chosen uniformly at random (addition is

539: modulo $L$). This model has a high density of short circuits. The

540: extremes $r_2=0$ and $r_2=1$ represent two generic lattice types. The

541: symmetries of the regular networks simplify the calculations of

542: e.g.\ limit properties for the bipartivity measures. If $r_2=1$ the

543: system is bipartite (note that $L$ has to be even for this to hold) so

544: $b_{1,2}=1$. When $r_2=0$ we have $b_1=b_2=2/3$:  For the lower limit

545: of the $b_1$ quantity, see Ref.~\cite{WAHO}. For the lower limit $b_2$

546: we note that $\Sigma(C_3)=6N$ (since each vertex can be associated

547: with two triangles). This gives $\hat{n}=3$ and $\nu=2$ for all

548: edges. Now it is enough to mark $N$ edges (e.g.\ all

549: $[(i_x,i_y),(i_x+1,i_y)]$ edges). In this case we note that each edge

550: will have $\nu=2$ when it is marked, which means that the marking

551: sequence is optimal and that the number of iterations cannot be smaller

552: with another choice of edges to mark. So $b_2=1-N/(3N)=2/3$. The major

553: disadvantage of Model 2 is that the average degree is a function of

554: $r_2$ ($M=(3-r_2)L^2$). This change in the average degree can make it

555: harder to separate effects of the shift in bipartivity from the shift

556: in average degree.

557:
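The construction of Model 2 can be illustrated by a minimal Python sketch. It is not the code used for the measurements; the function name \texttt{build\_model2}, the zero-based vertex indices, the restriction $L\geq 4$ (to avoid multiple edges) and the rounding of $r_2L^2$ to an integer are assumptions made here for illustration.

```python
import random

def build_model2(L, r2, seed=None):
    """Return the edge set of a Model 2 lattice (illustrative sketch).

    Vertices are pairs (ix, iy) with 0 <= ix, iy < L (zero-based here,
    unlike the one-based indexing of the text); addition is modulo L.
    """
    if L % 2 != 0 or L < 4:
        raise ValueError("L must be even and at least 4")
    rng = random.Random(seed)
    edges = set()
    diagonals = []
    for ix in range(L):
        for iy in range(L):
            right = ((ix + 1) % L, iy)
            up = (ix, (iy + 1) % L)
            # Square-grid edges, always present.
            edges.add(frozenset([(ix, iy), right]))
            edges.add(frozenset([(ix, iy), up]))
            # 'Diagonal' edge [(ix, iy+1), (ix+1, iy)] of the triangular grid.
            diagonals.append(frozenset([up, right]))
    # Delete round(r2 * L^2) diagonals uniformly at random, keep the rest.
    n_keep = len(diagonals) - round(r2 * L * L)
    edges.update(rng.sample(diagonals, n_keep))
    return edges
```

With this sketch the number of edges is $(3-r_2)L^2$, and for $r_2=1$ every edge joins vertices whose coordinate sums $i_x+i_y$ have opposite parity, which is the two-coloring of the square grid.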

558: In both Model 1 and (even more so) Model 2, triangles will dominate

559: the set of odd circuits. To test networks with predominantly longer

560: circuits we construct Model 3 as follows (see Fig.~\ref{fig:mo}(c)):

561: We make two circulants of size $N/2$ with the vertices

562: $\{v_1^i,\cdots,v_{N/2}^i\}$ and edges

563: $\{(v_1^i,v_2^i),\cdots,(v_{N/2-1}^i,v_{N/2}^i), (v_{N/2}^i,v_1^i)\}$,

564: $i\in\{1,2\}$. Then we add $M_\mathrm{trans}$ transverse edges between

565: the circulants. $M_\mathrm{trans}/2$ of these edges are placed at

566: equal spacing $N/M_\mathrm{trans}$, dividing the double

567: circulant into $M_\mathrm{trans}/2$ `sectors.' Then we fill up each

568: sector with another transverse edge: With probability $r_3$ we add an

569: $(v^1_i,v^2_i)$ edge (such that $(v^1_i,v^2_i)$ is none of the

570: previously added transverse edges), otherwise we add a

571: $(v^1_i,v^2_{i+1})$ edge (addition modulo $N/2$). We note, to a first

572: approximation, that if $r_3=0$ one needs to mark (in the process of calculating

573: $b_2$) one edge between every pair of neighboring transverse edges on one of the circulants

574: in order to mark the shortest odd circuits. This will make

575: $b_2\approx 1-M_\mathrm{trans}/N$.

576:
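As a rough numerical illustration of the $r_3=0$ estimate for Model 3, take $N=100$ and $M_\mathrm{trans}=10$. The following is only a sketch: it assumes that $b_2=1-m'/M$, with $m'$ the number of marked edges (the form used in the Model 2 calculation above), and that exactly one edge is marked per pair of neighboring transverse edges, so that $m'\approx M_\mathrm{trans}$.
\begin{align*}
  M &= N + M_\mathrm{trans} = 110,\\
  b_2 &\approx 1-\frac{M_\mathrm{trans}}{M} = 1-\frac{10}{110}\approx 0.91,
\end{align*}
which agrees with $1-M_\mathrm{trans}/N=0.9$ to leading order in $M_\mathrm{trans}/N$.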

577: \subsection{Real-world networks\label{sec:rwn}}

578: Physicists' network studies have, in the spirit of statistical mechanics,

579: emphasized the properties remaining when the system grows beyond any

580: limit. Bipartivity, as discussed above, is well defined for all system

581: sizes. Still it is a quantity that can potentially suffer from

582: finite-size effects (from the fact that not all real neighbors of all

583: actors in an empirically constructed social network are a part of the

584: graph) and is therefore preferably measured for large networks. Now

585: the problem is to find data for large-scale real-world networks of

586: social interaction. In general two methods have been successful for this

587: purpose---one either uses professional collaborations of some sort or

588: data from interaction over the Internet (either in Internet

589: communities~\cite{HEL,smith}, or through email exchange~\cite{ebel}).

590:

591: \subsubsection{Professional collaboration networks}

592: In the professional collaboration networks we study, the

593: vertices are professionals of some field (networks of scientists

594: and company directors are considered in this paper; the movie-actor

595: network is another frequently studied example), and the edges represent

596: that two actors have been involved in the same professional

597: collaboration. This is sometimes referred to as a ``one-mode''

598: representation of an affiliation network (as opposed to the bipartite

599: two-mode representation discussed in Sect.~\ref{sec:intro}).

600:

601: Professional collaboration networks are no doubt interesting in their

602: own right as accounts for the interaction dynamics of the respective

603: fields. Assuming that the formation of professional ties follows

604: similar principles as general human interaction, we can use

605: professional collaboration networks to draw conclusions about the

606: structure of more general social networks. However, in one respect (at

607: least) professional collaboration differs from general social

608: interactions: A collaboration tie does not necessarily imply a strong

609: personal acquaintance, but in these networks each collaboration

610: constitutes a fully connected cluster. This leads to a higher fraction

611: of short circuits than in, say, a friendship network.

612:
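The one-mode representation, in which each collaboration becomes a fully connected cluster, can be sketched as follows. This is a minimal illustration only; the function name \texttt{one\_mode\_projection} and the input format (one collection of actor labels per collaboration) are assumptions, not the format of the data sets used here.

```python
from itertools import combinations

def one_mode_projection(collaborations):
    """Project an affiliation network onto its actors (illustrative sketch).

    `collaborations` is an iterable of collections of actor labels, one
    per collaboration (e.g. one per paper or per company board).
    Returns the set of undirected edges as frozensets {a, b}.
    """
    edges = set()
    for members in collaborations:
        # Every pair of participants gets an edge, so each collaboration
        # becomes a fully connected cluster (clique) in the projection.
        for a, b in combinations(sorted(set(members)), 2):
            edges.add(frozenset((a, b)))
    return edges
```

A collaboration with three participants thus always contributes a triangle, which is why such networks contain many short odd circuits.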

613: One of the professional collaboration networks we use consists of

614: scientists who have uploaded manuscripts to the preprint repository

615: arxiv.org. Two scientists are linked if their names (identified by

616: surname and initials) appear together on at least one preprint. A

617: detailed description of this network can be found in

618: Ref.~\cite{newman3}. In the other professional collaboration network

619: the vertices represent company directors from the Fortune top 1000

620: list of companies in the USA in 2001. An edge (collaboration) in

621: this network means that two directors sit on the board of the same

622: company. A detailed description of this network can be found in

623: Ref.~\cite{davis}. The sizes of the networks are given in

624: Table~\ref{tab:b}.

625:

626: \subsubsection{Online interaction networks}

627:

628: In online interaction networks, the vertices are users of Internet

629: communities and an arc (A,B) is added if A contacts B, or

630: if A adds B to his/her list of friends~\cite{smith,HEL}. Another kind

631: of online interaction network is the email network~\cite{ebel}, where

632: an arc can be assigned if an email is sent, or if a person adds

633: another to his/her address book. Just as for professional collaboration

634: networks, one can argue that online interaction networks are

635: representative of general social networks. One can assume that new

636: contacts are formed through preference-matching searches to a larger

637: extent, and introduction by mutual friends to a lesser extent, than in

638: general friendship networks. Since the introduction of mutual friends

639: to each other is believed to be the major cause of high clustering

640: (large density of triangles, or, large transitivity)~\cite{newman1}

641: one can expect a lower clustering in networks of online interaction

642: (still the clustering in these networks seems to be finite in the

643: $N\rightarrow\infty$ limit~\cite{HEL}).

644:

645: The specific online interaction networks we consider are constructed

646: from the Internet communities nioki.com and pussokram.com. The

647: nioki.com data is described in Ref.~\cite{smith}. In this data an arc

648: (A,B) means that B is listed as a friend by A, which

649: allows A to see if B is online and send instant messages to B. In

650: the pussokram.com data the arcs correspond to communication between

651: the users. There are four different types of communication in this

652: specific network (all described in detail in Ref.~\cite{HEL}). We use

653: the networks obtained from two types of interaction (`messages,' which are like

654: ordinary emails within the community, and `guest book,' where one user

655: contacts another by writing in his/her guest book), and the network

656: of any of the four types. Network sizes can be found in

657: Table~\ref{tab:b}.

658:

659: Another large difference between the pussokram.com and nioki.com data

660: is that the former community has a very pronounced romantic profile,

661: encouraging flirtation and romantic correspondence. nioki.com also has a

662: search engine to ``trouve l'amour'' (find love), but that is all.

663:

664: Apart from the two Internet communities, we study another type of

665: online interaction network based on the flow of email. For this network

666: all in- and out-going email traffic to a server was logged for around

667: three months~\cite{ebel}. The server handles undergraduate students'

668: email accounts at Kiel University, Germany. Thus there are two

669: categories of vertices---internal vertices, whose activity is accurately

670: mapped; and external vertices, that only have edges leading to internal

671: vertices. In this study we restrict ourselves to the network of

672: internal-internal contacts. The reason we do not include external

673: contacts is that we would miss the (probably many) circuits containing

674: external-external edges, which would bias the bipartivity.

675:

676: \subsubsection{Networks from interviews and field surveys\label{sec:soc}}

677: Apart from the above networks, all obtained from databases, we also

678: measure the bipartivity of two networks obtained from interviews and

679: field surveys. The first data set is gathered by observations of

680: interaction between members of a university karate

681: club~\cite{zach}. We also study the network of acquaintance ties in a

682: prison~\cite{prison}. The outgoing arcs from A correspond to

683: prisoners listed by A in response to the question: ``What fellows on

684: the tier are you closest friends with?'' Due to their acquisition

685: methods these kinds of real-world networks have to be rather small. This

686: can, as mentioned, result in finite-size effects. On the other hand,

687: they, most likely, reflect more faithfully the structure of real

688: acquaintance networks.

689:

690: \section{Results}

691:

692: In this section we present the results for the test networks and the

693: measurements for the real-world social networks.

694:

695: \subsection{Test networks\label{sec:mod2}}

696:

697: As expected, both $b_1$ and $b_2$ are monotonically increasing as

698: functions of the $r_1$, $r_2$ and $r_3$ parameters of

699: (almost~\cite{note:not_really}) all our test networks (see

700:   Fig.~\ref{fig:mx}). This is encouraging and suggests that both

701: $b_1$ and $b_2$ are quite relevant measures of bipartivity.

702:

703: The Model 1 measurements shown in Fig.~\ref{fig:mx}(a) are made with the

704: model parameters $N=2\tilde{N}=100$ and $M=800$. We have checked many

705: other sizes too, but all have the characteristic appearance of

706: Fig.~\ref{fig:mx}(a)---a linear increase of $b_1$ and $b_2$ for larger

707: $r_1$ and a flatter slope for $r_1$ close to zero. This shape is

708: expected from the discussion in Sect.~\ref{sec:intro}---in networks

709: where a heterophilous preference is the only structure-inducing

710: force, only the strong preference limit gives a strong measurable

711: effect: Close to the ER limit $r_1\approx 0$, the original two

712: partitions will not be identified correctly; only when the different

713: partitions (to a large extent) have different signs will the bipartivity

714: be proportional to the strength of the heterophilous preference.

715:

716: As seen in Fig.~\ref{fig:mx}(b), Model 2 shows an almost linear

717: functional form of $b_{1,2}(r_2)$. In this case, triangles dominate

718: the odd circuits even at small values of $r_2$. Increasing $r_2$ will give

719: a proportional decrease of the number of triangles. Thus a linear

720: $r_2$ dependence of $b_2$ would be expected.

721:

722: Also Model 3 has linear $b_{1,2}$ vs.\ $r_3$ curves. The model

723: parameters used are $N=100$ and $M_\mathrm{trans}=10$. As mentioned in

724: Section~\ref{sec:mod}, we expect $b_2\approx 1-M_\mathrm{trans}/N$ for

725: $r_3=0$, which is confirmed in Fig.~\ref{fig:mx}(c).

726:

727: The measurements for both $b_1$ and $b_2$ are averaged over 100

728: network realizations. The XMC scheme for the $b_1$ quantity is run at

729: 24 temperatures in parallel, between temperatures $0.01$ and

730: $2$. Other simulation parameters are $t_\mathrm{avg}=4\times 10^5$,

731: $t_\mathrm{measure}=4$, $t_\mathrm{quench}=20$ and

732: $t_\mathrm{exch}=1000$. These are more modest parameter values than we

733: will use for the real-world networks, but the test networks are also

734: much smaller, and since the distributions of $b_1$ and $b_2$ are

735: (likely) symmetric, the network average helps to reduce the error.

736:
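For readers unfamiliar with exchange Monte Carlo, the idea of estimating $b_1=1-\mfr/M$ from the antiferromagnetic Ising ground state can be sketched in a few lines of Python. This is a generic replica-exchange sketch with single-spin Metropolis updates, not the XMC implementation used for the measurements: the function names, the default parameter values and the omission of the quench and averaging steps are all assumptions made for illustration.

```python
import math
import random

def frustrated_edges(edges, spin):
    """Number of edges whose endpoints carry the same spin."""
    return sum(1 for u, v in edges if spin[u] == spin[v])

def b1_exchange_mc(n, edges, temps, sweeps=200, t_exch=10, seed=0):
    """Estimate b1 = 1 - M_fr/M by exchange Monte Carlo (sketch).

    n: number of vertices (labelled 0..n-1); edges: list of pairs;
    temps: increasing list of temperatures, one replica per temperature.
    """
    rng = random.Random(seed)
    nbrs = [[] for _ in range(n)]
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    spins = [[rng.choice((-1, 1)) for _ in range(n)] for _ in temps]
    energy = [frustrated_edges(edges, s) for s in spins]
    best = min(energy)
    for sweep in range(1, sweeps + 1):
        for k, T in enumerate(temps):
            s = spins[k]
            for _ in range(n):
                v = rng.randrange(n)
                same = sum(1 for w in nbrs[v] if s[w] == s[v])
                # Flipping v frustrates its currently satisfied edges
                # and satisfies its currently frustrated ones.
                dE = (len(nbrs[v]) - same) - same
                if dE <= 0 or rng.random() < math.exp(-dE / T):
                    s[v] = -s[v]
                    energy[k] += dE
            best = min(best, energy[k])
        if sweep % t_exch == 0:
            # Exchange trials between neighbouring temperatures.
            for k in range(len(temps) - 1):
                d_beta = 1.0 / temps[k] - 1.0 / temps[k + 1]
                d_energy = energy[k] - energy[k + 1]
                if rng.random() < math.exp(min(0.0, d_beta * d_energy)):
                    spins[k], spins[k + 1] = spins[k + 1], spins[k]
                    energy[k], energy[k + 1] = energy[k + 1], energy[k]
    return 1.0 - best / len(edges)
```

The energy is simply the number of frustrated edges, so the lowest energy seen by any replica gives the estimate of $\mfr$. For a single triangle at least one edge is always frustrated, and the sketch returns $b_1=2/3$.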

737: \begin{figure}

738:   \centering{\resizebox*{8cm}{!}{\includegraphics{mx.eps}}}

739:   \caption{The bipartivity measures versus the model parameters of the

740:   three models defined in Section~\protect\ref{sec:mod}. (a) shows the

741:   result for Model 1, (b) shows the result for Model 2, and (c) shows

742:   the result for Model 3. All error bars would be smaller than the

743:   symbol size. The monotonic growth of the bipartivity measures shows

744:   that the measures behave as expected.}

745:   \label{fig:mx}

746: \end{figure}

747:

748: \begin{table*}

749:   \caption{Sizes, clustering coefficients $C$, densities of squares $D$,

750:   and bipartivity measures $b_1$ and $b_2$ for real-world social networks.}

751: \label{tab:b}

752: \begin{ruledtabular}

753: \begin{tabular}{l|rrr|dddd|dddd}

754:   network & $N$ & $M_\mathrm{dir}$ & $M$ & C_\mathrm{dir} & C &

755:   D_\mathrm{dir} & D & b_1^\mathrm{dir} & b_1 &

756:   b_2^\mathrm{dir} & b_2 \\\hline

757:   all contacts & $29{\:}341$ & $174{\:}662$ & $115{\:}684$& 0.012 &

758:   0.0060 & 0.016 & 0.017 & 0.859 & 0.860 & 0.948 & 0.928\\

759:   messages & $20{\:}691$ & $73{\:}346$ & $52{\:}435$& 0.0052& 0.0061 &

760:   0.0081 & 0.0061  & 0.897 & 0.892 & 0.984 & 0.964\\

761:   guestbook & $21{\:}545$ & $76{\:}257$ & $55{\:}076$& 0.014 & 0.014 &

762:   0.015 & 0.021  & 0.863 & 0.889 & 0.943 & 0.965\\

763:   nioki.com & $50{\:}259$ & $405{\:}742$ & $239{\:}452$& 0.0076 &

764:   0.0065 & 0.016 & 0.013  & 0.842 & 0.855 & 0.956 & 0.975\\

765:   emails & 637 & 554 & 443& 0.11 & 0.16 & 0.071 & 0.14 & 0.944 & 0.944

766:   & 0.971 & 0.941 \\

767:   arxiv.org & $52{\:}909$ & $\times$ & $490{\:}600$ & \times & 0.45 &

768:   \times & 0.35 & \times & 0.630 & \times & 0.623\\

769:   directors & $7{\:}475$ & $\times$ & $48{\:}899$ & \times & 0.21 &

770:   \times & 0.37 & \times & 0.549 & \times & 0.507\\

771:   karate club & 34 & $\times$ & 78& \times & 0.26 & \times & 0.26  &

772:   \times & 0.782 & \times & 0.782 \\

773:   prison & 64 & 182 & 85& 0.19 & 0.31 & 0.089 & 0.14 & 0.786 & 0.878 &

774:   0.918 & 0.847

775: \end{tabular}

776: \end{ruledtabular}

777: \end{table*}

778:

779: \subsection{Real-world social networks\label{sec:real_res}}

780:

781: Now we turn to the result for the bipartivity measures of real-world

782: networks. The values are presented in Table~\ref{tab:b}. For

783: comparison we also give values for the clustering coefficient (density

784: of triangles) $C$ and the density of squares $D$ in both directed and

785: undirected versions~\cite{note:cd}. Undirected networks are constructed

786: by taking the symmetric closure. At first glance at the table

787: we arrive at the pleasing conclusion that the bipartivity for the

788: pussokram.com networks is very high (as expected from a network of

789: romantic interaction of mostly heterosexuals). But disappointingly,

790: the bipartivity measures show similarly high values for the nioki.com

791: and email networks. This can be explained by the fact that the nioki.com data,

792: just like the pussokram.com data, have very low $C$ and $D$ values, and

793: presumably very few circuits at all. Now branches (subgraphs without

794: circuits that can be isolated by cutting one edge) do not lower

795: either $b_1$ or $b_2$, regardless of the

796: gender of the agents. The email network does have a high clustering, but

797: still rather high bipartivity. The reason is that the email network is

798: rather heavily fragmented and contains many isolated subnetworks of

799: two vertices and one edge, and three vertices and two edges. Such

800: subnetworks do not affect the clustering coefficient but tend to

801: increase the bipartivity measures~\cite{note:improvement}.

802:
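The undirected versions of the directed networks in Table~\ref{tab:b} are obtained by ignoring the direction of the arcs, so that a reciprocated pair of arcs collapses into one edge. A minimal Python sketch of this step is given below; the function name \texttt{undirected\_version} and the representation of arcs as pairs of labels are assumptions made for illustration.

```python
def undirected_version(arcs):
    """Undirected edge set of a directed network (illustrative sketch).

    Each arc (a, b) and its reverse (b, a) map to the same undirected
    edge {a, b}; self-loops are dropped.
    """
    return {frozenset((a, b)) for a, b in arcs if a != b}
```

Because reciprocated arcs are merged, the number of undirected edges $M$ lies between $M_\mathrm{dir}/2$ and $M_\mathrm{dir}$ in this sketch.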

803: The collaboration networks consist of a number of fully connected

804: clusters (corresponding to a specific collaboration) that are

805: interconnected. It is thus natural that we see low bipartivity and a

806: high density of short circuits. The lower bipartivity values for the

807: company director network can be explained by the larger average size of such

808: fully connected clusters: The average number of vertices per

809: collaboration is $9.5$ for the corporate director network and $2.5$

810: for the scientific collaboration data~\cite{davis,newman3}.

811:

812: The two small networks constructed from field surveys (the ``karate

813: club'' and ``prison'' networks of Table~\ref{tab:b}, discussed in

814: Section~\ref{sec:soc}) show mid-range bipartivities and relatively high

815: values of $C$ and $D$. From the above discussion we can expect that

816: the bipartivity of large, real, acquaintance networks is somewhere

817: between those of the collaboration networks and the Internet community

818: networks (because they probably have higher clustering than Internet

819: community networks, and lower number of fully connected clusters than

820: the collaboration networks). Encouragingly enough, this is exactly what

821: we see in Table~\ref{tab:b}. Of course, the very small system sizes

822: might affect the results, but that the bipartivity measures of

823: real-world acquaintance networks would be close to either the upper or

824: lower limits seems hard to believe.

825:

826: We conclude this section with a note on the parameters for the XMC

827: optimization. The measurements of $b_1$ for all real-world networks

828: (except the nioki.com data, where we study the convergence more

829: carefully) are done just once with the following simulation

830: parameters: $N_T=24$ (with temperatures from $0.002$ to $5$),

831: $t_\mathrm{avg}=1\times 10^7$, $t_\mathrm{measure}=16$,

832: $t_\mathrm{quench}=40$ and $t_\mathrm{exch} = 2\times 10^4$.

833:

834: \section{Summary and discussion}

835:

836: This paper concerns the quantification of the network structure

837: `bipartivity'---how close to bipartite a given graph is. We propose

838: two measures for this quantity. One quantity, $b_1$, is based on the

839: optimal two-coloring of the network---or, equivalently, the ground

840: state of the antiferromagnetic Ising model on the network. Calculating the

841: exact value of this quantity (which has been used in different roles

842: elsewhere) is an NP-complete problem and thus in general not

843: feasible. Instead we seek an approximate solution by a

844: simulated annealing approach. The simulated annealing is based on the

845: exchange Monte Carlo scheme. We argue that this unorthodox

846: minimization method helps us avoid local minima of the energy

847: landscape of the antiferromagnetic Ising model. Furthermore we develop

848: a measure $b_2$ based on the count of odd circuits that, for almost

849: all networks, is calculable in polynomial time.

850:

851: We propose three different random graph test models where one can

852: interpolate between arguably non-bipartite and bipartite graphs by

853: tuning a control parameter. Both our bipartivity measures are shown to

854: increase monotonically with tuning the control parameters towards the

855: bipartite extreme. From this we conclude that the bipartivity measures

856: really quantify the notion of bipartivity.

857:

858: By considering example networks we infer that bipartivity is a

859: structure that cannot be measured by currently popular structural

860: measures, such as the clustering coefficient. At the same time any

861: sensible quantification of bipartivity probably has to have a negative

862: correlation with the clustering coefficient for most networks (with

863: exceptions for exotic cases like Fig.~\ref{fig:ex}(a))---so, in that

864: case bipartivity and clustering are not independent.

865:

866: We measure $b_1$ and $b_2$ of a number of real-world networks,

867: constructed from online interaction, professional collaborations, and

868: field surveys. As expected, we see high bipartivity values for data

869: from the Internet community pussokram.com, where romantic contacts

870: are encouraged, and hence a high degree of heterophilous interaction

871: is expected. We also see the expected low bipartivity values for the

872: professional collaboration and empirical acquaintance networks we

873: study. Disappointingly we cannot use our bipartivity measures to

874: distinguish between the networks driven by romantic or friendship (or

875: professional) contacts. To do this, other structures and the network

876: sizes have to be taken into account in a more elaborate analysis (that

877: is beyond the scope of this study).

878:

879: So far our examples of networks with high bipartivity have been

880: romantic networks and networks of sexual contacts. Network-based

881: studies of sexually transmitted diseases~\cite{lea} are a potentially

882: interesting area for bipartivity measures, as the transmission rates

883: for homosexual and heterosexual contacts differ~\cite{anma}. Apart

884: from romantic and sexual networks, there are other areas where the

885: bipartivity measure may prove useful: One can consider a trade network

886: where some agents are more or less pronounced sellers and others are

887: primarily buyers (cf.\ Ref.~\cite{white}); such networks would not

888: have a neutral bipartivity. Another application is for the

889: `genealogical' network of a disease outbreak: Some contagious diseases

890: have a relatively stable duration between when an individual is

891: infected and when he or she becomes infectious. Epidemics of these

892: types of diseases can therefore roughly be divided into different

893: generations of infected individuals~\cite{anma}. A network

894: consisting of possible edges of infections, for an outbreak of this

895: type of disease, should therefore have very few odd-length

896: circuits. The reason is that the infection is only transmitted between

897: succeeding generations, which generates only circuits of even length

898: (in the symmetric closure of the network). When reconstructing the

899: paths this kind of disease has taken in a population, a minimization

900: of the bipartivity measures can be a method for excluding redundant

901: infectious edges.

902:
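The argument that generation-structured outbreaks give bipartite networks can be made concrete with a small Python sketch. It colors every individual by the parity of its generation and checks that each possible infection edge joins opposite colors; the function name \texttt{is\_generation\_bipartite} and the dictionary representation of the generations are assumptions made for illustration.

```python
def is_generation_bipartite(generation, edges):
    """Check that every edge joins generations of opposite parity.

    generation: dict mapping each individual to its (integer) generation;
    edges: iterable of pairs (infector, infected).
    """
    return all((generation[u] + generation[v]) % 2 == 1 for u, v in edges)
```

If all edges connect succeeding generations the check succeeds, the parity of the generation is a valid two-coloring, and every circuit has even length. An edge within one generation makes the check fail.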

903: We conclude with an analogy to linear algebra: we have identified a new

904: dimension (structure) and proposed base vectors (measures), that

905: unfortunately are not orthogonal to the other dimensions.

906:

907: \section*{Acknowledgements}

908: We would like to thank Niklas Angemyr, Stefan Bornholdt, Gerald Davis,

909: Holger Ebel, Michael Lokner, Stefan Praszalowicz, and Christian

910: Wollter for help with data acquisition; and Johan Giesecke, James

911: Moody, Mats Nyl\'{e}n, and Pontus Svenson, for comments and

912: suggestions. P.H.\ was partly supported by the Swedish Research

913: Council through contract no.\ 2002-4135. F.L.\ was supported

914: by the National Institute of Public Health. C.R.E.\ was supported by

915: the Bank of Sweden Tercentenary Foundation. B.J.K.\ was supported by

916: the Korea Science and Engineering Foundation through Grant No.\

917: R14-2002-062-01000-0.

918:

919: \appendix

920:

921: \begin{figure}

922:   \centering{\resizebox*{8cm}{!}{\includegraphics{m2.eps}}}

923:   \caption{Marking of edges (in matrix representation) while

924:     calculating the $b_2$ quantity for a fully connected graph. `$-1$'

925:   means that $\nu$ at that position is decreased by one unit, `$=0$'

926:   means that $\nu=0$ at that position.

927:   }

928:   \label{fig:ma}

929: \end{figure}

930:

931: \section{The lower bound of the measure $b_2$\label{sec:bound}}

932:

933: In this Appendix we argue that, in the $N\rightarrow \infty$ limit,

934: the lower bound for $b_2$ is $1/2$ (just like $b_1$). First we

935: conjecture that the minimal value for $b_2$, just as for $b_1$, is

936: attained for complete graphs. (This will be further motivated below.)

937:

938: To assess $b_2$ for complete graphs, we note that~\cite{note:sigsum}

939: \begin{subequations}

940: \begin{eqnarray}

941:   \Sigma(C_n)&=&\sum_{\mathrm{odd}\;3\leq i\leq n}

942:   \frac{N!}{2(N-i)!}~\Rightarrow\label{eq:sigsum}\\

943:   \Sigma(C_3)&=&\frac{N(N-1)(N-2)}{2}\geq\nonumber\\&\geq&

944:   \frac{N(N-1)}{2}=M~,

945: \end{eqnarray}

946: \end{subequations}

947: so $\hat{n}=3$, which gives $\nu=N-2$ for each edge.

948:

949: Now we apply the marking procedure of Sec.~\ref{sec:def}. Marking an

950: edge $(u,v)$ makes $\nu(u,v)= \nu(v,u)=0$. Furthermore, $\nu$ of every edge

951: $(u,w)$ and $(v,w)$ ($w\neq u,v$) will be decreased by one since the

952: triangle $\{u,v,w\}$ now contains a marked edge. The discussion will

953: be simplified by considering a matrix representation of $\nu

954: (u,v)$. Marking $(u,v)$ sets $\nu(u,v)=\nu(v,u)=0$ and decreases the

955: $u$'th and $v$'th columns, and $u$'th and $v$'th rows by one (an

956: example is given in Fig.~\ref{fig:ma}(a)). Marking another edge

957: $(u',v')$ ($u'$ and $v'$ are different from both $u$ and $v$,

958: otherwise $\nu(u',v')$ would not be maximal) will have the same effect

959: as marking the first. For positions like $(u,v')$ the original $\nu$

960: is decreased by 2 (see Fig.~\ref{fig:ma}(b)), since the edge has lost the

961: two passing triangles $\{u,u',v'\}$ and $\{v',u,v\}$. Continuing this

962: process we see that it takes $N/2+O(1)$ markings for $\nu$ of each

963: edge to be decreased by two units, and thus $m'=N^2/4+O(N)$ markings to

964: make $\nu=0$ for all edges. This gives $b_2=1/2$ in the $N\rightarrow

965: \infty$ limit. Since the appropriateness of $b_2$ as a bipartivity

966: measure is not really dependent on the limit values, we will not give

967: a rigorous proof that the correction is of a lower order for all

968: levels of the marking procedure (one level is the $N/2+O(1)$ edges

969: needed to be marked for $\nu$ to be decreased by at least two units

970: for each edge).

971:
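The counting behind this estimate for complete graphs can be written out explicitly. The following is a heuristic sketch that neglects the lower-order corrections mentioned above and assumes $b_2=1-m'/M$, the form used for the triangular lattice in Section~\ref{sec:mod}. Since $\nu=N-2$ initially and each level lowers $\nu$ by two units,
\begin{align*}
  \text{number of levels} &\approx \frac{N-2}{2},\\
  m' &\approx \frac{N-2}{2}\left(\frac{N}{2}+O(1)\right)
      =\frac{N^2}{4}+O(N),\\
  b_2 &= 1-\frac{m'}{M}
      = 1-\frac{N^2/4+O(N)}{N(N-1)/2}
      \;\longrightarrow\;\frac{1}{2}\quad (N\rightarrow\infty).
\end{align*}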

972: \begin{figure}

973:   \centering{\resizebox*{7.5cm}{!}{\includegraphics{co.eps}}}

974:   \caption{The current value of $b_1$ (at the lowest-temperature level

975:     of the cooling) as a function of running time for ten independent

976:     measurements of the directed version of the nioki.com data.

977:   }

978:   \label{fig:co}

979: \end{figure}

980:

981: Now we argue that $b_2$ takes its minimal value for complete

982: graphs. First we note that the number of circuits of length $n$ per

983: edge, for any $n$, is largest in a complete graph~\cite{review2}. So

984: if we set $\hat{n}$ arbitrarily and discard circuits of length $\leq

985: n$ in the calculation of $\nu(v)$, the fully connected graph would

986: give the highest $m'$ value and thus the lowest bipartivity

987: measure. The strongest candidate for a lower bipartivity measure than

988: that of a fully connected graph would thus be a graph such that

989: $\Sigma(C_n)< 3M$ and $\Sigma(C_{n+2})$ is as big as possible for some

990: $n$. But removing the edges needed from a fully connected

991: graph for $\Sigma(C_n)< 3M$ to hold not only reduces the contribution

992: to $\nu$ from circuits of length $n$ but also from circuits of length

993: $n+2$ to a similar extent. If one performs the approximate marking

994: procedure outlined above for circuits of length five, one starts from

995: $\nu=(N-2)(N-3)(N-4)$ and it takes $N/2+O(1)$ markings to decrease

996: every $\nu$ by at least $2N^2$. This means that the number of edges

997: needed to be marked to make $\nu = 0$ for every edge is the same if

998: circuits of length five are considered. It also means that a graph as

999: outlined above (with $\Sigma(C_n)< 3M$ and $\Sigma(C_{n+2})$ as big

1000: as possible) probably does not have a lower $b_2$ than a complete graph.

1001:

1002: To summarize, the $b_2$ measure lies in the interval $[1/2,1]$ in the

1003: $N\rightarrow \infty$ limit. The finite-size corrections to $b_2$ for

1004: fully connected graphs, however, turn out to make $b_2$ slightly less

1005: than $1/2$.

1006:

1007: \section{Convergence of the simulated annealing\label{sec:simann}}

1008:

1009: To analyze the convergence of the simulated annealing scheme we run

1010: ten independent calculations of the $b_1$ quantity (with the same

1011: parameter values as in Sect.~\ref{sec:real_res}). The individual time

1012: evolutions of $b_1$ (at the lowest temperature $T=0.002$) for the

1013: different runs are shown in Fig.~\ref{fig:co}. We note that already

1014: after the first quench $b_1$ is only $3\%$ away from the value at the

1015: end of the run, and after 50 time steps $b_1$ is within $0.5\%$ of the value

1016: after $1\times 10^7$ time steps. We note that there is no way of

1017: constructing a statistically valid confidence interval for the true

1018: $b_1$ value since an arbitrarily complex energy landscape could have a

1019: global minimum with a basin of attraction of measure zero. There are

1020: however indications that this is seldom a major problem, at least not

1021: for the bisection problem~\cite{jerrum}.

1022:

1023: An interesting observation from Fig.~\ref{fig:co} is the step-like

1024: structure. This is a result of the exchange trials: After $t\approx

1025: 100$ a local minimum has been found, but at the temperature in

1026: question the system is in principle stuck in a confined part of the

1027: configuration space, and cannot enter lower lying energy

1028: valleys. On the time scale $t = 10^5$ there is another jump in the

1029: $b_1$ value. This happens because replicas from other parts

1030: of the configuration space reach the lowest level. At around

1031: $t=10^6$ the current highest $b_1$ value (lowest energy) reaches

1032: another plateau. At this time, each replica should have covered the

1033: whole temperature range several times. This second plateau has two

1034: encouraging implications: Firstly, that the correct value of $b_1$

1035: probably is not very far off the measured value. Secondly, that the

1036: exchange steps really are helpful. If one wants to run this algorithm

1037: more efficiently the $t_\mathrm{exch}$ we use is far too large (but

1038: beneficial for separating the time scales in the discussion

1039: above). Ideally $t_\mathrm{exch}$ should probably be chosen to be of

1040: the same order as the first jump (from the regular Monte Carlo

1041: steps)---in the nioki.com network (displayed in Fig.~\ref{fig:co})

1042: this would be $t\approx 100$.

1043:

1044: \begin{thebibliography}{99}

1045: \bibitem{review} S.~H.\ Strogatz, Nature (London) \textbf{410}, 268

1046:   (2001); R.\ Albert and A.-L.\ Barab\'{a}si,  Rev.\ Mod.\

1047:   Phys.\ \textbf{74}, 47 (2002); S.~N.\ Dorogovtsev and J.~F.~F.\

1048:   Mendes, Adv.\ Phys.\ \textbf{51}, 1079 (2002).

1049: \bibitem{review2}   M.\ E.\ J.\ Newman,  SIAM Rev., (to appear).

1050: \bibitem{WS} D.~J.\ Watts and S.~H.\ Strogatz, Nature (London) \textbf{393},

1051:   440 (1998).

1052: \bibitem{sf} A.-L.\ Barab\'{a}si and R.\ Albert, Science

1053:   \textbf{286}, 509 (1999).

1054: \bibitem{assmix} M.\ E.\ J.\ Newman, Phys.\ Rev.\ Lett.\ \textbf{89},

1055:   208701 (2002).

1056: \bibitem{grid} G.\ Caldarelli, R.\ Pastor-Satorras, and A.\

1057:   Vespignani, ``Cycles structure and local ordering in complex

1058:   networks'' e-print arXiv:cond-mat/0212026 (unpublished).

1059: \bibitem{jeong} H.\ Jeong, B.\ Tombor, R.\ Albert, Z.\ N.\ Oltvai, and

1060:   A.-L.\ Barab\'{a}si, Nature \textbf{407}, 651 (2000).

1061: \bibitem{liljeros} F.~Liljeros, C.~R.\ Edling, L.~A.~N.\ Amaral,

1062:   H.~E.\ Stanley, and Y.~{\AA}berg, Nature (London) \textbf{411}, 907

1063:   (2001).

1064: \bibitem{lea} F.~Liljeros, C.~R.\ Edling, and L.~A.~N.\ Amaral, Microbes

1065:   and Infection, (2003, to appear).

1066: \bibitem{partner} P.\ S.\ Bearman, J.\ Moody, and K.\ Stovel,

1067:   ``Chains of affection: The structure of adolescent romantic and

1068:   sexual networks'', Institute for Social and Economic Research and

1069:   Policy, report no.\ 02-04 (2002, unpublished).

1070: \bibitem{freeman} When the type of every vertex is known, this

1071:   structure can be measured by Freeman's segregation index $S$, which

1072:   is (roughly speaking) the fraction of cross-type edges missing in a

1073:   graph, compared with a completely random graph---a graph that is

1074:   close to bipartite would then have $S<0$. L.\ C.\ Freeman,

1075:   Sociological Methods and Research \textbf{6}, 411 (1978). See also:

1076:   J.\ C.\ Mitchell, Connections \textbf{2}, 9 (1978);  L.\ C.\

1077:   Freeman, Connections \textbf{2}, 13 (1978).

1078: \bibitem{HEL} P.\ Holme, C.\ R.\ Edling, and F.\ Liljeros, ``Structure

1079:   and Time-Evolution of the Internet Community pussokram.com'',

1080:   e-print arXiv:cond-mat/0210514 (unpublished).

1081: \bibitem{hope} It should be noted that many NP-hard optimization

1082:   problems display phase transitions between ``easy'' and ``hard''

1083:   regimes, e.g.\ the 3-coloring problem is known to be hard in the

1084:   small-world regime of the WS model~\cite{WS}. T.\ Walsh, in

1085:   \textit{Proceedings of the 16th International Joint Conference on

1086:   Artificial Intelligence} edited by T.\ Dean (Morgan Kaufmann, San

1087:   Francisco, 1999). For general references, see e.g.: P.\ Cheeseman,

1088:   B.\ Kanefsky, and W.~M.\ Taylor, in \textit{Proceedings of IJCAI-91}

1089:   edited by J.\ Mylopoulos and R.\ Reiter (Kaufmann, San Mateo, 1991),

1090:   pp.\ 331--337; T.\ Hogg, B.\ A.\ Huberman, and C.\ P.\ Williams,

1091:   Artificial Intelligence \textbf{88}, 1 (1996).

1092: \bibitem{jerrum} M.\ Jerrum and G.\ Sorkin, ECS-LFCS-93-260 (1993,

1093:   unpublished).

1094: \bibitem{schreiber} G.\ R.\ Schreiber and O.\ C.\ Martin, SIAM J.\

1095:   Optim.\ \textbf{10}, 231 (1999).

1096: \bibitem{FA} Y.\ Fu and P.~W.\ Anderson, J.\ Phys.\ A: Math.\ Gen.\

1097:   \textbf{19}, 1605 (1986).

1098: \bibitem{alava} M.~J.\ Alava, P.~M.\ Duxbury, C.~F.\ Moukarzel, and

1099:   H.\ Rieger in \textit{Phase Transitions and Critical Phenomena},

1100:   Vol.~18, edited by C.~Domb and J.~L.\ Lebowitz (Academic

1101:   Press, London, 2001), pp.\ 143--317.

1102: \bibitem{karp} R.~M.\ Karp in \textit{Complexity of Computer

1103:     Computations}, edited by R.~E.\ Miller and J.~W.\ Thatcher (Plenum

1104:     Press, New York, 1972), pp.\ 85--103.

1105: \bibitem{bw} A.~Barrat and M.~Weigt, Eur.\ Phys.\ J.\ B \textbf{13},

1106:   547 (2000).

1107: \bibitem{spin}

1108:   See e.g.:

1109:   M.~Gitterman, J.\ Phys.\ A: Math.\ Gen.\ \textbf{33}, 8373 (2000);

1110:   P.~Svenson, Phys.\ Rev.\ E \textbf{64}, 036122 (2001);

1111:   B.~J.\ Kim, H.\ Hong, P.\ Holme, G.\ S.\ Jeon, P.\ Minnhagen, and

1112:   M.\ Y.\ Choi, Phys.\ Rev.\ E \textbf{64}, 056135 (2001);

1113:   C.~P.\ Herrero, Phys.\ Rev.\ E \textbf{65}, 066110 (2002); A.\

1114:   Aleksiejuk, J.~A.\ Holyst, and D.\ Stauffer,  Physica A

1115:   \textbf{310}, 260 (2002); G.\ Bianconi, ``Mean field solution of the

1116:   Ising model on a Barabasi-Albert network'', e-print

1117:   arXiv:cond-mat/0204455 (unpublished); A.\ Aleksiejuk-Fronczak,

1118:   ``Microscopic model for the logarithmic size effect on the Curie

1119:   point in Barab\'{a}si-Albert networks'', e-print

1120:   arXiv:cond-mat/0206027 (unpublished); D.\ Boyer and O.\ Miramontes,

1121:   ``Interface Motion and Pinning in Small World Networks'', e-print

1122:   arXiv:cond-mat/0210352 (unpublished); K.\ Medvedyeva, P.\ Holme, P.\

1123:   Minnhagen, and B.\ J.\ Kim, ``Dynamic critical behavior of the

1124:   \textsl{XY} model in small-world networks'', e-print

1125:   arXiv:cond-mat/0301510 (unpublished).

1126: \bibitem{socstatmech}

1127:   D.\ B.\ Bahr and E.\ Passerini, J.\ Math.\ Sociol.\ \textbf{23}, 1

1128:   (1998); D.\ B.\ Bahr and E.\ Passerini, J.\ Math.\ Sociol.\

1129:   \textbf{23}, 29 (1998); S.\ N.\ Durlauf, Proc.\ Natl.\ Acad.\ Sci.\

1130:   USA \textbf{96}, 10582 (1999); H.\ P.\ Young, in \textit{The Economy

1131:   as an Evolving Complex System} edited by L.\ E.\ Blume and S.\ N.\

1132:   Durlauf, (Oxford University Press, Oxford, 2003).

1133: \bibitem{barahona} For an interesting discussion of this problem in

1134:   the somewhat more complex Ising spin glass model, see: F.~Barahona,

1135:   J.\ Phys.\ A: Math.\ Gen.\ \textbf{15}, 3241 (1982).

1136: \bibitem{simann} S.\ Kirkpatrick, C.\ D.\ Gelatt, and M.\ P.\ Vecchi,

1137:   Science \textbf{220}, 671 (1983).

1138: \bibitem{xmc} K.~Hukushima and K.~Nemoto, J.\ Phys.\ Soc.\ Jpn.\

1139:   \textbf{65}, 1604 (1996).

1140: \bibitem{intro} See any introductory text on graph theory, for

1141:   example: A.\ Tucker, \textit{Applied Combinatorics}, 3rd ed.\ (Wiley,

1142:   New York, 1995), p.~31.

1143: \bibitem{ww} L.~R.\ Walker and R.\ E.\ Walstedt, Phys.\ Rev.\ B

1144:   \textbf{22}, 3816 (1980).

1145: \bibitem{johnson} D.\ B.\ Johnson, SIAM J.\ Comput.\ \textbf{4}, 77

1146:   (1975).

1147: \bibitem{note:alt} The intuitive way to find a least upper bound might

1148:   be to search at two levels simultaneously ($\bar{n}$ and

1149:   $\bar{n}-2$) and decrease the bound $\bar{n}\mapsto\bar{n} -2$ when

1150:   $\Sigma_{\bar{n} -2}\geq M$ ($\Sigma_{\bar{n} -2}$ denotes the sum

1151:   of the lengths of all circuits shorter than or equal to $\bar{n}

1152:   -2$). This would slow down the computation considerably since our

1153:   modified Johnson's algorithm mostly finds circuits of the length of

1154:   the search depth $\bar{n}$, and thus it takes a long time to

1155:   increase $\Sigma_{\bar{n} -2}$.

1156: \bibitem{note:bip} Consider $K_{N/2,N/2}=(V,U,E)$ where $V$ and $U$

1157:   are the two vertex sets. We write a circuit as a $k$-tuple

1158:   $(v_1,u_1,\cdots,v_{k/2},u_{k/2})$ where $v\in V$ and $u\in U$. Then

1159:   there are $(N/2)\,(N/2)\,\cdots\, [N/2-(k/2-1)]\,[N/2-(k/2-1)] =

1160:   [(N/2)!/ (N/2-k/2)!]^2$ distinct $k$-tuples. For a circuit, neither

1161:   the choice of start vertex $v_1$ (among its $k/2$ vertices in $V$) nor

1162:   the direction matters, so each circuit is represented by $k$ tuples.

1163:   Dividing by $k$ gives the number of circuits of length $k$ in $K_{N/2,N/2}$; for example, $K_{2,2}$ has $(2!/0!)^2/4=1$ circuit of length four.

1164: \bibitem{SCC} A.\ V.\ Aho, J.\ E.\ Hopcroft, and J.\ D.\ Ullman,

1165:   \textit{The Design and Analysis of Computer Algorithms}

1166:   (Addison-Wesley, Reading, 1974), pp.\ 189--195.

1167: \bibitem{ER} P.~Erd\"{o}s and A.~R\'{e}nyi, Publ.\ Math.\ Inst.\

1168:   Hung.\ Acad.\ Sci.\ \textbf{5}, 17 (1960).

1169: \bibitem{nws} M.\ E.\ J.\ Newman, S.\ H.\ Strogatz, and D.\ J.\ Watts,

1170:   Phys.\ Rev.\ E \textbf{64}, 026118 (2001).

1171: \bibitem{WAHO} G.~H.\ Wannier, Phys.\ Rev.\ \textbf{79}, 357 (1950);

1172:   R.~M.~F.\ Houtappel, Physica (Amsterdam) \textbf{16}, 425 (1950).

1173: \bibitem{smith} R.\ Smith, ``Instant Messaging as a Scale-Free

1174:   Network'', e-print arXiv:cond-mat/0206378 (unpublished).

1175: \bibitem{ebel} H.\ Ebel, L.\ I.\ Mielsch, and S.\ Bornholdt, Phys.\

1176:   Rev.\ E \textbf{66}, 035103 (2002).

1177: \bibitem{newman3} M.\ E.\ J.\ Newman, Phys.\ Rev.\ E \textbf{64},

1178:   016131 (2001).

1179: \bibitem{davis} G.\ F.\ Davis, M.\ Yoo, and W.\ E.\ Baker, ``The small

1180:   world of the corporate elite''; preprint, University of Michigan

1181:   Business School (2001).

1182: \bibitem{newman1} M.\ E.\ J.\ Newman, Phys.\ Rev.\ E \textbf{64},

1183:   025102 (2001).

1184: \bibitem{zach} W.\ Zachary, Journal of Anthropological Research

1185:   \textbf{33}, 452 (1977).

1186: \bibitem{prison} J.\ MacRae, Sociometry \textbf{23}, 360 (1960).

1187: \bibitem{note:not_really} Actually the $b_1$ value for Model 1 is

1188:   $0.2\%$ lower (around three standard deviations) for $r_1=0.1$ than

1189:   for $r_1=0$. We will not speculate about the reason for this, since the

1190:   effect is small and the overall picture is clear.

1191: \bibitem{note:cd} Denote the number of representations of

1192:   (directed or undirected) circuits of length $n$ by $c(n)$ and the

1193:   number of representations of paths of length $n$ by $p(n)$. (By

1194:   representations we mean different ways of listing adjacent vertices;

1195:   so, for example, a triangle has six representations.) Then we can

1196:   define $C=c(3)/p(3)$ and $D=c(4)/p(4)$. For more detailed

1197:   definitions, see Refs.~\cite{bw} and \cite{HEL}.

1198: \bibitem{note:improvement} A potential improvement would be to measure

1199:   $b_1$ and $b_2$ on the 2-core (the maximal subgraph with minimal

1200:   degree 2) of $G$. This would eliminate circuit-free subgraphs that

1201:   contain no information about the degree of heterophilous preference

1202:   among the agents forming the network.

1203: \bibitem{anma} R.\ M.\ Anderson and R.\ M.\ May, \textit{Infectious

1204:     diseases of humans} (Oxford University Press, Oxford, 1991).

1205: \bibitem{white} H.\ C.\ White, American Journal of Sociology

1206:   \textbf{87}, 517 (1981).

1207: \bibitem{note:sigsum} In $K_N$, the number of circuits of length

1208:   $i$ is the number of $i$-permutations, $N!/(N-i)!$, divided by $2i$ (a factor

1209:   $i$ to compensate for the over-counting since a circuit is

1210:   independent of starting vertex; a factor 2 to compensate for the

1211:   double counting of the two directions). For $K_N$ the contribution

1212:   of circuits of length $i$ to the sum is $i$ times the number of

1213:   them; this gives Eq.~(\ref{eq:sigsum}).
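  As a quick consistency check of this counting (an illustrative worked
  example; the choice $N=4$ is ours and is made only for concreteness),
  $K_4$ should contain four triangles and three circuits of length four,
  and indeed
  \begin{equation*}
    \frac{4!}{(4-3)!\,(2\cdot 3)}=4,\qquad
    \frac{4!}{(4-4)!\,(2\cdot 4)}=3 ,
  \end{equation*}
  so that circuits of length three and four contribute $3\cdot 4=12$ and
  $4\cdot 3=12$, respectively, when each circuit is weighted by its length.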

1214:   \end{thebibliography}

1215: \end{document}

1216: