0602:physics0602063/lj.tex

1: \documentclass[twocolumn,a4paper,aps,pre,preprintnumbers]{revtex4}

2: %\usepackage[pdftex]{graphicx}

3: \usepackage{graphicx}

4: \usepackage{amssymb,amsmath,amsthm,subfigure,hyperref}

5: %\usepackage{lineno}

6:

7: %\pdfcompresslevel=9

8: %\renewcommand{\textfraction}{0.10}

9: %\renewcommand{\floatpagefraction}{0.99}

10:

11: %\pagestyle{fancy}

12: %\fancyhf{}

13: %\headheight 35pt

14: % Use the CVSID as the center footer

15:

16: \newcommand{\comment}[1]{}

17: \newcommand{\grad}{\ensuremath{^\circ}}

18: %\newcommand{\comment}[1]{\emph{#1}}

19: %\renewcommand{\baselinestretch}{1.9}

20: \begin{document}

21: %\preprint{pre-final draft. Exact numbers and figures are subjects to change}

22: \title{Thermodynamic approach for community discovering within the

23:   complex networks: LiveJournal study.}

24: \author{Pavel Zakharov}

25: \affiliation{Department of Physics, University of Fribourg, CH-1700,

26:   Switzerland, email: Pavel.Zakharov@unifr.ch}

27:

28: \date{\today}

29:

30: \begin{abstract}

31: The thermodynamic approach of concentration mapping is used to

32: discover communities in the directional friendship network of LiveJournal

33: users. We show that this Internet-based social network has a power-law

34: region in degree distribution with exponent $\gamma = 3.45$. It is

35: also a small-world network with high clustering of nodes. To study the community structure we

36: simulate diffusion of a virtual substance immersed in such a network as in

37: a multi-dimensional porous system. By analyzing concentration profiles at

38: an intermediate stage of the diffusion process, the well-interconnected

39: cliques of users can be identified as nodes with equal values of concentration.

40: \end{abstract}

41: \pacs{89.75.Hc, 05.10.-a, 87.23.Ge, 89.20.Hh}

42:

43: \maketitle

44: %\linenumbers

45:

46:

47: \begin{figure}

48: \centering

49:   \includegraphics[width=\linewidth]{fig1.eps}

50:   \caption{Probability density functions of in- and out-degrees for

51:   LiveJournal

52:   users. The line shows a slope of -3.45 which equally well fits

53:   $P(k_{in})$ and $P(k_{out})$.}

54:   \label{fig:degrees}

55: \end{figure}

56:

57: \section{INTRODUCTION}

58:

59: In recent years there has been an enormous breakthrough in research of

60: complex networks due to the application of statistical

61: physics methodology \cite{albert:review,dorogovtsev:review,dorogovtsev03:book}. Many different complex

62: systems instead of being completely random prove to have signatures

63: of organization such as clustering and power-law distribution of

64: links. Together with the small-world property \cite{milgram:small} these are the inherent features

65: of an extremely wide variety of systems such as the World-Wide

66: Web \cite{albert:diameter,kleinberg:web99,kumar:web00},

67: Internet \cite{satorras:internet}, collaboration networks of movie

68: actors \cite{watts:1998,newman:random} and scientists \cite{newman:random}, the web of human

69: sexual contacts \cite{liljeros:2001} and many others. Although

70: some concepts of complex network theory were originally

71: introduced in sociology, the statistical study of social networks is

72: complicated by the difficulty of collecting reliable data for privacy and ethical reasons.

73:  One solution to this problem is the analysis of collaboration

74: networks \cite{watts:1998,newman:random}, e-mail

75: interactions \cite{arenas:email,arenas:community,newman:email}, instant

76: messaging \cite{smith:2002} and online blogging \cite{kumar:bursty,kumar:structure,nowell:phd,nowell:pnas}.

77: \comment{

78: Nowell {\em et al.} recently studied geographic aspects of LiveJournal

79: (www.livejournal.com) blog space  and they reported parabolic shape of

80: friendship degrees distributions \cite{nowell:pnas}. }

81: Here we study the basic structural properties of the social network of the LiveJournal

82: blog service and demonstrate a diffusion-motivated method for

83: discovering communities, using this network as a test case.

84:

85: \section{LIVEJOURNAL NETWORK}

86:

87: LiveJournal (LJ) is an online web-based journal service

88: with an emphasis on user interaction \cite{lj:faq}. In January 2006 it had $9.3 \cdot 10^6$

89: users in total, $2.0

90: \cdot 10^6$ of whom were {\em active in some way} according to

91: official LiveJournal statistics \cite{lj:stat}.

92: The essential feature of the LJ service is the ``friends'' concept, which helps

93: users to organize their reading preferences and

94: provides security settings for their journal entries and personal data. The friends list

95: is public information and can be accessed through a conventional WWW

96: interface or through a dedicated bot interface provided by the LJ system.

97:

98: Data collection was performed by crawler programs running simultaneously

99: on two computers and exploring the LJ space by following directional

100: friendship links, starting from two users with a large number of

101: incoming friendship links. For each user the crawler obtained the friends list (outgoing links) and the

102: number of users who have the given user in their friends list (incoming

103: links). Each user from the friends list who had not yet been explored by

104: the crawler was added to the end of the processing queue unless already

105: there. If the user was already in the queue, the user's queue score was

106: incremented every time the user was found in someone's friends list. Users

107: with higher queue scores were processed first. This ensured fast

108: collection of the essential part of the network. In essence this algorithm

109: is a modification of Tarjan's depth-first search algorithm for

110: finding the connected component of a graph \cite{tarjan:alg72,hopcroft:alg73}.

111: Data collection took 14 days and discovered

112: $3\:746\:264$ users, all in one connected component. We are aware

113: that during the time of collection the network was

114: undergoing continuous changes. We estimate that the fraction of users deleted from the LJ

115: database but still present in friends lists was less than 0.1\%,

116: which suggests that the evolution of the LJ network did not

117: influence our statistics much.

118:

119: The estimated probability distribution functions of in- and out-degree are presented on a

120: log-log scale in Fig.~\ref{fig:degrees}.  The estimated mean

121: numbers of outgoing and incoming friendship links are $\langle k_{out}\rangle =

122: 15.91$ and $\langle k_{in}\rangle = 16.07$, respectively. The

123: average in-to-out ratio is $\langle k_{in} / k_{out}\rangle =

124: 1.157$. The number of incoming links is slightly larger than the

125: number of outgoing links because only the outgoing links were

126: used for crawler navigation: some LJ users could not be reached by following

127: directional links, yet they were counted among the incoming links listed on the pages of explored users.

128:
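A short consistency check on these averages may be useful. It assumes, as a simplification not stated in the text, that every outgoing link of an explored user points to another explored user. In a closed directed graph each link contributes once to an out-degree and once to an in-degree, so the two means coincide. Under this assumption the observed excess counts links arriving from users outside the collected set:
\begin{equation*}
N \left( \langle k_{in}\rangle - \langle k_{out}\rangle \right) \approx 3.75 \cdot 10^6 \times (16.07 - 15.91) \approx 6.0 \cdot 10^5 .
\end{equation*}
Note also that the mean of the ratio differs from the ratio of the means: $\langle k_{in}\rangle / \langle k_{out}\rangle = 16.07 / 15.91 \approx 1.010$, whereas $\langle k_{in}/k_{out}\rangle = 1.157$. The per-user ratio gives every user equal weight regardless of degree.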

129: There are also several technical restrictions on the degrees: the maximum

130: number of friends per user is 750, and only 150 of them are listed on the user's info

131: page, where other LJ users can access them effortlessly. In our experience the LJ bot interface

132: has problems listing the users who consider a certain

133: user as a friend if there are more than 2500 of them; hence we cut the

134: data at $k_{out\: max} = 2500$.

135:

136:  As Fig.~\ref{fig:degrees} shows, the in- and out-degree distributions

137: reveal a power-law decay $P(k) \sim k^{-\gamma} $ for $k > 100$ with

138: the value of the exponent $\gamma_{in} \approx \gamma_{out} =  3.45 \pm 0.05$, which is

139: surprisingly close to the values $\gamma_{in} \approx \gamma_{out} \approx

140: 3.4$ obtained by Liljeros {\em et al.} for sexual

141: contacts \cite{liljeros:2001}. This scaling of the distributions contradicts the

142: results of Nowell {\em et al.} \cite{nowell:pnas},

143: who reported a parabolic shape of the LJ degree distributions.

144: The skewness of

145: the distributions in our case can be explained by the social origin of

146: the LJ network. As pointed out by Jin {\em et

147:   al.} \cite{jin:structure}, the degree distribution of social

148: networks does not appear to follow a power law because of the

149: cost, in terms of time and effort, of maintaining friendships. In the case of

150: the LJ network the cost of friendship is the size of the friends feed, which

151: accumulates all the recent entries of the user's friends. We can also

152: distinguish two classes of LJ users: ``readers'' and ``writers''. The former

153: mainly use their accounts to read the journals of others. They update

154: their journals only occasionally and are not deeply involved in LJ

155: community life. They do not have many incoming and outgoing links and

156: they are responsible for the skewness of the distributions for $k <

157: 100$. Meanwhile active ``writers'', who represent a minority of the registered users,

158: exploit the full capacity of the LJ system. They spend much time participating

159: in LJ community life, and they have a larger number of incoming and

160: outgoing links, whose distribution follows a power law.

161:
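The value of the exponent also determines which moments of the degree distribution remain finite. As a sketch, assume an idealized pure power-law tail $P(k) \sim k^{-\gamma}$ extending to $k \to \infty$; the measured data are cut off at finite $k$, and $k_{min}$ below is an illustrative lower bound of the tail, not a fitted quantity. Then
\begin{equation*}
\langle k^m \rangle \sim \int_{k_{min}}^{\infty} k^{m-\gamma}\, dk < \infty \quad \Longleftrightarrow \quad m < \gamma - 1 .
\end{equation*}
For $\gamma = 3.45$ this gives $m < 2.45$: the mean and the second moment converge while the third moment diverges.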

162: The origin of the power-law region in the distributions can be explained by

163: continuous evolution and self-organization of the LJ network and by a preferential attachment

164: mechanism similar to the general WWW growth mechanism

165: \cite{barabasi:1999}. Once an interesting journal becomes popular it will

166: be cited and promoted in

167: the journals of its readers, which helps to further increase its

168: popularity and leads to the ``rich-get-richer'' effect occurring

169: in many network systems

170: \cite{barabasi:1999,dorogovtsev:review}. However, linear growth

171: with a linear preferential attachment protocol leads to a power-law

172: degree distribution with $\gamma = 3$, which is smaller than the exponent

173: obtained in our study. Larger values of the exponent can be

174: explained by alternative growth mechanisms: preferential attachment

175: with rewiring \cite{albert:topology00} and the copying mechanism

176: \cite{kleinberg:web99,kumar:web00}. Rewiring in the LJ system implies that

177: users are not only establishing new friendship links but also breaking

178: old ones, while copying occurs when a user inherits part of the

179: friendship connections of his friends. The latter effect is called

180: ``transitivity'' in sociology \cite{wasserman:94} and is responsible for

181: the formation of user cliques, or clustering.

182:
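For reference, the value $\gamma = 3$ follows from the standard continuum argument for linear preferential attachment \cite{barabasi:1999}. The sketch below assumes an undirected network in which one node with $m$ links is added per time step and arrival times are uniformly distributed; the symbols $m$, $t$ and $t_i$ (the arrival time of node $i$) are introduced here for illustration only. The total degree at time $t$ is approximately $2mt$, so
\begin{align*}
\frac{\partial k_i}{\partial t} &= m \frac{k_i}{\sum_j k_j} = \frac{k_i}{2t}, \qquad k_i(t_i) = m, \\
k_i(t) &= m \left( t / t_i \right)^{1/2}, \\
P\left( k_i(t) < k \right) &= P\left( t_i > m^2 t / k^2 \right) \approx 1 - m^2 / k^2, \\
P(k) &= \frac{\partial}{\partial k} \left( 1 - \frac{m^2}{k^2} \right) = \frac{2 m^2}{k^3} .
\end{align*}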

183: We characterize clustering of LJ users by calculating the clustering

184: coefficient as introduced by Watts and Strogatz

185: \cite{watts:1998,albert:review}. It is defined as the number of links between a user's friends divided

186: by the maximum possible number of links between them, averaged over all

187: users in the network. If the user $i$ has $k_i$

188: friends with $E_i$ links between them, the maximum possible number of

189: directed links is $k_i (k_i - 1)$ and the clustering coefficient for the

190: user $i$ in the case of a directed network can be defined as:

191: \begin{equation}

192: C_i = \frac{E_i}{k_i (k_i - 1)}.

193: \label{eq:clustering}

194: \end{equation}

195: The average clustering coefficient for the whole network as calculated

196: from our data is $C = \langle C_i \rangle_{i=1..N} \approx 0.3302$. It is worth

197: comparing this value to the clustering coefficient of a random

198: directional Erd\H{o}s-R\'{e}nyi graph, which can be found as $C_{rand}

199: = \langle k \rangle / (N - 1)$ and for the LJ network is ca. $4.24 \cdot 10^{-6}$. The fact that the actual

200: clustering coefficient of the LJ network is nearly five orders of magnitude larger than

201: would be expected for a randomly linked network with the same mean degree and

202: size is a clear indication of high user clustering.

203:
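To make the definition in Eq.~\eqref{eq:clustering} concrete, consider a hypothetical user with $k_i = 4$ friends and $E_i = 6$ directed links among them (values chosen for illustration only):
\begin{equation*}
C_i = \frac{6}{4 \cdot 3} = 0.5 .
\end{equation*}
The comparison with the random graph can be checked with the numbers quoted above, taking $\langle k \rangle \approx \langle k_{out} \rangle = 15.91$ as an assumption:
\begin{equation*}
C_{rand} \approx \frac{15.91}{3\:746\:264 - 1} \approx 4.2 \cdot 10^{-6}, \qquad \frac{C}{C_{rand}} \approx \frac{0.3302}{4.24 \cdot 10^{-6}} \approx 7.8 \cdot 10^{4} ,
\end{equation*}
that is, a factor of about $10^{4.9}$.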

204:

205: A peculiar feature of the LJ network is the high

206: reciprocity \cite{wasserman:94} of friendship links. We found that 79.26\% of links

207: are bi-directional, which means that this percentage of outgoing links

208: is returned as incoming and, {\em vice versa},

209: the same percentage of incoming links originates from the user's friends.

210: This value is higher than the reciprocity of 57\% found for the WWW

211: \cite{newman:email02}, which is the technical environment of LJ. The higher

212: reciprocity may be explained by the social origin of the LJ network. Due to

213: the rules of social interaction, user $A$ usually feels obliged to establish a

214: friendship connection to user $B$ if such a connection was already

215: established by $B$ to $A$. Another explanation for the high reciprocity is

216: that relations in

217: LJ space are often based on real-life relations between people, which means that

218: LJ users link to other users who are their friends in the real

219: world. In this case the LJ network directly inherits the undirected

220: structure of the underlying social network.

221:
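The reciprocity quoted here can be written as a simple ratio. In the notation below, which is introduced for illustration and is not taken from the references, $L$ is the total number of directed links and $L^{\leftrightarrow}$ is the number of directed links whose reverse link also exists:
\begin{equation*}
r = \frac{L^{\leftrightarrow}}{L} .
\end{equation*}
For example, for three users with links $A \to B$, $B \to A$ and $A \to C$, two of the three links are returned, so $r = 2/3 \approx 0.67$.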

222:

223: \begin{figure}

224: \centering

225:   \includegraphics[width=\linewidth]{fig2.eps}

226:   \caption{Probability distribution function of the minimum path length

227:     between LiveJournal users through the directional friends links.}

228:   \label{fig:path_distance}

229: \end{figure}

230:

231:

232: To characterize the small-world properties of the LJ network we estimated the

233: probability distribution function $P_\ell(\ell)$ of the minimum path distance,

234: or hopcount, between the nodes through directional links. The results are presented in

235: Fig.~\ref{fig:path_distance}. The average distance estimated for

236: our set of data is $\langle \ell \rangle = 5.86$.

237: \comment{

238: According to the

239: general approach developed by Newman \textit{et al.}

240: \cite{newman:random} an average path length can be estimated using the

241: following expression:

242: \begin{equation}

243: \mathcal{\ell} = \frac{ln(N / z_1)}{ln(z_2/z_1)} + 1,

244: \end{equation}

245: where $N$ is the size of the network and $z_1 = \langle k_{out} \rangle $ and $z_2$ is the number

246: of the first and the second neighbours. From this we obtained $\ell \approx

247: 4.3$ which is significantly smaller than the value obtained from

248: the distribution. We are considering this as a first sign of structure

249: within LJ network.

250: }

251: Based on the recently obtained expression for the mean distance between

252: the nodes in scale-free networks by Hooghiemstra \textit{et al.} \cite{hoog:mean05},

253: who improved the widely used result of Newman \textit{et

254:   al.} \cite{newman:random}, the value of $\langle \ell \rangle$ can be estimated

255: as follows:

256: \begin{equation}

257: \langle \ell \rangle_{th} \approx \frac{\ln N}{\ln \nu} + \frac{1}{2} -

258: \left ( \frac{\gamma_e + \ln \mu - \ln (\nu - 1)}{\ln \nu} \right ) - 2

259: \frac{\epsilon}{\ln \nu},

260: \label{eq:hoog}

261: \end{equation}

262: where $N$ is the size of the network, $\mu = \langle k \rangle$,

263: $\nu = \langle k (k - 1)\rangle / \langle k \rangle$, $\gamma_e

264: \approx 0.577$ is the Euler-Mascheroni constant, and $\epsilon$ is the

265: expectation of the logarithm of the limit of a super-critical

266: branching process, which depends on the scaling exponent $\gamma$

267: and belongs to the half-open interval $(-1,0]$. The lower

268:   boundary is a numerical extrapolation of the results from

269:   \cite{hoog:mean05} and the upper boundary is the theoretical prediction

270:   for $\gamma > 3$.

271:

272: For the LJ data Eq.~\eqref{eq:hoog} gives the following range of the mean distance:

273: $ 4.53 \le \langle \ell \rangle_{th} < 5.05$, which is in any case

274: smaller than the value $\langle \ell \rangle = 5.86$ obtained from the data. This theoretical prediction

275: assumes homogeneity of the graph, and we believe the likely reason

276: for such an underestimation of the mean path length is the

277: macroscopic structuring of the network, which is discussed below.

278:
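The width of this range follows directly from the last term of Eq.~\eqref{eq:hoog}, which is proportional to $\epsilon$ and has magnitude at most $2/\ln\nu$ (natural logarithm) for $\epsilon \in (-1,0]$. As a rough consistency check, and not a value reported in the analysis, the quoted range can be inverted to estimate $\nu$:
\begin{equation*}
\frac{2}{\ln \nu} \approx 5.05 - 4.53 = 0.52 \quad \Rightarrow \quad \ln \nu \approx 3.85, \quad \nu \approx 47 .
\end{equation*}
Because the bounds are rounded to two decimals, this estimate is only accurate to roughly ten percent.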

279: \begin{figure}

280: \centering

281:   \includegraphics[width=\linewidth]{fig3.eps}

282:   \caption{Illustration of the community detection algorithm. After

283:     diffusion process starts from the initiator node virtual ink

284:   propagates through network links. Communities can be recognized as the

285:     groups of nodes with similar amount of ink.}

286:   \label{fig:scheme}

287: \end{figure}

288:

289:

290: \section{COMMUNITY DISCOVERING METHOD}

291:

292: It seems quite natural for the nodes of complex networks

293: to aggregate into macroscopic structures with a high internal link

294: density and weak connections to the rest of the network. Such groups are

295: often referred to as communities. The particular reasons for community

296: formation may depend on the type of network, but this feature

297: has proved to be quite universal and can be found in social, biological and

298: computer networks \cite{girvan:pnas02,newman:fast}. Finding these

299: structures within a network is a major step towards understanding its topology.

300:

301: This problem is known as the graph-partitioning problem in graph theory.

302: It is NP-hard in general, which makes an exact solution

303: practically infeasible for large networks.

304:

305: Recent advances in the study of complex networks stimulated the search

306: for alternative techniques of community discovery, and many original solutions

307: have been proposed

308: \cite{girvan:pnas02,newman:mixing,newman:community,newman:fast,clauset:2004,pons:rw05,wu:physics04,simonsen:diff04}.

309: These algorithms can be divided into two main classes: \textit{divisive}, which

310: hierarchically split the network by removing edges with the highest

311: betweenness \cite{girvan:pnas02,newman:community}, and

312: \textit{agglomerative}, which start from the maximal community

313: division, where each node belongs to its own separate community, and

314: successively merge these communities based on some measure of

315: node similarity \cite{wu:physics04,pons:rw05} or by optimizing

316: the partitioning. In their recent work Clauset \textit{et al.}

317: \cite{clauset:2004} used greedy optimization to maximize

318: the \textit{modularity} measure of partitioning quality

319: \cite{newman:community,newman:fast}. Currently this method is one of the

320: fastest and runs in time $O (M H \ln N)$, where $M = \langle k \rangle N$ is the number

321: of edges in the network and $H$ is the number of decomposition levels,

322: which is usually small ($H = O (\ln N)$)

323: \cite{clauset:2004,pons:rw05}. In a sparse network the mean degree is bounded,

324: so $M = O (N)$ and the complexity becomes

325: $O (N \ln^2 N)$.

326:
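To give a sense of scale, these expressions can be evaluated for the LJ network with the values reported in the previous section. This is an order-of-magnitude illustration only, since the constant factors hidden by the $O$ notation are ignored:
\begin{equation*}
M = \langle k_{out} \rangle N \approx 15.91 \times 3.75 \cdot 10^6 \approx 6.0 \cdot 10^7, \qquad \ln N \approx 15.1, \qquad N \ln^2 N \approx 8.6 \cdot 10^8 .
\end{equation*}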

327: Here we propose a method of finding communities based on the principles

328: of thermodynamics. When a system is large enough, the behavior

329: of its microscopic constituents can be averaged to give

330: a basis for a scientific description of phenomena that avoids

331: microscopic details. Since in thermodynamics the behavior of a system can

332: be described without solving the equation of motion of every

333: constituent molecule, we believe that the structure of a large complex

334: network can be explored without an explicit solution of the graph

335: partitioning problem.

336:

337: Our current study is based on the simulation of a mass diffusion process in the complex network,

338: treated as a multi-dimensional porous system with directional links, following

339: physical laws. The diffusion process, initiated at one of the nodes by

340: the addition of virtual ink, produces a non-uniform mass distribution at the intermediate stage,

341: which can be used to reveal well-interconnected communities within the

342: complex network by selecting the nodes with similar concentration

343: values. In this sense our method falls in the class of agglomerative

344: techniques with the concentration as the similarity measure. However, it

345: can be shown that the quantity $r_{AB} = | \ln \phi_A - \ln \phi_B |$,

346: where $\phi_A$ and $\phi_B$ are the values of concentration in the nodes $A$

347: and $B$, can be used as a measure of distance between these nodes. Thus

348: edge betweenness, characterized as the drop of the logarithm of concentration

349: along the edge, can be used for hierarchical decomposition of the

350: network.

351:
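That $r_{AB}$ behaves as a distance can be verified directly. Since it is the absolute difference of a single scalar quantity $\ln \phi$, for any three nodes $A$, $B$ and $C$ with positive concentrations:
\begin{align*}
r_{AB} &\ge 0, \qquad r_{AB} = r_{BA}, \\
r_{AC} &= \left| (\ln \phi_A - \ln \phi_B) + (\ln \phi_B - \ln \phi_C) \right| \le r_{AB} + r_{BC} .
\end{align*}
Strictly speaking $r_{AB}$ is a pseudo-metric, because $r_{AB} = 0$ for any two distinct nodes with equal concentration. This is exactly the property used to group nodes into a community.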

352: A similar measure of distance between nodes based on the random walk

353: has been recently introduced by Pons \textit{et al.} \cite{pons:rw05}

354: for the class of undirected networks. It is defined as the difference in probabilities

355: for a random walker to reach the nodes $A$

356: and $B$ in a certain number of steps $t$ starting from some node

357: $Z$. As these probabilities for a large $t$ are mainly determined by

358: the in-degrees of the nodes, the values of distance should be normalized.

359: The appropriate short number of steps $t$ may depend on the particular

360: network and should be known in advance. Pons \textit{et al.} also pointed out

361: conceptual difficulties in applying the random walk scheme to directed

362: networks \cite{pons:rw05}. Several other diffusion-motivated

363: approaches proposed recently (\textit{e.g.}

364: \cite{wu:physics04,simonsen:diff04,fouss:novel05}) are more or

365: less consistent with the random-walk analogy.

366:

367: In our model we break the similarity with classical random

368: walks and the theory of flows in graphs \cite{diestel:graph} in

369: favour of a realistic physical picture. First, we allow nodes to

370: accumulate substance by assigning to them infinite maximum capacity.

371: Direct flow from the node $A$ to the node $B$ is possible if there is

372: a directed link from $A$ to $B$ and $\phi_A > \phi_B$. The flow rate

373: in this case depends on the concentration difference $\phi_A - \phi_B > 0$ and the out-degree

374: $k_{out}$ of the node $A$. In the case of $\phi_A < \phi_B$ no mass is

375: delivered directly from $A$ to $B$. In the limit of

376: infinite time such rules lead to an equilibrium state with equal mass distribution,

377: which meets physical expectations.

378:

379: Network links in our realization represent pipes (Fig.~\ref{fig:scheme}); directed

380: links act as pipes allowing mass to pass in one direction only. Mass

381: propagation within the network system is driven by Fick's law of diffusion:

382: \begin{equation}

383: %J  = - D \frac{\delta \phi}{\delta x}

384: dM = - D \frac{\partial \phi}{\partial x} dS dt,

385: \end{equation}

386: where $dM$ is the mass change, $D$ is the diffusion coefficient, $\partial \phi / \partial x$ is the

387: concentration gradient, $dS$ is an area element and $dt$ is a time element.

388:

389: For our discrete system this implies that the rate of mass

390: exchange between neighbouring nodes is proportional to the difference of masses in these

391: nodes. Every node uses its outgoing links to deliver mass to its neighbours with

392: a smaller amount of ink. The change $\Delta_{out} M_i$ of the node's mass due to the ink

393: delivered to its $i$th neighbour is:

394: \begin{equation}

395: \Delta_{out} M_i = - \frac{\alpha}{k_{out}} (M_0 - M_i),

396: \label{eq:main}

397: \end{equation}

398: where $M_0$ is the mass in the delivering node, $M_i < M_0$ is the mass in its $i$th neighbour, and $\alpha$ is the coefficient determining the

399: transfer rate, which is constant for all

400: nodes. We analyze the mass $M$ contained in the node instead of

401: the concentration $\phi$, assuming that all nodes have the same

402: geometrical volume.

403: The total delivered mass for a node is the following:

404: \begin{multline}

405: \Delta_{out} M = \sum_{i=1}^{k_{out}} \Delta_{out} M_i =  - \alpha \left ( M_0 - \frac{1}{k_{out}}

406: \sum_{i=1}^{k_{out}} M_i \right) =  \\

407: - \alpha ( M_0 - \overline{M} ),

408: \end{multline}

409: where $\overline{M}$ is the mean ink mass in the neighbouring

410: nodes with smaller masses. Mass transfer in the pipe happens

411: instantaneously.

412:  Thus we can apply mass conservation

413: law and increase mass in the neighbouring nodes by the amount

414: taken from the node:

415: \begin{eqnarray}

416: \Delta_{out} M & = & - \sum_{i=1}^{k_{out}} \Delta_{in} M_i \\

417: \Delta_{in} M  & = & - \sum_{i=1}^{k_{in}} \Delta_{out} M_i

418: \end{eqnarray}

419:

420: The total change of mass at a certain node is composed of the loss of mass due

421: to diffusion to the neighbours through outgoing links and gain of mass

422: by the

423: amount delivered from neighbors through

424: incoming links: $\Delta M = \Delta_{in} M + \Delta_{out} M$. This

425: conservation law is the extension of Kirchhoff's

426: law \cite{diestel:graph} for the node with non-zero capacity.

427:

428: To prevent artifacts due to sequential node processing, mass changes

429: for all nodes were calculated without actually changing the masses, and then

430: the values of the masses in all nodes were updated simultaneously. In the special case of a node without outgoing

431: links, $\Delta_{out} M = 0$ and the node acts as a virtual ink absorber, which can

432: only gain ink from the neighbours but has no way to deliver it

433: back. Nodes without incoming links are not

434: considered, since they are invisible to the data-collecting crawler and

435: thus absent from our database.

436:
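As an illustration of one synchronous update step, consider a hypothetical three-node network (not part of the LJ data) in which node $A$ has directed links to nodes $B$ and $C$, so that $k_{out} = 2$ for $A$. Assume masses $M_A = 3$, $M_B = M_C = 0$ and $\alpha = 0.5$. Equation~\eqref{eq:main} gives for each of the two links
\begin{equation*}
\Delta_{out} M_i = - \frac{0.5}{2} (3 - 0) = -0.75 ,
\end{equation*}
so the total change for $A$ is $\Delta_{out} M = -\alpha ( M_0 - \overline{M} ) = -0.5\,(3 - 0) = -1.5$, and by mass conservation each neighbour gains $0.75$:
\begin{equation*}
(M_A, M_B, M_C): \quad (3, 0, 0) \;\longrightarrow\; (1.5, 0.75, 0.75) .
\end{equation*}
The total mass $M_A + M_B + M_C = 3$ is unchanged. Since $B$ and $C$ hold no ink before the step, they deliver nothing in this step regardless of their own links.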

437: We start by putting an initial amount of ink of $M_0 = N$ mass

438:  units in one of the nodes, which we call the \textit{initiator}. Subsequently the system is

439: allowed to proceed to the equilibrium state by continuous mass

440: redistribution within the network according to our rules. The

441:  expected

442: equilibrium state for a connected network is an equal

443: distribution of the mass $M_0$ among the nodes, so that each of

444: them ends up having $M_0 / N = 1$ mass units. While evolving to this state the system

445: passes through non-equilibrium states with non-uniform mass

446: distributions.

447:

448: Imagine a cluster of well-connected nodes inside the network,

449: connected to the outside world by only a few outgoing and

450: incoming links. The ink diffusion inside the cluster is relatively fast due to the

451: presence of a large number of exchange channels between the

452: members and the high conductivity of the ensemble of

453: channels. The limited number of channels going outside the cluster forms a

454: bottleneck for mass delivery. Under these conditions the flow rate between

455: the members is much higher than between members and non-members, and the dispersed ink will

456: likely form an equi-concentrational \comment{Phys. Rev. E 49, 5431--5437

457:   (1994)} volume within the cluster.

458: Each cluster in this system, with its

459: specific connection properties such as flow rate and distance from

460: the initiator, would have in each of its

461: nodes the same concentration of ink, with a value specific to the

462: particular cluster. Thus by estimating the probability distribution

463: function of concentration one can analyze the non-uniformity of the ink

464: distribution and reveal separated clusters by detecting the

465: signatures of equi-concentration volumes.

466:

467: \begin{figure}

468: \centering

469:   \includegraphics[width=1.05\linewidth]{fig4.eps}

470:   \caption{Dynamics of relative  concentration change in the

471:   initiator node {\em doctor\_livsy} for different flow rates

472:   $\alpha$. Inset shows rescaled data. Oscillatory parts were cut away.}

473:   \label{fig:concentration_decay}

474: \end{figure}

475:

476: The flow rate $\alpha$ in Eq.~\eqref{eq:main} can be selected

477: from the half-open interval $(0,1]$ and defines the speed of the

478:   simulation. Values larger than 0.5 are not desirable because they

479:   can cause concentration waves or back-reflections in some cases.

480:
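The role of the threshold $\alpha = 0.5$ can be seen in a minimal two-node example, given here as an illustration under simplifying assumptions: nodes $A$ and $B$ are joined by links in both directions, each has $k_{out} = 1$, and $d = M_A - M_B > 0$ denotes the mass difference before a step. By Eq.~\eqref{eq:main} node $A$ transfers $\alpha d$ to $B$, hence after the step
\begin{equation*}
d' = (M_A - \alpha d) - (M_B + \alpha d) = (1 - 2\alpha)\, d .
\end{equation*}
For $\alpha < 0.5$ the difference decays monotonically, for $\alpha = 0.5$ the masses equalize in one step, and for $\alpha > 0.5$ the sign of the difference alternates from step to step, so the ink moves back and forth between the nodes. At $\alpha = 1$ this oscillation does not decay at all.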

481: The proposed method does not aim to decompose the whole network into

482: minimal clusters but to reveal significant clusters within the

483: network. As we regard the network as an open system which does

484: not have to be fully described by the existing database, we do not assign

485: a measure of clustering to the whole network, like the modularity proposed by Newman

486: \cite{newman:mixing,newman:community}. However, we can quantify the

487: isolation of an individual community $i$ by the parameter of

488: confinement $K_i$, which characterizes the assortative mixing of the

489: individual community. We can define $K_i$ using the notation of

490: Newman \cite{newman:mixing} as follows:

491: \begin{equation}

492: K_i = \frac{e_{ii}}{\sum_j e_{ij}} = \frac{e_{ii}}{b_i},

493: \end{equation}

494: where $e_{ij}$ is the fraction of network edges connecting nodes of

495: the community $i$ to the community $j$ and $\sum_j e_{ij} = b_i$ is

496: the fraction of edges starting from the members of $i$. Thus

497: parameter $K_i$ defines the number of links connecting the nodes

498: inside the community $i$ as a fraction  of the total number of links

499: originating from the members of $i$.

500:
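As a numerical illustration with hypothetical values, suppose the members of community $i$ have $100$ outgoing links in total, $80$ of which end inside the community, in a network with $L$ directed links overall (the symbol $L$ is introduced here for illustration):
\begin{equation*}
K_i = \frac{e_{ii}}{b_i} = \frac{80 / L}{100 / L} = 0.8 .
\end{equation*}
The total number of links $L$ cancels, so $K_i$ can be computed from the community's own links alone. A community with no links leaving it has $K_i = 1$.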

501: \begin{figure}

502: \centering

503: \includegraphics[width=\linewidth]{fig5.eps}

504: \caption{Probability distribution functions of virtual ink

505:   concentration $M$ at two stages of the diffusion process with $\alpha =

506:   0.1$ and {\em doctor\_livsy} as the initiator node. Inset represents

507:   the same data in linear scale. Two well pronounced peaks of two

508:   separated communities are clearly seen.}

509:  \label{fig:profiles}

510: \end{figure}

511:

512: \begin{figure}

513: \centering

514: \includegraphics[width=\linewidth]{fig6.eps}

515:     \caption{Dynamics of virtual ink distribution within LJ

516:       network as a logarithmically color coded probability

517:       distribution function of the ink concentration (vertical

518:       axis) and simulation step (horizontal axis). Separation of

519:       Russian-speaking community (thin upper line, high concentration

520:       values) from general English-speaking (thicker lower line, lower

521:       concentration values) can be clearly seen.}

522:     \label{fig:dynamics}

523: \end{figure}

524:

525: \section{RESULTS AND DISCUSSION}

To test our method we performed ink diffusion simulations on our
LJ database, starting from different initiator nodes.
Fig.~\ref{fig:concentration_decay} shows the relative mass decay as a
function of the simulation step number $T$ for the flow rates $\alpha =
0.1$, 0.25 and 0.5. The user {\em doctor\_livsy}, who has a high number
of incoming links, was chosen as the initiator node. As we will show
later, this user belongs to an extremely confined Russian-speaking
community. The inset of Fig.~\ref{fig:concentration_decay} shows
the same data rescaled with respect to $\alpha$. As one can see from
the match of the rescaled curves, the dynamics of the process does not
depend on the flow rate $\alpha$ in this range. The striking feature of
the presented data is the obvious step-like form of the curves, which is
an effect of the non-homogeneous structure of the LJ network. Flat parts
of the $\Delta M / M$ curves correspond to exponential decays of $M$,
which is the sign of unrestricted diffusion of ink. The first
significant drop of the decay rate happens at $T \alpha \approx 5$,
which is equal to twice the radius of the community to which our
initiator belongs. This corresponds to the moment when the virtual ink
has filled the whole community and further expansion of the filled area
is impeded by the limited number of links going outside the community.
Indeed, if it takes $T_0$ simulation steps for the virtual ink to reach
the borders of the community, it takes another $T_0$ simulation steps
for the decay of the concentration gradient to reach the initiator node,
and together this gives twice the radius of the community.
The second drop at $T \alpha \approx 22$ is less pronounced and
corresponds to the filling of the whole network.
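
The correspondence between flat parts of the curves and exponential
decay can be made explicit with a short sketch. Assume, for illustration
only, that $\Delta M$ denotes the decrease of $M$ over one simulation
step, $\Delta M = M(T) - M(T+1)$, and that the ratio $\Delta M / M$ stays
at a constant value $\kappa$ from some step $T_1$ on (both $\kappa$ and
$T_1$ are symbols introduced here, not quantities measured in this
work). Then for $n \geq 0$
\begin{align*}
M(T_1 + n) &= (1 - \kappa)^{n} M(T_1) \\
           &= M(T_1)\, e^{n \ln (1 - \kappa)}
            \approx M(T_1)\, e^{-\kappa n}, \qquad \kappa \ll 1 .
\end{align*}
Under this assumption each plateau of $\Delta M / M$ is a stretch of
exponential decay with rate $\kappa$, and a step down to a lower plateau
marks a switch to a slower exponential.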

Since our community discovery algorithm is based on the detection of
equi-concentration volumes, we calculated the probability distribution
function of $M$ at two stages of the virtual ink diffusion for
$\alpha = 0.1$ (Fig.~\ref{fig:profiles}). One can see two
well-pronounced peaks in all plots, which turned out to be the
Russian-speaking community (larger values of mass $M$) and the rest of
the LJ network (broader peak at smaller values of $M$).

The dynamics of the virtual ink distribution is presented in
Fig.~\ref{fig:dynamics}. As can be seen, a distinct separation of the
Russian community peak from the main peak is formed before
$T \alpha = 50$. At later stages it remains stable and easily
distinguishable up to $T \alpha = 10^3$, which gives a long
quasi-stationary stage that can be used for community detection. It
also demonstrates that the formation of equi-concentration volumes is
much faster than the relaxation of the whole system.

If the initiator node is selected somewhere outside the community, the
splitting of the distribution peak is also observed, but in this case
the average concentration within the Russian community is smaller than
in the rest of the LJ nodes. This supports the expectation that a
community with a limited number of outgoing links also lacks
incoming links.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{fig7.eps}
    \caption{Two-dimensional map of the LJ user network obtained from
    the concentration configurations of independent diffusion processes
    started from two initiator nodes, at the stage $T \alpha = 100$.}
    \label{fig:mapping}
\end{figure}

\begin{table*}[t]
\caption{Examples of discovered communities within LiveJournal
  userspace.}
\label{tab:comms}
\begin{ruledtabular}
\begin{tabular}{ccccl}
%\hline
%\hline
Representative node & Number of users & Specificity & Confinement $K$ &
Comments \\
\hline
{\em doctor\_livsy} & 227314 & 99.89\% & 98.34\% & Russian-speaking
community\footnote{92\% of users have Cyrillic letters in their
  information pages or journals.} \\
{\em future\_visions} & 421 & 98.36\% & 96.22\% & Fandom High Role-Playing
Game community \\
{\em alected} & 262 & 99.21\% & 99.10\% & Leviosa Role-Playing Game community \\
%\hline
%\hline
\end{tabular}
\end{ruledtabular}
\end{table*}

The accuracy of the community discovery scheme can be improved by
simultaneous simulation of the diffusion from two or more initiator
nodes. Here we assigned two independent concentration values to each
node. The diffusion processes proceed without influencing each other.
The LJ network can now be mapped as a probability distribution function
of two concentrations, and thus a community can be localized on a
two-dimensional plot, as shown in Fig.~\ref{fig:mapping} for
{\em doctor\_livsy} and {\em future\_visions} as the initiator nodes.
One can see two main separated peaks corresponding to the major part of
the LJ network and to the Russian-speaking community. The abundant
noise-like spots on the map correspond to small, well-separated and
well-linked communities in the network, each of which is well localized.
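
One possible way to write down this map, in notation that is ours and
only illustrative, is to denote by $M^{(1)}_i$ and $M^{(2)}_i$ the two
concentration values of node $i$ at the chosen stage and to consider the
joint distribution
\begin{equation*}
P(m_1, m_2) = \frac{1}{N} \sum_{i=1}^{N}
  \delta\bigl(m_1 - M^{(1)}_i\bigr)\,
  \delta\bigl(m_2 - M^{(2)}_i\bigr),
\end{equation*}
where $N$ is the number of nodes and $\delta$ is the Dirac delta
function; in practice it would be estimated by a two-dimensional
histogram. Nodes of one equi-concentration volume have nearly equal
pairs $(M^{(1)}_i, M^{(2)}_i)$ and therefore pile up in a single peak
of $P$.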

The selection of nodes from a certain community can be performed by
simply thresholding the values of both concentrations. The nodes whose
concentration values lie within the selected range and which form a
connected component in the network can be identified as the community.
The ratio of the number of connected nodes to the total number of
users with concentrations within the range defines the
\textit{specificity} of the method.
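
Written as a formula, with symbols introduced here for illustration
only: let $R$ be the set of nodes whose concentrations lie within the
selected range and $C \subseteq R$ the connected component identified
as the community. The specificity is then
\begin{equation*}
\mathrm{specificity} = \frac{|C|}{|R|} \leq 1 ,
\end{equation*}
so a value close to $100\%$ means that almost all nodes selected by
thresholding are linked into one connected component.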

Since a complete analysis of the LJ community structure and of the
reasons for the formation of communities is beyond the scope of the
current paper, we do not list all user cliques found. However, in
Table~\ref{tab:comms} we list the largest LJ community and two smaller
ones together with their parameters. The size of the discovered
Russian-speaking community is of the order of the total number of LJ
users from the Russian Federation according to LJ database statistics
\cite{lj:stat} ($232\;241$ users in January 2006). The obvious reason
for the separation of this community, with a very high value of
confinement $K = 98.34$\%, is the prevailing usage of the Russian
language. We found by a separate analysis of info pages and journal
entries that 92\% of the users within this community use the Cyrillic
alphabet. The fact that the Russian LJ community differs from the rest
of the LJ network has already been pointed out by Internet observers
(e.g., Ref.~\cite{gorny:RLJ}). The two other listed communities are
examples of a surprisingly popular class of Role-Playing Game
communities formed by virtual users who play characters and
write their journals on behalf of these characters.

\section{CONCLUSIONS}

The LiveJournal friendship network was studied with the general approach
developed for complex networks, and a power-law tail with exponent
$\gamma = 3.45$ was found in the degree distributions. This network
also demonstrates the small-world property and high clustering.

To study the community structure we used an original thermodynamic
approach. We found that diffusion in the essentially non-Euclidean
geometry of a complex network with community structure leads to a
peculiar phenomenon: the formation of quasi-stationary
equi-concentration volumes, as shown by our simulation. This proves to
be very useful for the detection of well-interconnected groups of
nodes. With a limited number of parallel diffusion processes, which is
sufficient for a rough decomposition, our method has an $O(N \ln N)$
complexity (each simulation step analyzes $M = \langle k \rangle N$
edges, which for a sparse adjacency matrix gives $M = O(N)$, and the
required number of steps is proportional to the diameter of the
network, which is $O(\ln N)$). It is currently one of the fastest
algorithms and has been applied to a huge directed network of LJ users
containing several million nodes. Obtaining the results presented in
this paper takes only one or two hours of desktop computer time.
Moreover, this method can be applied locally to a specific part of the
network even without complete information about distant parts of the
network. The sensitivity of the decomposition can be tuned by
increasing the number of initiator nodes, up to the limit of complete
decomposition when every node acts as the initiator of its own
diffusion process.
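
The complexity estimate quoted above can be written out explicitly. As a
sketch, let $n_p$ denote the number of parallel diffusion processes (a
symbol introduced here for illustration) and assume that every step of
every process visits each edge once. Multiplying the number of
processes, the number of edges per step and the number of steps gives
\begin{align*}
\mathrm{cost} &\sim n_p \times \langle k \rangle N \times O(\ln N) \\
              &= O(n_p N \ln N) ,
\end{align*}
which reduces to $O(N \ln N)$ for fixed $n_p$ and a sparse network with
$\langle k \rangle = O(1)$. Under the same assumptions the limit of
complete decomposition, $n_p = N$, would scale as $O(N^2 \ln N)$.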

\acknowledgments

Financial support by the Swiss National Science Foundation is
gratefully acknowledged. We thank Frank Scheffold for helpful
discussion.

\bibliography{/usr/share/texmf/bibtex/bib/base/full}
\end{document}