physics0602063/lj.tex
1: \documentclass[twocolumn,a4paper,aps,pre,preprintnumbers]{revtex4}
2: %\usepackage[pdftex]{graphicx}
3: \usepackage{graphicx}
4: \usepackage{amssymb,amsmath,amsthm,subfigure,hyperref}
5: %\usepackage{lineno}
6: 
7: %\pdfcompresslevel=9
8: %\renewcommand{\textfraction}{0.10}
9: %\renewcommand{\floatpagefraction}{0.99}
10: 
11: %\pagestyle{fancy}
12: %\fancyhf{}
13: %\headheight 35pt
14: % Use the CVSID as the center footer
15: 
16: \newcommand{\comment}[1]{}
17: \newcommand{\grad}{\ensuremath{^\circ}}
18: %\newcommand{\comment}[1]{\emph{#1}}
19: %\renewcommand{\baselinestretch}{1.9}
20: \begin{document}
21: %\preprint{pre-final draft. Exact numbers and figures are subjects to change}
22: \title{Thermodynamic approach for community discovering within the
23:   complex networks: LiveJournal study.}
24: \author{Pavel Zakharov}
25: \affiliation{Department of Physics, University of Fribourg, CH-1700,
26:   Switzerland, email: Pavel.Zakharov@unifr.ch}
27: 
28: \date{\today}
29: 
30: \begin{abstract}
31: The thermodynamic approach of concentration mapping is used to
32: discover communities in the directional friendship network of LiveJournal
33: users. We show that this Internet-based social network has a power-law
34: region in degree distribution with exponent $\gamma = 3.45$. It is
35: also a small-world network with high clustering of nodes. To study the community structure we
36: simulate diffusion of a virtual substance immersed in such a network as in
37: a multi-dimensional porous system. By analyzing concentration profiles at
38: an intermediate stage of the diffusion process, well-interconnected
39: cliques of users can be identified as nodes with equal values of concentration.
40: \end{abstract}
41: \pacs{89.75.Hc, 05.10.-a, 87.23.Ge, 89.20.Hh}
42: 
43: \maketitle
44: %\linenumbers
45: 
46: 
47: \begin{figure}
48: \centering
49:   \includegraphics[width=\linewidth]{fig1.eps}
50:   \caption{Probability density functions of in- and out-degrees for
51:   LiveJournal 
52:   users. The line shows a slope of $-3.45$, which fits
53:   $P(k_{in})$ and $P(k_{out})$ equally well.}
54:   \label{fig:degrees}
55: \end{figure}
56: 
57: \section{INTRODUCTION}
58: 
59: In recent years research on complex networks has advanced rapidly
60: thanks to the application of statistical
61: physics methodology \cite{albert:review,dorogovtsev:review,dorogovtsev03:book}. Many different complex
62: systems, instead of being completely random, prove to have signatures
63: of organization such as clustering and power-law distribution of
64: links. Together with the small-world property \cite{milgram:small} these are the inherent features
65: of an extremely wide variety of systems such as the World-Wide
66: Web \cite{albert:diameter,kleinberg:web99,kumar:web00},
67: Internet \cite{satorras:internet}, collaboration networks of movie
68: actors \cite{watts:1998,newman:random} and scientists \cite{newman:random}, the web of human
69: sexual contacts \cite{liljeros:2001} and many others. Although
70: some concepts of complex network theory were originally
71: introduced in sociology, the statistical study of social networks is
72: complicated by the difficulty of collecting reliable data, for privacy and ethical reasons.
73:  One of the solutions for this problem is the analysis of collaboration
74: networks \cite{watts:1998,newman:random}, e-mail
75: interactions \cite{arenas:email,arenas:community,newman:email}, instant
76: messaging \cite{smith:2002} and online blogging \cite{kumar:bursty,kumar:structure,nowell:phd,nowell:pnas}.  
77: \comment{
78: Nowell {\em et al.} recently studied geographic aspects of LiveJournal
79: (www.livejournal.com) blog space  and they reported parabolic shape of
80: friendship degrees distributions \cite{nowell:pnas}. }
81: Here we study the basic structural properties of the social network of the
82: LiveJournal blog service and demonstrate a diffusion-motivated method for
83: discovering communities, using this network as a test case.
84: 
85: \section{LIVEJOURNAL NETWORK}
86: 
87: LiveJournal (LJ) is an online web-based journal service
88: with an emphasis on user interaction \cite{lj:faq}. In January 2006 it had $9.3 \cdot 10^6$
89: users in total, of whom $2.0
90: \cdot 10^6$ were {\em active in some way} according to
91: official  LiveJournal statistics \cite{lj:stat}.
92: The essential feature of the LJ service is the ``friends'' concept, which helps
93: users to organize their reading preferences and 
94: provides security regulations for their journal entries and personal data. The friends list
95: is public information and can be accessed through a conventional WWW
96: interface or through a dedicated bot interface provided by the LJ system.
97: 
98: Data collection was performed by crawler programs running simultaneously
99: on two computers and exploring the LJ space by following directional
100: friendship links starting from two users with a large number of
101: incoming friendship links. For each user the crawler obtained the friends list (outgoing links) and the
102: number of users who have the given user in their friends list (incoming
103: links). Each user from the friends list who had not yet been explored by
104: the crawler was added to the end of the processing queue unless already
105: there. If the user was already in the queue, the user's queue score was
106: incremented every time the user was found in someone's friends list. Users
107: with higher queue scores were processed first. This ensured fast
108: collection of the essential part of the network. Basically this algorithm 
109: is a modification of Tarjan's depth-first search algorithm for
110: finding the connected component of a graph \cite{tarjan:alg72,hopcroft:alg73}.
111: The total collection time was 14 days, and the total number of 
112: discovered users was $3\:746\:264$, found in a connected component. We are aware
113: that during the time of collection the network was 
114: undergoing continuous changes. We estimated that the fraction of users deleted from the LJ
115: database but still present in the friends lists was less than 0.1\%,
116: which makes us believe that the evolution of the LJ network did not
117: influence our statistics much.
118: 
119: The estimated probability distribution functions of in- and out-degree are presented on a
120: log-log scale in Fig.~\ref{fig:degrees}.  The estimated means
121: of the numbers of outgoing and incoming friendship links are $\langle k_{out}\rangle =
122: 15.91$ and $\langle k_{in}\rangle = 16.07$, respectively. The
123: average in-to-out ratio is $\langle k_{in} / k_{out}\rangle =
124: 1.157$. The number of incoming links is slightly larger than the
125: number of outgoing links because only the outgoing links were
126: used for crawler navigation, so some of the LJ users were unreachable by
127: directional links but they were listed on the users' pages. 
128: 
129: There are also several technical restrictions on the degrees: the maximum
130: number of friends per user is 750 and only 150 of them can be listed on the user's info
131: page and can be effortlessly accessed by the LJ users. In our experience the LJ bot interface
132: has problems listing the users who consider a certain
133: user as a friend if there are more than 2500 of them; hence we cut the
134: data at $k_{in\: max} = 2500$. 
135: 
136:  As one can see from Fig.~\ref{fig:degrees}, the in- and out-degree distributions
137: reveal a power-law decay $P(k) \sim k^{-\gamma} $ for $k > 100$ with 
138: the value of the exponent $\gamma_{in} \approx \gamma_{out} =  3.45 \pm 0.05$, which is
139: surprisingly close to the values $\gamma_{in} \approx \gamma_{out} \approx
140: 3.4$ obtained by Liljeros {\em et al.} for sexual
141: contacts \cite{liljeros:2001}. The scaling of the distributions contradicts the
142: results of Nowell {\em et al.} \cite{nowell:pnas},
143: who reported a parabolic shape of the LJ degree distributions.
144: The skewness of 
145: the distributions in our case can be explained by the social origin of
146: LJ network. As pointed out by Jin {\em et
147:   al.} \cite{jin:structure}, the degree distribution of social
148: networks does not appear to follow a power law because of the
149: cost in time and effort of maintaining a friendship. In the case of
150: the LJ network the cost of friendship is the size of the friends feed, which
151: accumulates all the recent entries of the user's friends. We can also
152: distinguish two classes of LJ users: ``readers'' and ``writers''. The former
153: mainly use their accounts to read the journals of others. They update
154: their journals only episodically and are not deeply involved in LJ
155: community life. They do not have many incoming and outgoing links and
156: they are responsible for the skewness of the distributions for $k <
157: 100$. Meanwhile active ``writers'', a minority of the registered users,
158: exploit the full capacity of the LJ system. They spend much time participating
159: in LJ community life, and they have a larger number of incoming and
160: outgoing links, which follow a power-law distribution.
161: 
162: The origin of the power-law region in the distributions can be explained by
163: continuous evolution and self-organization of the LJ network and a preferential attachment
164: mechanism similar to the general WWW growth mechanism
165: \cite{barabasi:1999}. Once an interesting journal gets popular it will
166: be cited and promoted in 
167: the journals of its readers, which helps to further increase its
168: popularity and leads to a ``rich-get-richer'' effect occurring
169: in many network systems
170: \cite{barabasi:1999,dorogovtsev:review}. However, linear growth
171: with a linear preferential attachment protocol leads to a power-law
172: degree distribution with $\gamma = 3$, which is smaller than the exponent
173: obtained in our study. Larger values of the exponent can be
174: explained by alternative growth mechanisms: preferential attachment
175: with rewiring \cite{albert:topology00} and the copying mechanism
176: \cite{kleinberg:web99,kumar:web00}. Rewiring in the LJ system implies that
177: users are not only establishing new friendship links but also breaking
178: old ones, while copying occurs when a user inherits part of the
179: friendship connections of his friends. The latter effect is called
180: ``transitivity'' in sociology \cite{wasserman:94} and is responsible for
181: the formation of user cliques, or clustering.
182: 
183: We characterize clustering of LJ users by calculating the clustering
184: coefficient as introduced by Watts and Strogatz 
185: \cite{watts:1998,albert:review}. It is defined as the number of links between a user's friends divided
186: by the maximum possible number of links between them, averaged over all
187: users in the network. If user $i$ has $k_i$
188: friends with $E_i$ links between them, the maximum possible number of
189: directed links is $k_i (k_i - 1)$ and the clustering coefficient for
190: user $i$ in the case of a directed network can be defined as:
191: \begin{equation}
192: C_i = \frac{E_i}{k_i (k_i - 1)}.
193: \label{eq:clustering}
194: \end{equation}
195: The average clustering coefficient for the whole network as calculated
196: from our data is $C = \langle C_i \rangle_{i=1..N} \approx 0.3302$. It is worth
197: comparing this value to the clustering coefficient of a random
198: directional Erd\H{o}s-R\'{e}nyi graph, which can be found as $C_{rand}
199: = \langle k \rangle / (N - 1)$ and for the LJ network is ca. $4.24 \cdot 10^{-6}$. The fact that the actual 
200: clustering coefficient of the LJ network is nearly five orders of magnitude larger than
201: would be expected for a randomly linked network with the same mean degree and
202: size is a clear indication of high user clustering.
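As a rough check of these figures, here is a sketch that uses the rounded values quoted above and assumes that $\langle k \rangle$ is taken as $\langle k_{out} \rangle$ (the text does not state which mean degree was used):
\begin{equation*}
C_{rand} \approx \frac{15.91}{3\,746\,264 - 1} \approx 4.2 \cdot 10^{-6},
\qquad
\frac{C}{C_{rand}} \approx \frac{0.3302}{4.2 \cdot 10^{-6}} \approx 8 \cdot 10^{4}.
\end{equation*}
The ratio is close to $10^5$, in line with the estimate of nearly five orders of magnitude.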
203: 
204: 
205: The peculiar feature of the LJ network is the high 
206: reciprocity \cite{wasserman:94} of friendship links. We found that 79.26\% of links 
207: are bi-directional, which means that this percentage of outgoing links 
208: is returned as incoming and, {\em vice versa},
209: the same percentage of incoming links originates from the user's friends.
210: This value is higher than the reciprocity of 57\% found for the WWW
211: \cite{newman:email02}, which is the technical environment of LJ. The higher
212: reciprocity may be explained by the social origin of the LJ network. Due to
213: the rules of social interaction user $A$ usually feels obliged to establish a
214: friendship connection to user $B$ if such a connection was already
215: established by $B$ to $A$. Another explanation for the high reciprocity is
216: that relations in  
217: LJ space are often based on real-life relations, which means that
218: LJ users link to other users who are their friends in the real
219: world. In this case the LJ network directly inherits the undirected
220: structure of the underlying social network.
221: 
222: 
223: \begin{figure}
224: \centering
225:   \includegraphics[width=\linewidth]{fig2.eps}
226:   \caption{Probability distribution function of the minimum path length
227:     between LiveJournal users through the directional friends links.}
228:   \label{fig:path_distance}
229: \end{figure}
230: 
231: 
232: In order to characterize the small-world properties of the LJ network we estimated the
233: probability distribution function $P_\ell(\ell)$ of the minimum path distance,
234: or hopcount, between the nodes through directional links. The results are presented in
235: Fig.~\ref{fig:path_distance}. The average distance estimated for
236: our set of data is $\langle \ell \rangle = 5.86$. 
237: \comment{
238: According to the
239: general approach developed by Newman \textit{et al.}
240: \cite{newman:random} an average path length can be estimated using the
241: following expression:
242: \begin{equation}
243: \mathcal{\ell} = \frac{ln(N / z_1)}{ln(z_2/z_1)} + 1,
244: \end{equation}
245: where $N$ is the size of the network and $z_1 = \langle k_{out} \rangle $ and $z_2$ is the number
246: of the first and the second neighbours. From this we obtained $\ell \approx
247: 4.3$ which is significantly smaller than the value obtained from
248: the distribution. We are considering this as a first sign of structure
249: within LJ network. 
250: }
251: Based on the recently obtained expression for the mean distance between
252: the nodes in scale-free networks by Hooghiemstra \textit{et al.} \cite{hoog:mean05},
253: who improved the widely used result of Newman \textit{et
254:   al.} \cite{newman:random}, the value of $\langle \ell \rangle$ can be estimated
255: as follows:
256: \begin{equation}
257: \langle \mathcal{\ell} \rangle_{th} \approx \frac{\ln N}{\ln \nu} + \frac{1}{2} -
258: \left ( \frac{\gamma_e + \ln \mu - \ln (\nu - 1)}{\ln \nu} \right ) - 2
259: \frac{\epsilon}{\ln \nu},
260: \label{eq:hoog}
261: \end{equation}
262: where $N$ is the size of the network, $\mu = \langle k \rangle$,
263: $\nu = \langle k (k - 1)\rangle / \langle k \rangle$, $\gamma_e
264: \approx 0.577$ is the Euler-Mascheroni constant, and $\epsilon$ is the
265: expectation of the logarithm of the limit of a super-critical
266: branching process which depends on the scaling exponent $\gamma$ 
267: and belongs to the half-open interval $(-1,0]$, where the lower
268:   boundary is a numerical extrapolation of the results from
269:   \cite{hoog:mean05} and the upper boundary is the theoretical prediction
270:   for $\gamma > 3$.
271: 
272: For the LJ data Eq.~\eqref{eq:hoog} gives the following range of the mean distance:
273: $ 4.53 \le \langle \mathcal{\ell} \rangle_{th} < 5.05$, which is in any case
274: smaller than the value obtained from the data. This theoretical prediction
275: assumes homogeneity of the graph, and we believe that a possible reason
276: for this underestimation of the mean path length is the 
277: macroscopic structuring of the network, which is discussed below.
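The width of this range is set entirely by the last term of Eq.~\eqref{eq:hoog}. As a hedged consistency check (the value of $\nu$ is not quoted in the text and is inferred here from the rounded bounds, so it should be read as an assumption):
\begin{equation*}
\frac{2}{\ln \nu} \approx 5.05 - 4.53 = 0.52
\quad \Rightarrow \quad
\ln \nu \approx 3.85, \qquad \nu \approx 47 .
\end{equation*}
With $\epsilon = 0$ the last term vanishes and Eq.~\eqref{eq:hoog} reduces to its first three terms, which gives the lower bound; as $\epsilon \to -1$ the estimate grows by $2 / \ln \nu$ towards the upper bound.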
278: 
279: \begin{figure}
280: \centering
281:   \includegraphics[width=\linewidth]{fig3.eps}
282:   \caption{Illustration of the community detection algorithm. After the
283:     diffusion process starts from the initiator node, virtual ink
284:   propagates through network links. Communities can be recognized as
285:     groups of nodes with similar amounts of ink.}
286:   \label{fig:scheme}
287: \end{figure}
288: 
289: 
290: \section{COMMUNITY DISCOVERING METHOD}
291: 
292: It seems quite natural for the nodes of complex networks
293: to aggregate into macroscopic structures with high internal link
294: density and weak connections to the rest of the network. Such groups are
295: often referred to as communities. Particular reasons for community 
296: formation may depend on the type of the network, but this feature
297: has proved to be quite universal and can be found in social, biological and
298: computer networks \cite{girvan:pnas02,newman:fast}. Finding these
299: structures within the network is a major step towards understanding its topology.
300: 
301: This problem is known as the graph-partitioning problem in graph theory
302: and is NP-hard, which makes exact solutions
303: practically infeasible for large networks. 
304: 
305: Recent advances in the study of complex networks stimulated the search
306: for alternative techniques of community discovery, and many original solutions
307: were proposed
308: \cite{girvan:pnas02,newman:mixing,newman:community,newman:fast,clauset:2004,pons:rw05,wu:physics04,simonsen:diff04}. 
309: These algorithms can be divided into two main classes: \textit{divisive}, which
310: hierarchically split the network by removing edges with the highest
311: betweenness \cite{girvan:pnas02,newman:community}, and
312: \textit{agglomerative}, which start from the maximal community
313: division, in which each node belongs to its own separate community, and
314: successively merge these communities based on some measure of
315: node similarity \cite{wu:physics04,pons:rw05} or by optimizing
316: the partitioning. In their recent work Clauset \textit{et al.}
317: \cite{clauset:2004} used greedy optimization to maximize
318: the \textit{modularity} measure of partitioning quality
319: \cite{newman:community,newman:fast}. Currently this method is one of the
320: fastest and runs in time $O (M H \ln N)$, where $M = \langle k \rangle N$ is the number
321: of edges in the network and $H$ is the number of decomposition levels,
322: which is usually small ($H = O (\ln N)$)
323: \cite{clauset:2004,pons:rw05}. In a sparse network the degree is limited,
324: so $M = O (N)$ and the complexity
325: becomes $O (N \ln^2 N)$.
326: 
327: Here we propose a method to find communities based on the principles
328: of thermodynamics. When a system is large enough, the behavior
329: of its microscopic constituents can be averaged to give
330: a basis for a scientific description of phenomena that avoids
331: microscopic details. Since in thermodynamics the behavior of a system can
332: be described without solving the equation of motion of every
333: constituent molecule, we believe that the structure of a large complex
334: network can be explored without an explicit solution of the graph
335: partitioning problem.
336: 
337: Our current study is based on the simulation of a mass diffusion process in the complex network
338: as in a multi-dimensional porous system with directional links following
339: physical laws. The diffusion process, initiated at one of the nodes by the
340: addition of virtual ink, produces a non-uniform mass distribution at the intermediate stage,
341: which can be used to reveal well-interconnected communities within the
342: complex network by selecting the nodes with similar concentration
343: values. In this sense our method falls in the class of agglomerative
344: techniques with the concentration as the similarity measure. However, it
345: can be shown that the quantity $r_{AB} = | \ln \phi_A - \ln \phi_B |$,
346: where $\phi_A$ and $\phi_B$ are the values of concentration in the nodes $A$
347: and $B$, can serve as a measure of distance between these nodes. Thus
348: edge betweenness, characterized as the drop of the logarithm of concentration
349: along the edge, can be used for hierarchical decomposition of the
350: network.
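Two properties make $r_{AB}$ convenient as a distance; they follow directly from its definition and are sketched here only as an illustration:
\begin{align*}
r_{AB} &= \left| \ln \frac{\phi_A}{\phi_B} \right| = r_{BA} \ge 0, \\
r_{AC} &= | (\ln \phi_A - \ln \phi_B) + (\ln \phi_B - \ln \phi_C) | \le r_{AB} + r_{BC}.
\end{align*}
In addition, $r_{AB}$ is unchanged when all concentrations are multiplied by the same factor, so it does not depend on the units in which the ink is measured. Note that $r_{AB} = 0$ only requires $\phi_A = \phi_B$, not $A = B$, so strictly speaking it is a pseudometric on the nodes.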
351: 
352: A similar measure of distance between nodes, based on the random walk,
353: has been recently introduced by Pons \textit{et al.} \cite{pons:rw05}
354: for the class of undirected networks. It is defined as the difference in probabilities
355: for a random walker to reach the nodes $A$ 
356: and $B$ in a certain number of steps $t$ starting from some node
357: $Z$. As these probabilities for large $t$ are mainly determined by
358: the in-degrees of the nodes, the values of distance should be normalized.
359: A suitably short number of steps $t$ may depend on the particular
360: network and should be known in advance. Pons \textit{et al.} also pointed out
361: conceptual difficulties in applying the random walk scheme to directed
362: networks \cite{pons:rw05}. Several other diffusion-motivated
363: approaches proposed recently (\textit{e.g.}
364: \cite{wu:physics04,simonsen:diff04,fouss:novel05}) are more or
365: less consistent with the random-walk analogy.
366: 
367: In our model we break the similarity with classical random
368: walks and the theory of flows in the graph \cite{diestel:graph} in
369: favour of a realistic physical picture. First, we allow nodes to
370: accumulate substance by assigning to them infinite maximum capacity. 
371: The direct flow from the node $A$ to the node $B$ is possible if there is
372: a directed link from $A$ to $B$ and $\phi_A > \phi_B$. The flow rate
373: in this case depends on the concentration difference $\phi_A - \phi_B > 0$ and the out-degree
374: $k_{out}$ of the node $A$. In the case of $\phi_A < \phi_B$ no mass is
375: delivered directly from $A$ to $B$. In the limit of
376: infinite time such rules lead to an equilibrium state with equal mass distribution,
377: which meets the physical expectations. 
378: 
379: Network links in our realization represent pipes (Fig.~\ref{fig:scheme}); directed
380: links act as pipes allowing mass to pass in one direction only. Mass
381: propagation within the network system is driven by Fick's law of diffusion:
382: \begin{equation}
383: %J  = - D \frac{\delta \phi}{\delta x}
384: dM = - D \frac{\partial \phi}{\partial x} dS dt, 
385: \end{equation}
386: where $dM$ is the mass change, $D$ is the diffusion coefficient, $\partial \phi / \partial x$ is the
387: concentration gradient and $dS$ is an area element.
388: 
389: For our discrete system this implies that the rate of mass
390: exchange between the neighbouring nodes is proportional to the difference of masses in these
391: nodes. Every node uses its outgoing links to deliver mass to its neighbors with
392: a smaller amount of ink. The amount of ink $\Delta_{out} M_i$
393: delivered by the node to its $i$th neighbour is:
394: \begin{equation}
395: \Delta_{out} M_i = - \frac{\alpha}{k_{out}} (M_0 - M_i),
396: \label{eq:main}
397: \end{equation}
398: where $M_0 > M_i$ is the mass in the delivering node and $\alpha$ is the coefficient determining the
399: transfer rate, which is the same for all
400: nodes. We analyze the mass $M$ contained in the node instead of
401: the concentration $\phi$ assuming that all nodes have the same
402: geometrical volume.
403: The total delivered mass for a node is the following:
404: \begin{multline}
405: \Delta_{out} M = \sum_{i=1}^{k_{out}} \Delta_{out} M_i =  - \alpha \left ( M_0 - \frac{1}{k_{out}}
406: \sum_{i=1}^{k_{out}} M_i \right) =  \\
407: - \alpha ( M_0 - \overline{M} ),
408: \end{multline}
409: where $\overline{M}$ is the mean ink mass in the neighbouring
410: nodes with smaller masses. Mass transfer in the pipe happens
411: instantaneously.
412:  Thus we can apply the mass conservation
413: law and increase the mass in the neighbouring nodes by the amount
414: taken from the node:
415: \begin{eqnarray}
416: \Delta_{out} M & = & - \sum_{i=1}^{k_{out}} \Delta_{in} M_i \\
417: \Delta_{in} M  & = & - \sum_{i=1}^{k_{in}} \Delta_{out} M_i 
418: \end{eqnarray}
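As a minimal numerical illustration of Eq.~\eqref{eq:main} (all values below are hypothetical and chosen only for easy arithmetic), consider a node with mass $M_0 = 10$ and $k_{out} = 2$ neighbours holding $M_1 = 2$ and $M_2 = 4$, with $\alpha = 0.1$:
\begin{align*}
\Delta_{out} M_1 &= - \frac{0.1}{2} (10 - 2) = -0.4, \\
\Delta_{out} M_2 &= - \frac{0.1}{2} (10 - 4) = -0.3, \\
\Delta_{out} M &= -0.7 = - 0.1 \, ( 10 - 3 ),
\end{align*}
where $\overline{M} = 3$. If no other transfers take place in this step, the node then holds $9.3$ and the neighbours hold $2.4$ and $4.3$, so the total mass of $16$ is conserved.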
419: 
420: The total change of mass at a certain node is composed of the loss of mass due
421: to diffusion to the neighbours through outgoing links and gain of mass
422: by the
423: amount delivered from neighbors through
424: incoming links: $\Delta M = \Delta_{in} M + \Delta_{out} M$. This
425: conservation law is the extension of Kirchhoff's
426: law \cite{diestel:graph} for the node with non-zero capacity.
427: 
428: To avoid artifacts of sequential node processing, mass changes
429: for all nodes were calculated without actually changing the masses, and then the
430: values of the masses in all nodes were updated. In the special case of no outgoing
431: links, $\Delta_{out} M = 0$, and the node acts as a virtual ink absorber which can
432: only gain ink from the neighbours but has no way to deliver it
433: back. Nodes without incoming links are not 
434: considered because they are invisible to the data-collecting crawler and
435: thus are absent from our database.
436: 
437: We start by putting an initial amount of ink of $M_0 = N$ mass
438:  units in one of the nodes, which we call the \textit{initiator}. Subsequently the system is
439: allowed to proceed to the equilibrium state by continuous mass
440: redistribution within the network according to our rules. The
441:  expected 
442: equilibrium state for a connected network is an equal
443: distribution of the mass $M_0$ among the nodes, so that each of 
444: them ends up having $M_0 / N = 1$ mass units. While evolving to this state the system
445: passes through non-equilibrium states with non-uniform mass
446: distributions.
447: 
448: Imagine a cluster of well-connected nodes inside the network,
449: connected to the outside world only by a few outgoing and
450: incoming links. The ink diffusion inside the cluster is relatively fast due to the
451: presence of a large number of exchange channels between the 
452: members and the high conductivity of the channel
453: ensemble. The limited number of channels going outside the cluster forms a
454: bottleneck for mass delivery. Under these conditions the flow rate between
455: the members is much higher than between the members and non-members and dispersed ink will
456: likely form an equi-concentrational \comment{Phys. Rev. E 49, 5431--5437
457:   (1994)} volume within the cluster. 
458: Each cluster in this system with
459: specific connection properties such as flow rate and distance from
460: the initiator would have in each of its
461: nodes the same concentration of ink with the value specific to the 
462: particular cluster. Thus by estimating the probability distribution
463: function of concentration one can analyze non-uniformity of ink
464: distribution and reveal separated clusters by determining the
465: signatures of equi-concentration volumes.
466: 
467: \begin{figure}
468: \centering
469:   \includegraphics[width=1.05\linewidth]{fig4.eps}
470:   \caption{Dynamics of relative  concentration change in the
471:   initiator node {\em doctor\_livsy} for different flow rates
472:   $\alpha$. Inset shows rescaled data. Oscillatory parts were cut away.}
473:   \label{fig:concentration_decay}
474: \end{figure}
475: 
476: The flow rate $\alpha$ from Eq.~\eqref{eq:main} can be selected
477: from the half-open interval $(0,1]$ and defines the speed of the
478:   simulation. Values larger than 0.5 are not desirable because they
479:   can cause concentration waves or back-reflections in some cases.
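The threshold of 0.5 can be motivated by a minimal two-node sketch (an illustrative special case, not a general proof): let nodes $A$ and $B$ be linked in both directions, each with $k_{out} = 1$, and let $d = M_A - M_B > 0$. Only $A$ delivers ink, in the amount $\alpha d$ according to Eq.~\eqref{eq:main}, so after one step
\begin{equation*}
d' = (M_A - \alpha d) - (M_B + \alpha d) = (1 - 2 \alpha) \, d .
\end{equation*}
For $\alpha \le 0.5$ the difference shrinks without changing sign, while for $\alpha > 0.5$ it changes sign at every step, i.e. the ink overshoots and is reflected back.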
480: 
481: The proposed method does not aim to decompose the whole network into
482: minimal clusters but to reveal significant clusters within the
483: network. As we regard the network as an open system which does
484: not have to be fully described by the existing database, we do not assign
485: a measure of clustering to the whole network, such as the modularity proposed by Newman
486: \cite{newman:mixing,newman:community}. However, we can quantify the
487: isolation of an individual community $i$ by the parameter of 
488: confinement $K_i$, which characterizes the assortative mixing of the
489: individual community. We can define $K_i$ using the notation of
490: Newman \cite{newman:mixing} as follows:
491: \begin{equation}
492: K_i = \frac{e_{ii}}{\sum_j e_{ij}} = \frac{e_{ii}}{b_i},
493: \end{equation}
494: where $e_{ij}$ is the fraction of network edges connecting nodes of
495: the community $i$ to the community $j$ and $\sum_j e_{ij} = b_i$ is
496: the fraction of edges starting from the members of $i$. Thus the 
497: parameter $K_i$ gives the number of links connecting the nodes
498: inside the community $i$ as a fraction  of the total number of links
499: originating from the members of $i$.
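For a concrete reading of $K_i$ (the numbers are hypothetical and serve only as an illustration): if the members of a community $i$ have $1000$ outgoing links in total and $900$ of them point to other members of $i$, then, since the common normalization by the total number of edges cancels in the ratio,
\begin{equation*}
K_i = \frac{900}{1000} = 0.9 .
\end{equation*}
A value of $K_i$ close to 1 thus indicates a strongly confined community, while a small $K_i$ means that most links leave the community.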
500: 
501: \begin{figure}
502: \centering
503: \includegraphics[width=\linewidth]{fig5.eps}
504: \caption{Probability distribution functions of virtual ink
505:   concentration $M$ at two stages of the diffusion process with $\alpha =
506:   0.1$ and {\em doctor\_livsy} as the initiator node. Inset represents
507:   the same data in linear scale. Two well pronounced peaks of two
508:   separated communities are clearly seen.}
509:  \label{fig:profiles}
510: \end{figure}
511: 
512: \begin{figure}
513: \centering
514: \includegraphics[width=\linewidth]{fig6.eps}
515:     \caption{Dynamics of virtual ink distribution within LJ
516:       network as a logarithmically color coded probability
517:       distribution function of the ink concentration (vertical 
518:       axis) and simulation step (horizontal axis). Separation of the
519:       Russian-speaking community (thin upper line, high concentration
520:       values) from the general English-speaking one (thicker lower line, lower
521:       concentration values) can be clearly seen.}
522:     \label{fig:dynamics}
523: \end{figure}
524: 
525: \section{RESULTS AND DISCUSSION}
526: 
527: To test our method we performed ink diffusion simulations using our
528: LJ database starting from different initiator nodes.
529: Fig.~\ref{fig:concentration_decay} shows the relative mass decay as a 
530: function of simulation step number $T$ for the flow rates $\alpha =
531: 0.1$, 0.25 and 0.5. User {\em doctor\_livsy}, with a high number
532: of incoming links, was chosen as the initiator node. As we will show later,
533: this user belongs to an extremely confined Russian-speaking community.
534: The inset of Fig.~\ref{fig:concentration_decay} shows 
535: the same data rescaled with respect 
536: to $\alpha$. As one can see from the match of the rescaled curves, the
537: dynamics of the process does not depend on the flow rate $\alpha$ in this
538: range. The striking feature of the presented data is the obvious
539: step-like form of the curves, which is an effect of the non-homogeneous
540: structure of the LJ network. Flat parts of the $\Delta M / M$ curves
541: correspond to exponential decays of $M$, which are the
542: sign of non-restricted diffusion of ink. The first significant drop of the
543: decay rate happens when $T \alpha \approx 5$, which is equal to
544: twice the radius of the community to which our initiator belongs. This
545: corresponds to the moment when virtual ink 
546: fills the whole community and further expansion of the filled area is
547: impeded by the limited number of links going outside the community.
548: If it takes $T_0$ simulation steps for the virtual ink to reach the
549: borders of the community, it also takes $T_0$ simulation steps for the
550: decay of the concentration gradient to reach the initiator node, and together this
551: gives twice the size of the community.
552: The second drop at $T \alpha \approx 22$ is not well pronounced and
553:  corresponds to the filling of the whole network. 
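The link between a flat $\Delta M / M$ curve and exponential decay can be made explicit with a short sketch, assuming that the relative change per step stays at a constant value $-c$ with $0 < c < 1$ (the symbol $c$ is introduced here only for illustration):
\begin{equation*}
\frac{M(T+1) - M(T)}{M(T)} = -c
\quad \Rightarrow \quad
M(T) = M(0) \, (1 - c)^{T} = M(0) \, e^{T \ln (1 - c)} \approx M(0) \, e^{- c T},
\end{equation*}
where the last approximation holds for $c \ll 1$.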
%

As our community discovery algorithm is based on the detection of
equi-concentration volumes, we calculated the
probability distribution function of $M$ at two stages of
virtual ink diffusion for $\alpha = 0.1$ (Fig.~\ref{fig:profiles}). One
can see two well-pronounced peaks in all plots, which turned out to be the Russian-speaking
community (larger values of mass $M$) and the rest of the LJ network (broader peak at
smaller values of $M$).

The dynamics of the virtual ink distribution is presented in
Fig.~\ref{fig:dynamics}. As can be seen, a distinct separation of the
Russian community peak from the main peak is formed before step
$T \alpha = 50$. At later stages it is quite stable and easily distinguishable up to
iteration $T \alpha = 10^3$, which gives quite a long quasi-stationary stage
that can be used for community detection. It also demonstrates that the
formation of equi-concentration volumes is much faster than the
relaxation of the whole system.

If the initiator node is selected somewhere outside the community, the
splitting of the distribution peak is also observed, but in this case
the average concentration within the Russian community is smaller than
in the rest of the LJ nodes. This supports the expectation that if a
community has a limited number of outgoing links it also lacks
incoming links.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{fig7.eps}
    \caption{Two-dimensional map of the LJ user network obtained from the
    concentration configurations of independent diffusion processes
    started from two initiator nodes, at the stage $T \alpha = 100$.}
    \label{fig:mapping}
\end{figure}

\begin{table*}[t]
\caption{Examples of discovered communities within the LiveJournal
  userspace.}
\label{tab:comms}
\begin{ruledtabular}
\begin{tabular}{ccccl}
%\hline
%\hline
Representative node & Number of users & Specificity & Confinement $K$ &
Comments \\
\hline
{\em doctor\_livsy} & 227314 & 99.89\% & 98.34\% & Russian-speaking
community\footnote{92\% of users have Cyrillic letters in their
  information pages or journals.} \\
{\em future\_visions} & 421 & 98.36\% & 96.22\% & Fandom High Role-Playing
Game community \\
{\em alected} & 262 & 99.21\% & 99.10\% & Leviosa Role-Playing Game community \\
%\hline
%\hline
\end{tabular}
\end{ruledtabular}
\end{table*}


The accuracy of the community discovery scheme can be improved by
simultaneously simulating diffusion from two or more initiator
nodes. Here we assigned two independent concentration values to each
node, one per initiator, and the diffusion processes proceed without
influencing each other. The LJ network can now be mapped as a
probability distribution function of two concentrations, and thus a
community can be localized on a two-dimensional plot,
as shown in Fig.~\ref{fig:mapping} for {\em doctor\_livsy} and
{\em future\_visions} as the initiator nodes. One can see two main separated
peaks corresponding to the major part of the LJ network and the Russian-speaking
community. The abundance of noise-like spots on the map corresponds to
the small, well-separated and well-linked communities existing in the
network, which are well localized.

The selection of nodes from a certain community can be performed by simply
thresholding the values of both concentrations. The group of nodes with
concentration values within the selected range that form a connected
component in the network can be identified as the community.
The ratio of the number of connected nodes to the total number of
users with concentrations within the range defines the
\textit{specificity} of the method.
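
The selection rule can be stated compactly. In the following sketch the notation is introduced for illustration only and is an assumption, not part of the original definitions: $c_i^{(1)}$ and $c_i^{(2)}$ denote the two concentrations of node $i$, and $[a_1,b_1]$, $[a_2,b_2]$ the chosen threshold ranges. The nodes within the range form the set
\[
\mathcal{R} = \left\{ i : a_1 \le c_i^{(1)} \le b_1,\; a_2 \le c_i^{(2)} \le b_2 \right\},
\]
and, if $\mathcal{C} \subseteq \mathcal{R}$ is the connected component identified as the community, the specificity reads
\[
S = \frac{|\mathcal{C}|}{|\mathcal{R}|} .
\]
A value of $S$ close to $1$ means that almost all nodes selected by the thresholds are also connected to each other within the selection.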
%

As a complete analysis of the LJ community structure, as well as of the
reasons for its formation, is beyond the scope of the current paper, we
will not list all user cliques found. However, in Tab.~\ref{tab:comms} we
list the largest LJ community and two smaller
ones together with their parameters. The size of the discovered
Russian-speaking community is of the order of the total number of LJ
users from the Russian Federation according to LJ database statistics \cite{lj:stat}
($232\;241$ users in January 2006). The obvious reason for the separation of
this community, with a very high confinement of $K = 98.34$\%, is
the prevailing usage
of the Russian language. We found by a separate analysis of info pages and
journal entries that 92\% of the users within this community use
the Cyrillic alphabet. The fact that the Russian LJ community differs from
the rest of the LJ network has already been
pointed out by Internet observers (e.g., Ref.~\cite{gorny:RLJ}).
The two other listed communities are examples of the surprisingly popular class of Role-Playing Game
communities formed by virtual users who play characters and
write their journals on behalf of these characters.

\section{CONCLUSIONS}

The LiveJournal friendship network was studied with the general approach
developed for complex networks, and a power-law tail with exponent
$\gamma = 3.45$ was found in the degree distributions. The network
also exhibits the small-world property and high clustering.

To study the community structure we used an original thermodynamic approach.
We found that diffusion in the essentially non-Euclidean geometry
of a complex network with community structure leads to the peculiar
phenomenon of formation of quasi-stationary equi-concentration volumes,
as shown by our simulations. This proves to be very useful
for the detection of well-interconnected groups of nodes. With a limited number of
parallel diffusion processes, which is sufficient for a rough decomposition, our method has
$O(N \ln N)$ complexity (each simulation step analyzes
$\langle k \rangle N$ edges, which is $O(N)$ for a sparse adjacency matrix,
and the required number of steps is proportional to the
diameter of the network, which is $O(\ln N)$). It is currently one of
the fastest algorithms and was applied to a huge directed network of LJ users
containing several million nodes. Obtaining the results presented in this
paper takes only one or two hours of
desktop computer time. Moreover, this method can be applied locally to
a specific part of the network even when complete information about distant
parts of the network is lacking. The sensitivity of the decomposition can be tuned by
increasing the number of initiator nodes, with the limit of complete decomposition
reached when every node acts as the initiator of its own diffusion process.
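
The complexity estimate above can be written as the product of the cost per step and the number of steps. As an illustration, assume a fixed mean degree $\langle k \rangle$ and a number of parallel diffusion processes $n$ that does not grow with $N$ (the symbol $n$ is introduced here and is not part of the original notation):
\[
\underbrace{n \, \langle k \rangle N}_{\text{edge updates per step}} \times \underbrace{O(\ln N)}_{\text{number of steps}} = O(N \ln N) .
\]
Under the same assumptions, letting every node act as an initiator ($n = N$) would raise the estimate to $O(N^{2} \ln N)$, which suggests that the complete decomposition is considerably more expensive than the rough one.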
\acknowledgments

Financial support by the Swiss National Science Foundation is
gratefully acknowledged. We thank Frank Scheffold for helpful
discussion.

\bibliography{/usr/share/texmf/bibtex/bib/base/full}
\end{document}