1: %% ****** Start of file template.aps ****** %

2: %%

3: %%

4: %%   This file is part of the APS files in the REVTeX 4 distribution.

5: %%   Version 4.0 of REVTeX, August 2001

6: %%

7: %%

8: %%   Copyright (c) 2001 The American Physical Society.

9: %%

10: %%   See the REVTeX 4 README file for restrictions and more information.

11: %%

12: %

13: % This is a template for producing manuscripts for use with REVTEX 4.0

14: % Copy this file to another name and then work on that file.

15: % That way, you always have this original template file to use.

16: %

17: % Group addresses by affiliation; use superscriptaddress for long

18: % author lists, or if there are many overlapping affiliations.

19: % For Phys. Rev. appearance, change preprint to twocolumn.

20: % Choose pra, prb, prc, prd, pre, prl, prstab, or rmp for journal

21: %  Add 'draft' option to mark overfull boxes with black boxes

22: %  Add 'showpacs' option to make PACS codes appear

23: %  Add 'showkeys' option to make keywords appear

24: \documentclass[aps,pra,twocolumn,groupedaddress,showkeys]{revtex4}

25: %\documentclass[aps,prl,preprint,superscriptaddress]{revtex4}

26: %\documentclass[aps,rmp,twocolumn,groupedaddress]{revtex4}

27:

28: % You should use BibTeX and apsrev.bst for references

29: % Choosing a journal automatically selects the correct APS

30: % BibTeX style file (bst file), so only uncomment the line

31: % below if necessary.

32: %\bibliographystyle{apsrev}

33: \bibliographystyle{plain}

34:

35: %\usepackage{graphicx}% Include figure files

36: \usepackage{graphics}% Include figure files

37:

38: \begin{document}

39:

40: % Use the \preprint command to place your local institutional report

41: % number in the upper righthand corner of the title page in preprint mode.

42: % Multiple \preprint commands are allowed.

43: % Use the 'preprintnumbers' class option to override journal defaults

44: % to display numbers if necessary

45: %\preprint{}

46:

47: %Title of paper

48: \title{Bipartite Yule Processes in Collections of Journal Papers}

49:

50: \author{Steven A. Morris}

51: \email[]{steven.a.morris@okstate.edu}

52: \homepage[]{http://samorris.ceat.okstate.edu}

53: %\thanks{}

54: %\altaffiliation{}

55: \affiliation{

56: Oklahoma State University\\

57: Electrical and Computer Engineering\\

58: Stillwater, OK 74078, USA }

59:

60: \date{\today}

61:

62: \begin{abstract}

63: Collections of journal papers, often referred to as ``citation

64: networks'', can be modeled as a collection of coupled bipartite

65: networks which tend to exhibit linear growth and preferential

66: attachment as papers are added to the collection. Assuming primary

67: nodes in the first partition and secondary nodes in the second

68: partition, the basic bipartite Yule process assumes that as each

69: primary node is added to the network, it links to multiple secondary

70: nodes, and with probability $\alpha$ each new link connects to

71: a newly appearing secondary node. The number of links from a new

72: primary node follows some distribution that is a characteristic of

73: the specific network. Links to existing secondary nodes follow a

74: preferential attachment rule. With modifications to adapt to

75: specific networks, bipartite Yule processes simulate networks that

76: can be validated against actual networks using a wide variety of

77: network metrics. The application of bipartite Yule processes to the

78: simulation of paper-reference networks and paper-author networks is

79: demonstrated and simulation results are shown to mimic networks from

80:  actual collections of papers across several network metrics.

81: \end{abstract}

82:

83: % insert suggested PACS numbers in braces on next line

84: \pacs{02.50.Ey, 87.23.Ge, 89.75.Hc}

85: % insert suggested keywords - APS authors don't need to do this

86: \keywords{bipartite networks, citation networks, Yule process,

87: Simon-Yule process, network growth model, preferential attachment }

88:

89: %\maketitle must follow title, authors, abstract, \pacs, and \keywords

90: \maketitle

91:

92: % body of paper here - Use proper section commands

93: % References should be done using the \cite, \ref, and \label commands

94: \section{Collections of papers as coupled bipartite networks}

95:

96: As shown in Figure~\ref{coupled}, a collection of journal papers constitutes a

97: series of coupled bipartite networks \cite{morris05}. The diagram

98: identifies six direct bipartite

99: networks: 1) papers to paper authors, 2) papers to references, 3)

100: papers to paper journals, 4) papers to terms, 5) references to

101: reference authors, and 6) references to reference journals.

102: Additionally, there are 15 indirect bipartite networks in

103: collections of papers as defined by the diagram. Examples of

104: interesting indirect networks are paper author to reference author

105: networks, and paper journal to reference journal networks, which can

106: be used for author co-citation analysis \cite{white81} and journal

107: co-citation analysis \cite{mccain91}, respectively.

108:

109: \begin{figure}

110: \resizebox{0.45\textwidth}{!}{%

111: \includegraphics{fred.eps}}%

112: \caption{Diagram showing a collection of papers as a series of

113: coupled bipartite networks.\label{coupled}}

114: \end{figure}

115:

116:

117: Modeling the growth of these bipartite networks helps characterize

118: the underlying processes driving a research specialty, such as

119: knowledge accretion, researcher productivity, or collaboration

120: processes. Bipartite growth models produce many network metrics,

121: allowing comprehensive validation of models against real collections

122: of papers.

123:

124: \section{Basic bipartite Yule processes}

125:

126: As originally proposed, Yule processes do not model networks; they

127: simply model the formation of power-law frequency distributions of items

128: \cite{albert02,price76,simon55}. For a bipartite Yule

129: process, assume a bipartite network where nodes fall into two

130: partitions: 1) primary nodes and 2) secondary nodes. Typically,

131: primary nodes are papers while secondary nodes are entities that are

132: associated with papers, such as authors, references,  journals, or

133: terms.

134:

135: Figure 2 shows a diagram of a bipartite paper-reference network,

136: where the primary nodes are papers and the secondary nodes are

137: references, and papers are linked to references by citations.

138:

139: \begin{figure}

140: \resizebox{0.35\textwidth}{!}{%

141: \includegraphics{paper_ref.eps}}%

142: \caption{Diagram showing a bipartite network of papers and the

143: references that they cite.\label{pr}}

144: \end{figure}

145:

146: Figure 3 shows a diagram of a basic bipartite Yule process:

147:

148: \begin{figure}

149: \resizebox{.45\textwidth}{!}{%

150: \includegraphics{basic.eps}}%

151: \caption{Diagram of a basic bipartite Yule process.\label{basic}}

152: \end{figure}

153:

154: \begin{itemize}

155: \item The network grows by adding primary nodes one at a time.

156:

157: \item When a new primary node is added, it links to $N$ secondary nodes.

158:  $N$ is a random deviate drawn from a discrete probability distribution

159: that is a characteristic of the type of network being modeled. For

160: paper-reference networks $N$ is lognormally distributed

161: \cite{morris04a}, while for paper-author networks $N$ is 1-shifted

162: Poisson distributed \cite{goldstein04group,morris04b}. For

163: paper-journal networks, $N$ is unity, since a paper is only linked

164: to one journal, the one in which it was published. As defined here,

165: a primary entity does not link to any specific secondary entity more

166: than once.

167:

168:

169: \item For each of the $N$ links, there is a probability, $\alpha$, that it will link to a newly

170: appearing secondary node.

171:

172: \item Otherwise, the link goes to an existing secondary node, which is selected using

173: preferential attachment; that is, the probability of linking to a

174: secondary node is proportional to the number of links that the node

175: possesses.

176: \end{itemize}
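
As a compact restatement of the rules above, the destination of a single link can be written as follows. This is only a sketch: the symbols $k_j$ (the current number of links held by secondary node $j$) and $M$ (the current number of secondary nodes) are notation introduced here for illustration, and the expression ignores the constraint that a primary node links to a given secondary node at most once.
\begin{equation}
P(\mathrm{new\ node}) = \alpha, \qquad
P(\mathrm{existing\ node}\ j) = (1-\alpha)\,\frac{k_j}{\sum_{i=1}^{M} k_i}.
\end{equation}
For example, with an illustrative $\alpha = 0.5$ and three existing secondary nodes holding 1, 1, and 2 links, the next link would go to a new node with probability 0.5 and to the three existing nodes with probabilities 0.125, 0.125, and 0.25, respectively.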

177:

178:

179: The stationary distribution of the link degree of the secondary

180: nodes is a Yule distribution \cite{johnson92,simon55}, whose tail is a power

181: law with exponent $1+1/(1-\alpha)$. The stationary distribution

182: is independent of the distribution of $N$, but for finite

183: collections of papers the distribution of $N$ profoundly affects the

184: tail of the distribution \cite{morris04a}.
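
For reference, the Yule distribution can be sketched in the form used in Simon's model \cite{simon55}. Here $f(k)$ denotes the fraction of secondary nodes with $k$ links, and $\rho$ and the Beta function $B$ are symbols introduced here rather than taken from the text above:
\begin{equation}
f(k) = \rho\, B(k, \rho+1), \qquad \rho = \frac{1}{1-\alpha}, \qquad
f(k) \sim k^{-(1+\rho)} \quad \mathrm{for\ large}\ k.
\end{equation}
As an illustrative value only, $\alpha = 0.5$ gives $\rho = 2$, so that $f(k) = 4/[k(k+1)(k+2)]$ and the tail exponent is $1 + 1/(1-0.5) = 3$. As $\alpha$ approaches zero the exponent approaches 2.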

185:

186: \section{Practical bipartite Yule processes}

187: In practice, the basic bipartite Yule process outlined in the

188: proceeding section must be modified to account for the

189: characteristics of the specific type of bipartite network being

190: studied.

191:

192: \subsection{Paper-reference Yule process}

193: Figure 4 shows a diagram of a bipartite Yule process modified for

194: the characteristics of paper-reference networks. The details of this

195: model, its scope, and a discussion of the evidence for its validity

196: appear in \cite{morris04a}. Paper-reference networks in collections

197: of papers covering scientific specialties are characterized by the

198: accretion of highly cited exemplar references, which are cited at

199: rates far higher than would be predicted by simple preferential

200: attachment. These exemplar references tend to appear during the

201: initial growth of the network and their rate of appearance decreases

202: exponentially as papers are added to the collection.

203:

204: As each paper is added to the collection, it links to a lognormally

205: distributed number of references, as discussed in \cite{morris04a}.

206: For each reference cited by a paper, there is a probability $\alpha$

207: that the citation is to a newly appearing reference. When a new

208: reference appears, there is a small probability that the reference

209: will be a highly attractive exemplar reference. If so, the reference

210: receives a large initial attraction, $A_0$. Newly created

211: non-exemplar references receive no initial attraction. If a

212: citation is to an existing reference, the probability that any

213: particular existing reference will be cited is proportional to the

214: sum of its attraction plus the number of times it has been cited. A

215: specific reference cannot be cited more than once by a paper.
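
The citation rule for existing references can be summarized as follows, under notation assumed here rather than taken from \cite{morris04a}: $k_j$ is the number of citations that reference $j$ has received so far, and $A_j$ is its initial attraction, equal to $A_0$ for an exemplar reference and zero otherwise. Ignoring the restriction against citing the same reference twice from one paper,
\begin{equation}
P(\mathrm{cite\ existing\ reference}\ j) = (1-\alpha)\,\frac{A_j + k_j}{\sum_{i}\left(A_i + k_i\right)}.
\end{equation}
With a purely hypothetical value of $A_0 = 20$, an exemplar reference that has been cited once carries weight 21 and is therefore 21 times as likely to be selected as a non-exemplar reference that has been cited once.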

216:

217: \begin{figure}

218: \resizebox{.45\textwidth}{!}{%

219: \includegraphics{paper_ref_flowchart.eps}}%

220: \caption{Diagram showing a bipartite Yule process for

221: paper-reference networks.\label{prproc}}

222: \end{figure}

223:

224: \subsection{Paper-author Yule process}

225: Figure 5 shows a diagram of the basic bipartite Yule process

226: modified for the characteristics of paper-author networks. The

227: details of this model, its scope, and a discussion of the evidence for

228: its validity appear in \cite{goldstein04group,morris04b}.

229: In this case the Yule process is applied to teams

230: of researchers rather than individual researchers. As each paper is

231: added, there is a probability   that the paper will be authored by a

232: new research team.  If so, a team of $N_G$ authors is added to the

233: network, but only $N(\lambda)$ appear as authors of the team's first

234: paper, where $N(\lambda)$ is a random deviate drawn from a 1-shifted

235: Poisson distribution whose parameter is $\lambda$. If the paper is instead

236: authored by an existing team, that team is chosen using preferential attachment;

237: that is, the probability that a team will author the new paper is

238: proportional to the number of papers that the team has previously

239: published.
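
Two ingredients of this step can be written explicitly. The sketch assumes that ``1-shifted Poisson'' means that $N(\lambda)-1$ is Poisson distributed with parameter $\lambda$ (an interpretation adopted here), and it writes $p_g$ for the number of papers previously published by team $g$ (a symbol introduced here):
\begin{equation}
P\left(N(\lambda) = n\right) = \frac{e^{-\lambda}\lambda^{\,n-1}}{(n-1)!}, \quad n = 1, 2, \ldots, \qquad
P(\mathrm{existing\ team}\ g) \propto p_g .
\end{equation}
Under this interpretation the mean of $N(\lambda)$ is $1+\lambda$, and $P\left(N(\lambda) = 1\right) = e^{-\lambda}$.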

240:

241: \begin{figure}

242: \resizebox{0.45\textwidth}{!}{%

243: \includegraphics{paper_auth_flowchart.eps}}%

244: \caption{Diagram showing a bipartite Yule process for paper-author

245: networks.\label{paproc}}

246: \end{figure}

247:

248:

249: When selecting authors for an existing team's paper, $N(\lambda)$

250: authors are chosen using preferential

251: attachment: the probability of selecting an author is

252: proportional to 1 plus the number of papers that the author has

253: published. Inter-team collaborations (weak ties) are modeled as

254: random events; when an existing author is to be selected there is a

255: probability $\beta$ that the author will be drawn randomly from some

256: other team.
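
The author-selection rule above can be sketched as follows. The symbols $p_a$ (the number of papers already published by author $a$) and $T$ (the set of authors in the chosen team) are introduced here, and the handling of $\beta$ reflects one reading of the description rather than the exact procedure of \cite{morris04b}:
\begin{equation}
P(\mathrm{author}\ a) = \frac{1 + p_a}{\sum_{b \in T}\left(1 + p_b\right)}, \qquad a \in T,
\end{equation}
applied with probability $1-\beta$, while with probability $\beta$ the author is instead drawn at random from another team. The added 1 gives team members who have not yet published a nonzero chance of being selected.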

257:

258:

259: \section{Network metrics}

260: Simulation using a bipartite Yule process fully preserves the

261: topology of the network phenomenon being studied. The adjacency

262: matrix for a bipartite network is a roughly lower triangular

263: rectangular matrix. Figure 6 shows the adjacency matrices of the

264: paper-reference network, paper-author network, and paper-journal

265: network in an actual collection of papers.

266:

267: \begin{figure*}

268: \resizebox{1\textwidth}{!}{%

269: \includegraphics{figurematrix.eps}}%

270: \caption{Diagrams of adjacency matrices of bipartite networks in a

271: collection of 902 papers on the topic of complex

272: networks.\label{matrix}}

273: \end{figure*}

274:

275:

276: From each bipartite network, two co-occurrence networks can be

277: derived with their own characteristic topology.  For example, a

278: paper-reference network yields two unipartite networks, a

279: bibliographic coupling network of papers linked by common references

280: and a co-citation network of references linked by their common

281: papers. A paper-author network yields a  collaboration network of

282: authors connected by common papers and also a network of papers

283: connected by common authors.
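
In matrix terms, these co-occurrence networks follow directly from the bipartite adjacency matrix. As a sketch, let $\mathbf{A}$ be the binary papers-by-references matrix, with $A_{ij} = 1$ if paper $i$ cites reference $j$ and $A_{ij} = 0$ otherwise (the symbols $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are notation introduced here). Then
\begin{equation}
\mathbf{B} = \mathbf{A}\mathbf{A}^{T}, \qquad \mathbf{C} = \mathbf{A}^{T}\mathbf{A},
\end{equation}
where the off-diagonal element $B_{ik}$ is the number of references shared by papers $i$ and $k$ (the bibliographic coupling strength) and $C_{jl}$ is the number of papers that cite both references $j$ and $l$ (the co-citation strength). The diagonal elements give the number of references in each paper and the number of citations to each reference. The same construction applied to a papers-by-authors matrix yields the weighted co-authorship network.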

284:

285: Network metrics that characterize a bipartite network can be derived

286: from link degree distributions in the bipartite network and link

287: degree distributions in the associated unipartite co-occurrence

288: networks. Many of these metrics can be tied to indicators of the

289: underlying research process generating the collection of papers.

290:

291: A set of useful metrics for paper-reference networks includes:

292: \begin{itemize}

293: \item \textit{reference per paper distribution} - This tends to be a

294: lognormal distribution whose mean, $m$, is from 15 to 30 references

295: per paper \cite{morris04a}.

296: \item \textit{paper per reference

297: distribution} - This tends to be a power-law distribution with a

298: characteristic exponent that ranges from 2 to 4

299: \cite{naranan71,redner98}.

300: \item \textit{bibliographic coupling strength per

301: paper pair distribution} - This is the link weight distribution of

302: the bibliographic coupling network.

303:

304: \item \textit{co-citation coupling strength per reference pair distribution} -

305: This is the link weight distribution of the co-citation network.

306: \item \textit{bibliographic coupling clustering coefficient

307: distribution} - This is the distribution of the clustering coefficients

308: for the bibliographic coupling network.

309: \end{itemize}

310: In paper-reference networks, the mean references per paper is

311: typically about 30, while the mean papers per reference is typically

312: about 1.4, the mean of a zeta (pure power-law) distribution with

313: exponent of 3. This constrains the ratio of references to papers in

314: the collection to be about 20, that is, a collection of papers

315: typically has about 20 times more references than papers.
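
The arithmetic behind these figures can be checked as follows. For a zeta distribution with exponent $s$, $P(k) = k^{-s}/\zeta(s)$, the mean is $\zeta(s-1)/\zeta(s)$. Because every citation is counted once from the paper side and once from the reference side, $n_p m_r = n_r m_p$, where $n_p$ and $n_r$ are the numbers of papers and references, $m_r$ is the mean references per paper, and $m_p$ is the mean papers per reference (all of these symbols are notation introduced here). Hence
\begin{equation}
m_p = \frac{\zeta(2)}{\zeta(3)} \approx \frac{1.645}{1.202} \approx 1.37, \qquad
\frac{n_r}{n_p} = \frac{m_r}{m_p} \approx \frac{30}{1.4} \approx 21 .
\end{equation}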

316:

317: A set of useful metrics for paper-author networks includes:

318: \begin{itemize}

319: \item \textit{authors per paper distribution} - This tends to be a 1-shifted

320: Poisson distribution whose mean varies from 2 for fields such as

321: mathematics to more than 10 for biomedical fields \cite{morris04b}.

322: \item \textit{paper per author distribution} - This tends to be a

323: power-law (Lotka's Law), whose exponent ranges from 2 to 4

324: \cite{lotka26}.

325: \item \textit{collaborating author distribution} - This is the

326: distribution of the number of unique co-authors per author in the

327: collection, and is the link degree distribution of the unweighted

328: co-authorship network.

329: \item \textit{co-authorship per author pair

330: distribution} - This is the link weight distribution of the weighted

331: co-authorship network.

332: \item \textit{co-authorship clustering coefficient

333: distribution} - This is the distribution of the clustering coefficients of the unweighted

334: co-authorship network.

335: \item \textit{minimum co-authorship path length

336: distribution} - This is the distribution of minimum path lengths

337: between author pairs in the unweighted co-authorship network.

338: \end{itemize}

339:

340: \section{Examples}

341: \subsection{Example simulation of paper-reference network}

342: The Yule model for paper-reference networks was tested on a

343: collection of papers that cover the topic of complex networks.  This

344: collection was gathered on September 8th, 2003 from ISI's Web of

345: Science product using a series of queries to find all papers that

346: cite key references and authors in the specialty.  The collection

347: contains 902 papers with 31355 citations to 19185 references.  The

348: Yule parameter, $\alpha$, estimated by dividing the number of

349: references by the number of citations to references, is 0.61.  The

350: mean references per paper is 34.8. The parameters used for the

351: bipartite Yule simulation of this collection can be found in

352: \cite{morris04a}.
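
These values follow from the counts just quoted. Since each citation creates a new reference with probability $\alpha$, the expected number of references is $\alpha$ times the number of citations, which motivates the estimate (written here as $\hat{\alpha}$, a symbol introduced for illustration):
\begin{equation}
\hat{\alpha} = \frac{19185}{31355} \approx 0.61, \qquad
\frac{31355}{902} \approx 34.8, \qquad \frac{19185}{902} \approx 21.3 .
\end{equation}
The second quantity is the mean number of references per paper, and the third is the number of references per paper in the collection, which is close to the typical ratio of about 20 noted earlier.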

353:

354: \begin{figure*}

355: \resizebox{.9\textwidth}{!}{%

356: \includegraphics{figure7.eps}}%

357: \caption{Comparison plots of paper per reference frequency (upper

358: left), bibliographic coupling strength frequency (upper right),

359: co-citation strength frequency (lower left), and bibliographic

360: coupling clustering coefficient distribution (lower right), from a

361: collection of 902 papers on the topic of complex networks.

362: \label{pr_results}}

363: \end{figure*}

364:

365: Figure 7 shows plots comparing network metrics from the actual data

366: to a Yule simulation of network growth. The upper left plot is of

367: papers per reference frequencies. The power-law exponents obtained by maximum likelihood

368: estimation (MLE) are 3.0 for the actual

369: frequencies, and 2.85 for the simulation. The paper-reference Yule

370: process mimics the phenomenon of exceptionally highly cited exemplar

371: references in the extreme lower right of the plot.  The upper right

372: plot is of frequency of bibliographic coupling strength per paper

373: pair. The Yule process-based simulation frequencies match the actual

374: frequencies well. The series of high bibliographic coupling strength

375: pairs in the lower right from actual data corresponds to pairs of

376: review papers with long lists of almost identical references, a

377: phenomenon not modeled by the Yule process. The lower left plot of

378: Figure 7 is of frequency of co-citation strength per reference pair.

379: The simulated frequencies match the actual frequencies well across

380: the whole plot. The lower right plot is of bibliographic coupling

381: clustering coefficient distribution. The simulated distribution

382: matches the shape and scale of the actual data.
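
The estimator used for the exponents quoted above is not specified here. One standard choice, shown purely as an assumption, is the discrete (zeta) maximum likelihood estimate. For observed papers-per-reference counts $k_1, \ldots, k_n$ and the model $P(k) = k^{-s}/\zeta(s)$, with $s$ and $n$ being notation introduced here, the log-likelihood is
\begin{equation}
\ell(s) = -n \ln \zeta(s) - s \sum_{i=1}^{n} \ln k_i ,
\end{equation}
and the estimate $\hat{s}$ is the value of $s > 1$ that maximizes $\ell(s)$, which in general must be found numerically.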

383:

384: \subsection{Example simulation of a paper-author network}

385: The Yule model for paper-author networks was tested on three

386: collections of papers representing specialties with a wide range of

387: collaboration intensities. A collection of 1391 papers on the topic

388: of distance learning with 51\% single-authored papers represents a

389: specialty with little collaboration. A collection of 900 papers on

390: the topic of complex networks with 21\% single-authored papers

391: represents a specialty with a typical amount of collaboration.

392: Finally, a collection of 3095 papers on the topic of atrial ablation

393: with 7\% single-authored papers represents a specialty with heavy

394: collaboration \cite{morris04b}. The parameters used for bipartite

395: Yule simulation of these paper-author networks can be found in

396: \cite{morris04b}.

397:

398: Figures 8, 9 and 10 show the comparison of Yule model simulations to

399: actual data for these three collections using two metrics: 1) paper

400: per author frequency (Lotka's Law), and 2) collaborating author

401: frequency.

402:

403: \begin{figure*}

404: \resizebox{.9\textwidth}{!}{%

405: \includegraphics{figure_distance.eps}}%

406: \caption{Comparison of bipartite Yule simulation against actual data

407:  for plots of paper per author frequencies and collaborating author

408:  frequencies for the distance education paper collection.\label{distance}}

409: \end{figure*}

410:

411: \begin{figure*}

412: \resizebox{.9\textwidth}{!}{%

413: \includegraphics{figure_complex.eps}}%

414: \caption{Comparison of bipartite Yule simulation against actual data

415:  for plots of paper per author frequencies and collaborating author

416:  frequencies for the complex networks paper collection.\label{networks}}

417: \end{figure*}

418:

419: \begin{figure*}

420: \resizebox{.9\textwidth}{!}{%

421: \includegraphics{figure_atrial.eps}}%

422: \caption{Comparison of bipartite Yule simulation against actual data

423:  for plots of paper per author frequencies and collaborating author

424:  frequencies for the atrial ablation paper collection.\label{atrial}}

425: \end{figure*}

426:

427: The left plots in Figures 8, 9 and 10 are paper per author frequency

428: plots. The bipartite Yule process produces excellent matches to

429: actual data. The inset plots show Yule model predicted paper per

430: author distributions derived by gathering statistics from 1000

431: simulations for each collection. A line representing an MLE fitted

432: zeta (pure power-law) distribution is shown in each inset. The Yule

433: model produces excellent fits to the zeta distribution for all three

434: collections, confirming the Yule model's usefulness as a predictor

435: of Lotka's Law. Note that the deviation of the distributions from

436: the zeta distribution in the tail of the distributions is due to

437: truncating the simulations at the number of papers in each

438: collection.  The plots on the right side of Figures 8, 9 and 10 show

439: that the bipartite Yule model produces good matches of collaborating

440: author frequencies to actual data across the wide range of

441: collaboration intensities represented by the three collections.

442:

443: \begin{figure}

444: \resizebox{.5\textwidth}{!}{%

445: \includegraphics{couple_ap_p_r.eps}}%

446: \caption{Example of coupled bipartite networks. The paper-author

447: network is coupled to the paper-reference network through common

448: papers. \label{example}}

449: \end{figure}

450:

451: \section{Future work}

452: The research on bipartite Yule processes discussed here will be

453: extended to modeling of coupled bipartite networks. Figure~\ref{example} shows

454: an example of coupled bipartite networks, where a paper-author

455: network is coupled to a paper-reference network through common

456: papers. The challenge is to invent a model that reproduces the

457: correlation of groups of authors to groups of references, a

458: phenomenon that cannot be modeled using two separate bipartite

459: processes.

460:

461:

462:

463: % Put \label in argument of \section for cross-referencing

464: %\section{\label{}}

465: %\subsection{}

466: %\subsubsection{}

467:

468: % If in two-column mode, this environment will change to single-column

469: % format so that long equations can be displayed. Use

470: % sparingly.

471: %\begin{widetext}

472: % put long equation here

473: %\end{widetext}

474:

475: % figures should be put into the text as floats.

476: % Use the graphics or graphicx packages (distributed with LaTeX2e)

477: % and the \includegraphics macro defined in those packages.

478: % See the LaTeX Graphics Companion by Michel Goosens, Sebastian Rahtz,

479: % and Frank Mittelbach for instance.

480: %

481: % Here is an example of the general form of a figure:

482: % Fill in the caption in the braces of the \caption{} command. Put the label

483: % that you will use with \ref{} command in the braces of the \label{} command.

484: % Use the figure* environment if the figure should span across the

485: % entire page. There is no need to do explicit centering.

486:

487: % \begin{figure}

488: % \includegraphics{}%

489: % \caption{\label{}}

490: % \end{figure}

491:

492: % Surround figure environment with turnpage environment for landscape

493: % figure

494: % \begin{turnpage}

495: % \begin{figure}

496: % \includegraphics{}%

497: % \caption{\label{}}

498: % \end{figure}

499: % \end{turnpage}

500:

501: % tables should appear as floats within the text

502: %

503: % Here is an example of the general form of a table:

504: % Fill in the caption in the braces of the \caption{} command. Put the label

505: % that you will use with \ref{} command in the braces of the \label{} command.

506: % Insert the column specifiers (l, r, c, d, etc.) in the empty braces of the

507: % \begin{tabular}{} command.

508: % The ruledtabular enviroment adds doubled rules to table and sets a

509: % reasonable default table settings.

510: % Use the table* environment to get a full-width table in two-column

511: % Add \usepackage{longtable} and the longtable (or longtable*}

512: % environment for nicely formatted long tables. Or use the the [H]

513: % placement option to break a long table (with less control than

514: % in longtable).

515: % \begin{table}%[H] add [H] placement to break table across pages

516: % \caption{\label{}}

517: % \begin{ruledtabular}

518: % \begin{tabular}{}

519: % Lines of table here ending with \\

520: % \end{tabular}

521: % \end{ruledtabular}

522: % \end{table}

523:

524: % Surround table environment with turnpage environment for landscape

525: % table

526: % \begin{turnpage}

527: % \begin{table}

528: % \caption{\label{}}

529: % \begin{ruledtabular}

530: % \begin{tabular}{}

531: % \end{tabular}

532: % \end{ruledtabular}

533: % \end{table}

534: % \end{turnpage}

535:

536: % Specify following sections are appendices. Use \appendix* if there

537: % only one appendix.

538: %\appendix

539: %\section{}

540:

541: % If you have acknowledgments, this puts in the proper section head.

542: %\begin{acknowledgments}

543: % put your acknowledgments here.

544: %\end{acknowledgments}

545:

546: % Create the reference section using BibTeX:

547: \bibliography{yule}

548:

549: \end{document}

550: %

551: % ****** End of file template.aps ******

552: