
1: % $Id: ms.tex,v 1.6 2008/04/30 15:56:19 jdietric Exp $

2: % $Date: 2008/04/30 15:56:19 $

3: % $Author: jdietric $

4: % $Revision: 1.6 $

5:

6: \documentclass{emulateapj}

7:

8: \bibliographystyle{apj}

9: \bibpunct{(}{)}{;}{a}{}{,}

10:

11: \begin{document}

12: \title{Disentangling Visibility and Self-Promotion Bias in the

13:   arXiv:astro-ph Positional Citation Effect}

14:

15: \author{J.\,P. Dietrich}

16: \affil{ESO, Karl-Schwarzschild-Stra{\ss}e 2, 85748 Garching b. M\"unchen,

17:   Germany}

18: \email{jdietric@eso.org}

19:

20: \begin{abstract}

21:   We established in an earlier study that articles listed at or near

22:   the top of the daily arXiv:astro-ph mailings receive on average

23:   significantly more citations than articles further down the list. In

24:   our earlier work we were not able to decide whether this positional

25:   citation effect was due to author self-promotion of intrinsically

26:   more citable papers or whether papers are cited more often simply

27:   because they are at the top of the astro-ph listing.

28:   Using new data we can now disentangle both effects.

29:   Based on their submission times we separate articles into a

30:   self-promoted sample and a sample of articles that achieved a high

31:   rank on astro-ph by chance and compare their citation distributions

32:   with those of articles in lower astro-ph positions.

33:   We find that the positional citation effect is a superposition of

34:   self-promotion and visibility bias.

35: \end{abstract}

36: \keywords{sociology of astronomy -- astronomical

37:     data bases: miscellaneous}

38:

39: \section{Introduction}

40: \label{sec:introduction}

41: %

42: In \citet[][Paper

43: I]{2008PASP..120..224D}\defcitealias{2008PASP..120..224D}{Paper I} we

44: studied the effect of an e-Print's placement in the daily

45: arXiv:astro-ph listing on the number of citations it gets. We found

46: that e-Prints appearing at or near the top of the astro-ph mailings

47: receive significantly more citations than articles further down the

48: list. We proposed three non-exclusive effects to explain this

49: \emph{positional citation effect} (PCE). These are defined as:

50: \begin{itemize}

51: \item The Visibility Bias (VB) postulate -- Papers appearing at the

52:   top of the astro-ph listing are seen by more people and thus cited

53:   more often than those further down the list, where the attention of

54:   the astro-ph readers might decrease;

55: \item The Self-promotion Bias (SP) postulate -- Authors tend to

56:   promote their most important works, and thus most citable articles,

57:   by placing them at prominent positions;

58: \item The Geography Bias (GB) postulate -- The submission deadline

59:   preferentially puts those authors at the top of the listing whose

60:   working hours coincide with the submission deadline. This group

61:   already has higher citation counts for other reasons.

62: \end{itemize}

63: The last postulate pertains to the facts that (1) US American authors have

64: a higher fraction of highly cited papers than their European

65: colleagues \citep{2007EurRev..15..3S} and (2) the arXiv submission

66: deadline of 16:00~EST/EDT is within the normal working hours of

67: astronomers in the US, while it is not for European astronomers.

68:

69: We concluded in \citetalias{2008PASP..120..224D} that GB is not the

70: cause of the observed PCE because the effect is found independently in

71: the samples of European and US authors. We proposed to disentangle VB

72: and SP by the following method: using the submission times of e-Prints

73: and grouping them into two samples, one that is submitted so shortly

74: after the deadline that it is statistically expected to be

75: self-promoted, and a second one that is submitted long enough after

76: the deadline to exclude self-promotion, and repeating the citation

77: analysis for both samples, one can distinguish between SP and VB.

78: According to information we received at the time of writing

79: \citetalias{2008PASP..120..224D} from arXiv administrators, the

80: initial submission times are not stored. Consequently, we were not

81: able to decide whether the PCE is caused by VB or SP. Meanwhile we

82: were contacted by arXiv staff informing us that, in fact, the original

83: submission times, although indeed not stored as part of an e-Print's

84: record, can be recovered from the server log files. This now enables

85: us to perform the timing analysis proposed in

86: \citetalias{2008PASP..120..224D}.

87:

88: \section{Analysis}

89: \label{sec:analysis}

90: %

91: We use the same sample as in \citetalias{2008PASP..120..224D}, i.e.,

92: astro-ph e-Prints published between the beginning of July 2002 and the

93: end of December 2005. Citation data for these e-Prints were gathered

94: from NASA's ADS Bibliographic Services\footnote{Access this service

95:   through http://adsabs.harvard.edu/index.html}. We do not correct for

96: the fact that older papers had more time to gather citations than

97: e-Prints published towards the end of the period under investigation

98: here. For every astro-ph e-Print published in one of the core journals

99: of Astronomy (in agreement with \citet{2005IPM....41.1395K} we define

100: these as \emph{The Astrophysical Journal (Letters)} and its

101: \emph{Supplement Series}, \emph{Astronomy \& Astrophysics},

102: \emph{Monthly Notices of the Royal Astronomical Society}, \emph{The

103:   Astronomical Journal}, and \emph{Publications of the Astronomical

104:   Society of the Pacific}) we compute the time $t_\mathrm{s}$ passed

105: from the last submission deadline to the submission time of the

106: article to the arXiv server. We ruled out GB as the sole cause of the

107: PCE in \citetalias{2008PASP..120..224D} but we now restrict our

108: analysis to articles whose first author's first affiliation is in

109: North or South America. The reasons for this choice are that European

110: authors must self-promote to achieve the top position on astro-ph,

111: weakening any VB signal if present, and to avoid any residual signal

112: from GB.  Restricting our analysis to American (North and South)

113: authors we avoid a correlation of different citation distributions

114: with submission behavior while at the same time maximizing the sample

115: size.

116:

117: We perform an analysis of the citation counts similar to the one in

118: \citetalias{2008PASP..120..224D}. The citation distribution is a power

119: law \citep{1998EPJB....4..131R}, which is best analyzed using a Zipf

120: plot. A Zipf plot shows the citations on the $r^\mathrm{th}$ most

121: cited paper out of an ensemble of size $M$ versus its rank $r$ or, if

122: several samples of different sizes are to be compared, its normalized

123: rank $r/M$.  Figure~\ref{fig:zipf-timing} shows Zipf plots for three

124: different samples of core journal articles; two samples of articles in

125: the first three astro-ph positions, one submitted very shortly after

126: the deadline ($t_\mathrm{s}<300$\,s) very likely aimed at

127: self-promotion and one submitted obviously without the intent to

128: achieve the ``pole position'' ($t_\mathrm{s} > 5400$\,s). We bin the

129: first three positions because the PCE is much stronger for them

130: than for lower positions at which it is still significant and to

131: average out the noise that would dominate in studying individual

132: positions.  The third sample contains articles that appeared at

133: astro-ph positions 26--30 at any submission time.
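
As a compact restatement of the selection just described, the three samples can be written in set notation. The position symbol $p$ and the set names $S_\mathrm{early}$, $S_\mathrm{late}$, and $S_\mathrm{ref}$ are labels introduced here for illustration only; all three sets are understood to contain only core journal articles by American first authors.
\begin{eqnarray*}
S_\mathrm{early} &=& \{\, p \le 3,\; t_\mathrm{s} < 300\,\mathrm{s} \,\}, \\
S_\mathrm{late} &=& \{\, p \le 3,\; t_\mathrm{s} > 5400\,\mathrm{s} \,\}, \\
S_\mathrm{ref} &=& \{\, 26 \le p \le 30 \,\},
\end{eqnarray*}
where $300\,\mathrm{s} = 5$\,min and $5400\,\mathrm{s} = 1.5$\,h. E-Prints in the first three positions with $300\,\mathrm{s} \le t_\mathrm{s} \le 5400\,\mathrm{s}$ belong to neither of the two extreme samples.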

134:

135: \begin{figure}

136:   \plotone{f1.eps}

137:   \caption{Zipf plots for the timing analysis. The $x$-axis shows the

138:     normalized rank of astro-ph postings in their respective samples

139:     after sorting them by citations. The $y$-axis gives the number of

140:     citations. The different color/line-styles represent the different

141:     samples under investigation. The solid red line represents

142:     e-Prints in the first three astro-ph positions posted within the

143:     first 5 minutes after starting a new list. The dashed blue line is

144:     the Zipf law for articles in the same positions but posted more

145:     than 1.5\,h after the deadline. The dotted green line gives all

146:     articles of American authors in positions 26--30 for comparison.}

147:   \label{fig:zipf-timing}

148: \end{figure}

149:

150: We clearly see that the three curves, while their slopes are roughly

151: equal, are at different loci, corresponding to different

152: normalizations of the citation distribution power law. The highest

153: curve, i.e., the highest normalization of the citation power law, belongs to the

154: sample of articles submitted shortly after the deadline. Articles

155: listed at the top positions but submitted later are cited less often,

156: with the exception of the three most cited articles in this sample,

157: but still considerably more often than articles further down the list.
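
The connection between equal slopes and different normalizations can be made explicit with a short sketch. Assume, for illustration, that the number of papers with $c$ citations follows $N(c) = A\,c^{-\alpha}$ with $\alpha > 1$ over the range of interest; the symbols $A$, $\alpha$, and $c_0$ are introduced here and are not fitted in this work. The rank of a paper with $c$ citations is the number of papers with at least as many citations,
\begin{equation}
r(c) = \int_c^\infty N(c')\,\mathrm{d}c' = \frac{A}{\alpha-1}\,c^{1-\alpha} .
\end{equation}
Solving for the citations of the paper at rank $r$ and dividing the rank by the sample size $M$ gives
\begin{equation}
\ln c_r = \ln c_0 - \frac{1}{\alpha-1}\,\ln\frac{r}{M} ,
\qquad
c_0 = \left[\frac{A}{(\alpha-1)\,M}\right]^{1/(\alpha-1)} .
\end{equation}
Under this assumption, samples sharing the same $\alpha$ appear as parallel lines in a log-log Zipf plot, and a higher normalization per paper, $A/M$, shifts the line upward without changing its slope.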

158:

159: To quantify the impact of VB and SP we compute the average citations a

160: paper gets in the range $-3.0 < \ln(r/M) < -1.0$. We choose this range

161: to avoid the bulk of mostly ignored papers and especially to avoid

162: being dominated by a few highly cited papers. We find that articles in

163: the early sample are on average cited $34.4\pm1.1$ times, while

164: articles from the later sample are cited $26.2^{+1.3}_{-1.4}$ times.

165: The comparison sample from astro-ph positions 26--30 has a mean

166: citation number of $22.0\pm0.7$. The quoted errors are the 68\%

167: confidence intervals estimated from bootstrap resampling the citation

168: counts in the selected interval.
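
For concreteness, the averaging and the error estimate can be sketched as follows. The set name $R$, the number of resamples $B$, and the percentile form of the bootstrap are assumptions made here for illustration; the text above fixes only the rank range and the 68\% confidence level. With $e^{-3.0} \approx 0.05$ and $e^{-1.0} \approx 0.37$, the selected range runs from roughly the top 5\% to the top 37\% of each sample by citation rank,
\begin{equation}
R = \left\{\, r : -3.0 < \ln\frac{r}{M} < -1.0 \,\right\},
\qquad
\bar{c} = \frac{1}{|R|} \sum_{r \in R} c_r .
\end{equation}
One common bootstrap variant draws, for $b = 1, \ldots, B$, a resample of $|R|$ citation counts with replacement from $\{c_r : r \in R\}$ and computes its mean $\bar{c}^{\,*}_b$. The 68\% confidence interval is then taken between the 16th and 84th percentiles of the $\bar{c}^{\,*}_b$, which naturally allows for asymmetric errors.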

169:

170: \begin{figure}

171:   \plotone{f2.eps}

172:   \caption{Average citation numbers for articles in the first three

173:     positions of astro-ph depending on their submission time. The

174:     horizontal error bars denote the width of the sample bins. The

175:     green horizontal line corresponds to the average citations (with

176:     dotted error bars) of e-Prints at positions 26--30.}

177:   \label{fig:timing}

178: \end{figure}

179:

180: We repeat this calculation for three additional time intervals and

181: plot the result in Fig.~\ref{fig:timing}. We find that after the

182: initial rush of self-promoted papers the citation rates slowly drop

183: for e-Prints submitted later after the deadline. This confirms a

184: contribution of SP to the PCE. We also find that papers submitted more

185: than 1.5\,h after the deadline, i.e., those e-Prints that achieved a

186: high position in the astro-ph listing almost certainly by chance and

187: not by the submitter's intent, are still cited significantly more

188: often ($3\sigma$) than papers further down the list. This shows that

189: VB also contributes to the PCE. We note that these results are

190: independent of the exact binning that is employed. Choosing different

191: bins close to and far away from the submission deadline moves the

192: points in Fig.~\ref{fig:timing} somewhat up and down. The overall

193: result that papers submitted shortly after the deadline have higher

194: citation rates than e-Prints submitted later remains, as does the

195: difference in citation numbers between late articles at the top and

196: articles down the astro-ph listing. The presence of both SP and VB

197: thus does not depend on the exact binning employed.
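
As a rough cross-check of the size of these differences, one can treat the bootstrap intervals quoted for the early, late, and comparison samples as independent Gaussian errors. This is an assumption made here for illustration and not necessarily the procedure behind the quoted $3\sigma$; the symbol $n_\sigma$ is introduced only for this sketch, and the larger of the two asymmetric errors of the late sample is used. For the late sample against the comparison sample,
\begin{equation}
n_\sigma \approx \frac{26.2 - 22.0}{\sqrt{1.4^2 + 0.7^2}} \approx 2.7 ,
\end{equation}
and for the early sample against the late sample,
\begin{equation}
n_\sigma \approx \frac{34.4 - 26.2}{\sqrt{1.1^2 + 1.4^2}} \approx 4.6 .
\end{equation}
Under this simple approximation both differences are well above the noise; the exact values depend on how the asymmetric intervals are handled.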

198:

199:

200:

201: \section{Summary and Discussion}

202: \label{sec:concl}

203: %

204: We studied the factors contributing to the increased number of

205: citations e-Prints at the top of the daily astro-ph listing receive

206: compared to e-Prints listed further down the astro-ph mailing. By

207: making a timing argument we constructed samples of e-Prints appearing

208: at the three first positions of astro-ph that are either (1) almost

209: certainly submitted with the intent of getting the top spot; or (2)

210: achieved a high position on astro-ph purely by chance; or (3) fall

211: somewhere in between and have a mixture of categories.

212:

213: We found that the sample of self-promoted papers indeed has the

214: highest citation rates. This shows that self-promotion as a mechanism

215: that preferentially puts intrinsically more citable papers to the top

216: of the astro-ph listings in fact works. This is not surprising,

217: considering that \citet{2005IPM....41.1395K} found evidence for a

218: self-selection bias in which papers authors post on astro-ph. We, in

219: turn, find a similar effect within the e-Prints on astro-ph.

220:

221: Arguably, the more important finding of this work is the difference in

222: citation rates between the non-self-promoted sample of e-Prints and

223: articles appearing much lower in the astro-ph mailing. The citation

224: rates for the late sample are lower than for the self-promoted sample

225: but still significantly higher than for articles lower in the astro-ph

226: mailing. This provides strong evidence for the visibility bias theory

227: that articles are cited more often, not due to some inherent quality

228: they have, but simply because they are at the top of the astro-ph

229: listing.

230:

231: Citation counts are often used to evaluate the scientific quality of

232: individuals or institutions and hiring or funding decisions are partly

233: based on them. Our finding that a visibility bias exists at the top of

234: arXiv:astro-ph should provide a strong cautionary note concerning the

235: use of such statistics. We also note that the fraction of astro-ph

236: e-Prints submitted very shortly after the deadline increased during

237: the interval under study here. In the second half of 2002, 1.5\%

238: (2.9\%) of all e-Prints were submitted within the first 60\,s

239: (300\,s). In the second half of 2005 these numbers rose to 2.3\%

240: (4.6\%). The chance that this is a statistical fluctuation is smaller

241: than 0.01\%. This change in submission behavior appears to be

242: indicative of a growing feeling in the astronomical community that VB

243: plays a role and that citations are not awarded purely on merit of the

244: work presented in a paper.
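
One way to assess such a change in submission fractions is a two-proportion test, sketched here under stated assumptions; the test actually used for the probability quoted above is not specified, and the symbols $f_1$, $f_2$, $N_1$, $N_2$, and $z$ are introduced only for this illustration. Let $f_1$ and $f_2$ be the fractions of early submissions in the second halves of 2002 and 2005, and $N_1$ and $N_2$ the corresponding total numbers of e-Prints. Then
\begin{equation}
z = \frac{f_2 - f_1}{\sqrt{f\,(1-f)\left(\frac{1}{N_1} + \frac{1}{N_2}\right)}} ,
\qquad
f = \frac{N_1 f_1 + N_2 f_2}{N_1 + N_2} ,
\end{equation}
is approximately standard normal if the underlying fraction did not change. In this approximation a one-sided chance probability below 0.01\% corresponds to $z \gtrsim 3.7$.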

245:

246: One could simply get rid of VB by randomizing the order of the

247: astro-ph listing for every reader. In this way the VB would average

248: out over the readership of astro-ph. However, by doing so one would

249: ignore the underlying problem that leads to VB in the first

250: place. Every day astronomers are confronted with an enormous amount of

251: new information, which they have to sort, classify, and ultimately

252: decide what is of relevance for their own research.

253:

254: Publications are never cited without a reason, i.e., VB does not lead

255: to unjustified additional citations of a paper. Thus, we must draw the

256: conclusion that papers down the astro-ph list are overlooked and not

257: cited when they should be. Any randomization would mitigate the VB

258: problem by changing the set of papers that does not get the attention

259: it deserves but it would not fix the real problem, i.e., the

260: information overload which astronomers face every day. It is important

261: to realize that the VB effect on citations is only a secondary

262: effect. The primary effect, from which the citation inequality

263: follows, is that researchers are not aware of relevant publications

264: and results in their own field. This is potentially an impediment

265: for science and the real problem that needs fixing. Since we cannot

266: expect the number of publications to decrease, the solution has to be

267: in the way information is presented.

268:

269: Only a relatively small subset of e-Prints is relevant to any

270: individual researcher and the daily challenge is to identify these in

271: the much larger astro-ph listing. A possible first step in this

272: direction is the arxivsorter \citep{2007Arxivsorter..M}, which we

273: already mentioned in \citetalias{2008PASP..120..224D}. Arxivsorter

274: aims to sort daily, recent, or monthly astro-ph listings by relevance

275: to an individual reader. The underlying idea is that scientists

276: through co-authorship form an interconnected network of authors. By

277: specifying a few authors relevant to a reader's fields of interest,

278: the ``proximity'' of a new e-print in the author network can be

279: calculated. This proximity seems to be a good proxy for relevance to a

280: reader's interests.

281:

282:

283: \acknowledgements The original submission times of e-Prints were

284: provided by Paul Ginsparg. I thank Bruno Leibundgut, Brice M\'enard,

285: Uta Grothkopf, and the anonymous referee for comments that helped to

286: improve the manuscript. This research has made use of NASA's

287: Astrophysics Data System Bibliographic Services.

288:

289: \begin{thebibliography}{4}

290: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi

291:

292: \bibitem[{{Dietrich}(2008)}]{2008PASP..120..224D}

293: {Dietrich}, J.~P. 2008, \pasp, 120, 224

294:

295: \bibitem[{{Habing}(2007)}]{2007EurRev..15..3S}

296: {Habing}, H. 2007, {European Review}, 15, 3

297:

298: \bibitem[{{Kurtz} {et~al.}(2005){Kurtz}, {Eichhorn}, {Accomazzi}, {Grant},

299:   {Demleitner}, {Henneken}, \& {Murray}}]{2005IPM....41.1395K}

300: {Kurtz}, M.~J., {Eichhorn}, G., {Accomazzi}, A., {Grant}, C., {Demleitner}, M.,

301:   {Henneken}, E., \& {Murray}, S.~S. 2005, Information Processing and

302:   Management, 41, 1395

303:

304: \bibitem[{{Magu\'e} \& {M\'enard}(2007)}]{2007Arxivsorter..M}

305: {Magu\'e}, J.-P. \& {M\'enard}, B. 2007, {Arxivsorter documentation, {\tt

306:   http://arxivsorter.org/doc}}

307:

308: \bibitem[{{Redner}(1998)}]{1998EPJB....4..131R}

309: {Redner}, S. 1998, European Physical Journal B, 4, 131

310:

311: \end{thebibliography}

312:

313:

314: \end{document}

315: