0805.0307/ms.tex
1: % $Id: ms.tex,v 1.6 2008/04/30 15:56:19 jdietric Exp $
2: % $Date: 2008/04/30 15:56:19 $
3: % $Author: jdietric $
4: % $Revision: 1.6 $
5: 
6: \documentclass{emulateapj}
7: 
8: \bibliographystyle{apj}
9: \bibpunct{(}{)}{;}{a}{}{,}
10: 
11: \begin{document}
12: \title{Disentangling Visibility and Self-Promotion Bias in the
13:   arXiv:astro-ph Positional Citation Effect} 
14: 
15: \author{J.\,P. Dietrich} 
16: \affil{ESO, Karl-Schwarzschild-Stra{\ss}e 2, 85748 Garching b. M\"unchen,
17:   Germany} 
18: \email{jdietric@eso.org}
19: 
20: \begin{abstract}
21:   We established in an earlier study that articles listed at or near
22:   the top of the daily arXiv:astro-ph mailings receive on average
23:   significantly more citations than articles further down the list. In
24:   our earlier work we were not able to decide whether this positional
25:   citation effect was due to author self-promotion of intrinsically
26:   more citable papers or whether papers are cited more often simply
27:   because they are at the top of the astro-ph listing.
28:   Using new data we can now disentangle both effects.
29:   Based on their submission times we separate articles into a
30:   self-promoted sample and a sample of articles that achieved a high
31:   rank on astro-ph by chance and compare their citation distributions
32:   with those of articles in lower astro-ph positions.
33:   We find that the positional citation effect is a superposition of
34:   self-promotion and visibility bias.
35: \end{abstract}
36: \keywords{sociology of astronomy -- astronomical
37:     data bases: miscellaneous}
38: 
39: \section{Introduction}
40: \label{sec:introduction}
41: %
42: In \citet[][Paper
43: I]{2008PASP..120..224D}\defcitealias{2008PASP..120..224D}{Paper I} we
44: studied the effect of an e-Print's placement in the daily
45: arXiv:astro-ph listing on the number of citations it gets. We found
46: that e-Prints appearing at or near the top of the astro-ph mailings
47: receive significantly more citations than articles further down the
48: list. We proposed three non-exclusive effects to explain this
49: \emph{positional citation effect} (PCE). These are defined as:
50: \begin{itemize}
51: \item The Visibility Bias (VB) postulate -- Papers appearing at the
52:   top of the astro-ph listing are seen by more people and thus cited
53:   more often than those further down the list, where the attention of
54:   the astro-ph readers might decrease;
55: \item The Self-promotion Bias (SP) postulate -- Authors tend to
56:   promote their most important works, and thus most citable articles,
57:   by placing them at prominent positions;
58: \item The Geography Bias (GB) postulate -- The submission deadline
59:   preferentially puts those authors at the top of the listing whose
60:   working hours coincide with the submission deadline. This group
61:   already has higher citation counts for other reasons.
62: \end{itemize}
63: The last postulate pertains to the facts that (1) US American authors have
64: a higher fraction of highly cited papers than their European
65: colleagues \citep{2007EurRev..15..3S} and (2) the arXiv submission
66: deadline of 16:00~EST/EDT is within the normal working hours of
67: astronomers in the US, while it is not for European astronomers.
68: 
69: We concluded in \citetalias{2008PASP..120..224D} that GB is not the
70: cause of the observed PCE because the effect is found independently in
71: the samples of European and US authors. We proposed to disentangle VB
72: and SP by using the submission times of e-Prints to group them into
73: two samples: one submitted so shortly after the deadline that it is
74: statistically expected to be self-promoted, and a second one
75: submitted long enough after the deadline to exclude self-promotion.
76: Repeating the citation analysis for both samples then allows one to
77: distinguish between SP and VB.
78: According to information we received at the time of writing
79: \citetalias{2008PASP..120..224D} from arXiv administrators, the
80: initial submission times are not stored. Consequently, we were not
81: able to decide whether the PCE is caused by VB or SP. Meanwhile we
82: were contacted by arXiv staff informing us that, in fact, the original
83: submission times, although indeed not stored as part of an e-Print's
84: record, can be recovered from the server log files. This now enables
85: us to perform the timing analysis proposed in
86: \citetalias{2008PASP..120..224D}.
87: 
88: \section{Analysis}
89: \label{sec:analysis}
90: %
91: We use the same sample as in \citetalias{2008PASP..120..224D}, i.e.,
92: astro-ph e-Prints published between the beginning of July 2002 and the
93: end of December 2005. Citation data for these e-Prints were gathered
94: from NASA's ADS Bibliographic Services\footnote{Access this service
95:   through http://adsabs.harvard.edu/index.html}. We do not correct for
96: the fact that older papers had more time to gather citations than
97: e-Prints published towards the end of the period under investigation
98: here. For every astro-ph e-Print published in one of the core journals
99: of Astronomy (in agreement with \citet{2005IPM....41.1395K} we define
100: these as \emph{The Astrophysical Journal (Letters)} and its
101: \emph{Supplement Series}, \emph{Astronomy \& Astrophysics},
102: \emph{Monthly Notices of the Royal Astronomical Society}, \emph{The
103:   Astronomical Journal}, and \emph{Publications of the Astronomical
104:   Society of the Pacific}) we compute the time $t_\mathrm{s}$ elapsed
105: from the last submission deadline to the submission time of the
106: article to the arXiv server. We ruled out GB as the sole cause of the
107: PCE in \citetalias{2008PASP..120..224D}, but we now restrict our
108: analysis to articles whose first author's first affiliation is in
109: North or South America. We make this choice for two reasons: European
110: authors must self-promote to achieve the top position on astro-ph,
111: which would weaken any VB signal if present, and we want to avoid any
112: residual signal from GB. By restricting our analysis to American
113: (North and South) authors, we avoid a correlation of different citation
114: distributions with submission behavior while at the same time
115: maximizing the sample size.
116: 
117: We perform an analysis of the citation counts similar to the one in
118: \citetalias{2008PASP..120..224D}. The citation distribution is a power
119: law \citep{1998EPJB....4..131R}, which is best analyzed using a Zipf
120: plot. A Zipf plot shows the number of citations of the $r^\mathrm{th}$ most
121: cited paper out of an ensemble of size $M$ versus its rank $r$ or, if
122: several samples of different sizes are to be compared, its normalized
123: rank $r/M$.  Figure~\ref{fig:zipf-timing} shows Zipf plots for three
124: different samples of core journal articles. Two samples contain
125: articles in the first three astro-ph positions: one submitted very
126: shortly after the deadline ($t_\mathrm{s}<300$\,s), very likely aimed
127: at self-promotion, and one submitted evidently without the intent to
128: achieve the ``pole position'' ($t_\mathrm{s} > 5400$\,s). We bin the
129: first three positions because the PCE is much stronger for them than
130: for lower positions, at which it is still significant, and to average
131: out the noise that would dominate in studying individual
132: positions.  The third sample contains articles that appeared at
133: astro-ph positions 26--30 at any submission time.
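
The connection between a power-law citation distribution and the Zipf
plot can be made explicit with a short sketch. The symbols $N(c)$,
$\alpha$, and $A$ are notation introduced here for illustration only.
Assuming that the number of papers with $c$ citations follows
$N(c) \propto c^{-\alpha}$ with $\alpha > 1$ in the tail of the
distribution, the normalized rank of a paper with $c$ citations is
approximately the fraction of papers cited at least as often,
\begin{equation}
  \frac{r}{M} \simeq \frac{1}{M} \int_{c}^{\infty} N(c')\,\mathrm{d}c'
  \propto c^{1-\alpha},
\end{equation}
so that the number of citations of the $r^\mathrm{th}$ most cited
paper scales as
\begin{equation}
  c_r \simeq A \left(\frac{r}{M}\right)^{-1/(\alpha-1)} .
\end{equation}
Under this assumption, a Zipf plot with logarithmic axes is
approximately a straight line of slope $-1/(\alpha-1)$, and samples
that share $\alpha$ but differ in the normalization $A$ appear as
parallel, vertically offset curves.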
134: 
135: \begin{figure}
136:   \plotone{f1.eps}
137:   \caption{Zipf plots for the timing analysis. The $x$-axis shows the
138:     normalized rank of astro-ph postings in their respective samples
139:     after sorting them by citations. The $y$-axis gives the number of
140:     citations. The different color/line-styles represent the different
141:     samples under investigation. The solid red line represents
142:     e-Prints in the first three astro-ph positions posted within the
143:     first 5 minutes after starting a new list. The dashed blue line is
144:     the Zipf law for articles in the same positions but posted more
145:     than 1.5\,h after the deadline. The dotted green line gives all
146:     articles of American authors in positions 26--30 for comparison.}
147:   \label{fig:zipf-timing}
148: \end{figure}
149: 
150: We clearly see that the three curves, while their slopes are roughly
151: equal, are at different loci, corresponding to different
152: normalizations of the citation distribution power law. The highest
153: curve, i.e., the highest normalization of the citation power law,
154: belongs to the sample of articles submitted shortly after the deadline. Articles
155: listed at the top positions but submitted later are cited less often,
156: with the exception of the three most cited articles in this sample,
157: but still considerably more often than articles further down the list.
158: 
159: To quantify the impact of VB and SP we compute the average citations a
160: paper gets in the range $-3.0 < \ln(r/M) < -1.0$. We choose this range
161: to avoid the bulk of mostly ignored papers and especially to avoid
162: being dominated by a few highly cited papers. We find that articles in
163: the early sample are on average cited $34.4\pm1.1$ times, while
164: articles from the later sample are cited $26.2^{+1.3}_{-1.4}$ times.
165: The comparison sample from astro-ph positions 26--30 has a mean
166: citation number of $22.0\pm0.7$. The quoted errors are the 68\%
167: confidence intervals estimated from bootstrap resampling the citation
168: counts in the selected interval. 
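
As a sketch of this estimate, with notation introduced here for
illustration only ($S$, $c_r$, $\bar{c}$, $B$): let $c_r$ be the
citation count of the paper at rank $r$ and $S$ the set of ranks in
the selected range. The mean is then
\begin{equation}
  S = \{\, r : -3.0 < \ln(r/M) < -1.0 \,\}, \qquad
  \bar{c} = \frac{1}{|S|} \sum_{r \in S} c_r .
\end{equation}
The range corresponds to $e^{-3.0} \approx 0.05 < r/M < e^{-1.0}
\approx 0.37$, i.e., it excludes roughly the 5\% most cited papers and
the 63\% least cited ones. The variant of the bootstrap is not
specified above; one common choice, assumed here purely for
illustration, is the percentile method: draw $B$ resamples of size
$|S|$ with replacement from the $c_r$ with $r \in S$, compute the mean
$\bar{c}^{*}_{b}$ of each resample $b = 1, \ldots, B$, and take the
16th and 84th percentiles of the $\bar{c}^{*}_{b}$ as the bounds of
the 68\% confidence interval.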
169: 
170: \begin{figure}
171:   \plotone{f2.eps}
172:   \caption{Average citation numbers for articles in the first three
173:     positions of astro-ph depending on their submission time. The
174:     horizontal error bars denote the width of the sample bins. The
175:     green horizontal line corresponds to the average citations (with
176:     dotted error bars) of e-Prints at positions 26--30.}
177:   \label{fig:timing}
178: \end{figure}
179: 
180: We repeat this calculation for three additional time intervals and
181: plot the result in Fig.~\ref{fig:timing}. We find that after the
182: initial rush of self-promoted papers the citation rates slowly drop
183: for e-Prints submitted later after the deadline. This confirms a
184: contribution of SP to the PCE. We also find that papers submitted more
185: than 1.5\,h after the deadline, i.e., those e-Prints that achieved a
186: high position in the astro-ph listing almost certainly by chance and
187: not by the submitter's intent, are still cited significantly more
188: often ($3\sigma$) than papers further down the list. This shows that
189: VB also contributes to the PCE. We note that these results are
190: independent of the exact binning that is employed. Choosing different
191: bins close to and far away from the submission deadline moves the
192: points in Fig.~\ref{fig:timing} somewhat up and down. The overall
193: result that papers submitted shortly after the deadline have higher
194: citation rates than e-Prints submitted later remains, as does the
195: difference in citation numbers between late articles at the top and
196: articles down the astro-ph listing. The presence of both SP and VB
197: thus does not depend on the exact binning employed.
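
For orientation, the mean citation numbers quoted above can be
combined into a rough decomposition. This is an illustrative reading
only: it assumes that SP and VB add linearly and that the late sample
($t_\mathrm{s} > 5400$\,s) is free of self-promotion, and the symbols
$\Delta_\mathrm{VB}$ and $\Delta_\mathrm{SP}$ are introduced here
solely for this purpose. Using central values without propagating the
quoted uncertainties,
\begin{eqnarray}
  \Delta_\mathrm{VB} & \approx & 26.2 - 22.0 = 4.2 , \\
  \Delta_\mathrm{SP} & \approx & 34.4 - 26.2 = 8.2 ,
\end{eqnarray}
in units of citations per paper, so that the total excess of the early
sample over the comparison sample is $\Delta_\mathrm{VB} +
\Delta_\mathrm{SP} \approx 12.4$.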
198: 
199: 
200: 
201: \section{Summary and Discussion}
202: \label{sec:concl}
203: %
204: We studied the factors contributing to the increased number of
205: citations e-Prints at the top of the daily astro-ph listing receive
206: compared to e-Prints listed further down the astro-ph mailing. By
207: making a timing argument we constructed samples of e-Prints appearing
208: at the three first positions of astro-ph that are either (1) almost
209: certainly submitted with the intent of getting the top spot; or (2)
210: achieved a high position on astro-ph purely by chance; or (3) fall
211: somewhere in between and contain a mixture of both categories.
212: 
213: We found that the sample of self-promoted papers indeed has the
214: highest citation rates. This shows that self-promotion, as a mechanism
215: that preferentially puts intrinsically more citable papers at the top
216: of the astro-ph listings, in fact works. This is not surprising,
217: considering that \citet{2005IPM....41.1395K} found evidence for a
218: self-selection bias in which papers authors post on astro-ph. We, in
219: turn, find a similar effect within the e-Prints on astro-ph.
220: 
221: Arguably, the more important finding of this work is the difference in
222: citation rates between the sample of e-Prints that were not self-promoted and
223: articles appearing much lower in the astro-ph mailing. The citation
224: rates for the late sample are lower than for the self-promoted sample
225: but still significantly higher than for articles lower in the astro-ph
226: mailing. This provides strong evidence for the visibility bias theory
227: that articles are cited more often, not due to some inherent quality
228: they have, but simply because they are at the top of the astro-ph
229: listing.
230: 
231: Citation counts are often used to evaluate the scientific quality of
232: individuals or institutions and hiring or funding decisions are partly
233: based on them. Our finding that a visibility bias exists at the top of
234: arXiv:astro-ph should provide a strong cautionary note concerning the
235: use of such statistics. We also note that the fraction of astro-ph
236: e-Prints submitted very shortly after the deadline increased during
237: the interval under study here. In the second half of 2002, 1.5\%
238: (2.9\%) of all e-Prints were submitted within the first 60\,s
239: (300\,s). In the second half of 2005, these numbers rose to 2.3\%
240: (4.6\%). The chance that this is a statistical fluctuation is smaller
241: than 0.01\%. This change in submission behavior appears to be
242: indicative of a growing feeling in the astronomical community that VB
243: plays a role and that citations are not awarded purely on merit of the
244: work presented in a paper.
245: 
246: One could simply get rid of VB by randomizing the order of the
247: astro-ph listing for every reader. In this way the VB would average
248: out over the readership of astro-ph. However, by doing so one would
249: ignore the underlying problem that leads to VB in the first
250: place. Every day, astronomers are confronted with an enormous amount of
251: new information, which they have to sort, classify, and ultimately
252: decide what is of relevance for their own research. 
253: 
254: Publications are never cited without a reason, i.e., VB does not lead
255: to unjustified additional citations of a paper. Thus, we must draw the
256: conclusion that papers down the astro-ph list are overlooked and not
257: cited when they should be. Any randomization would mitigate the VB
258: problem by changing the set of papers that does not get the attention
259: it deserves, but it would not fix the real problem, i.e., the
260: information overload which astronomers face every day. It is important
261: to realize that the VB effect on citations is only a secondary
262: effect. The primary effect, from which the citation inequality
263: follows, is that researchers are not aware of relevant publications
264: and results in their own field. This is potentially an impediment
265: for science and the real problem that needs fixing. Since we cannot
266: expect the number of publications to decrease, the solution has to be
267: in the way information is presented. 
268: 
269: Only a relatively small subset of e-Prints is relevant to any
270: individual researcher and the daily challenge is to identify these in
271: the much larger astro-ph listing. A possible first step in this
272: direction is the arxivsorter \citep{2007Arxivsorter..M}, which we
273: already mentioned in \citetalias{2008PASP..120..224D}. Arxivsorter
274: aims to sort daily, recent, or monthly astro-ph listings by relevance
275: to an individual reader. The underlying idea is that scientists
276: through co-authorship form an interconnected network of authors. By
277: specifying a few authors relevant to a reader's fields of interest,
278: the ``proximity'' of a new e-Print in the author network can be
279: calculated. This proximity seems to be a good proxy for relevance to a
280: reader's interests.
281: 
282: 
283: \acknowledgements The original submission times of e-Prints were
284: provided by Paul Ginsparg. I thank Bruno Leibundgut, Brice M\'enard,
285: Uta Grothkopf, and the anonymous referee for comments that helped to
286: improve the manuscript. This research has made use of NASA's
287: Astrophysics Data System Bibliographic Services.
288: 
289: \begin{thebibliography}{4}
290: \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
291: 
292: \bibitem[{{Dietrich}(2008)}]{2008PASP..120..224D}
293: {Dietrich}, J.~P. 2008, \pasp, 120, 224
294: 
295: \bibitem[{{Habing}(2007)}]{2007EurRev..15..3S}
296: {Habing}, H. 2007, {European Review}, 15, 3
297: 
298: \bibitem[{{Kurtz} {et~al.}(2005){Kurtz}, {Eichhorn}, {Accomazzi}, {Grant},
299:   {Demleitner}, {Henneken}, \& {Murray}}]{2005IPM....41.1395K}
300: {Kurtz}, M.~J., {Eichhorn}, G., {Accomazzi}, A., {Grant}, C., {Demleitner}, M.,
301:   {Henneken}, E., \& {Murray}, S.~S. 2005, Information Processing and
302:   Management, 41, 1395
303: 
304: \bibitem[{{Magu\'e} \& {M\'enard}(2007)}]{2007Arxivsorter..M}
305: {Magu\'e}, J.-P. \& {M\'enard}, B. 2007, {Arxivsorter documentation, {\tt
306:   http://arxivsorter.org/doc}}
307: 
308: \bibitem[{{Redner}(1998)}]{1998EPJB....4..131R}
309: {Redner}, S. 1998, European Physical Journal B, 4, 131
310: 
311: \end{thebibliography}
312: 
313: 
314: \end{document}
315: