1: \documentclass[11pt]{article}
2: \usepackage{graphicx}
3: \usepackage{setspace}
4: \singlespacing
5:
6: \newcommand{\pdfig}[2]{
7: \begin{figure}
8: \leavevmode
9: % \centerline{\includegraphics[width=\columnwidth]{figs/#2.pdf}}
10: % arXiv seems to need figures in same directory?
11: % and not like PDF figures?! - accepted, but causes truncation
12: % and not like PNG or JPG figures?! -rejected?!
13: \centerline{\includegraphics[width=\columnwidth]{#2.eps}}
14: \caption{#1}\label{fig:#2}
15: \end{figure}
16: \arabic{figure}}
17:
18:
19: \newcommand{\marg}[1]{\dag
20: \marginpar{\small \em \flushleft
21: {\renewcommand{\baselinestretch}{1.0} \dag #1}}}
22:
23: \newcommand{\GS}{{\em GoogleScholar}\ }
24: \newcommand{\ISI}{{\em ISI}\ }
25:
26: \begin{document}
27:
28: \title{Scientific impact quantity and quality: \\
29: Analysis of two sources of bibliographic data}
30:
31: \author{Richard K. Belew \\
32: Cognitive Science Dept. \\
33: Univ. California -- San Diego\\
34: La Jolla CA 92093-0515 USA}
35:
36: \date{10 April 2005}
37:
38: \maketitle
39:
40: \begin{center}{\bf arXiv\#: CoRR/0504036} \end{center}
41:
42: % ACM-CR Cat: H.3.3 Information Search and Retrieval/Retrieval models; H.3.7 Digital Libraries; H.5.4 Hypertext/Hypermedia
43:
44: \begin{quotation}
45: \begin{center}{\bf Abstract} \end{center}
46:
Understanding the consequence of any individual scientist's
activity within the long-term trajectory of science poses one of the most
difficult questions within the philosophy of science. Because
scientific publications play such a central role in the modern
51: enterprise of science, bibliometric techniques which measure the
52: ``impact'' of an individual publication as a function of the number of
53: citations it receives from subsequent authors have provided some of
54: the most useful empirical data on this question. Until recently,
Thomson/ISI provided the only source of large-scale ``inverted''
bibliographic data of the sort required for impact analysis. At the
end of 2004, Google introduced a new service, GoogleScholar,
58: making much of this same data available. Here we analyze 203
59: publications, collectively cited by more than 4000 other publications.
We show surprisingly good agreement between the citation counts
provided by the two services. Data quality across the systems is
analyzed, and potentially useful complementarities between them are
63: considered. The additional robustness offered by multiple sources of
64: such data promises to increase the utility of these measurements as
65: open citation protocols and open access increase their impact on
66: electronic scientific publication practices.
67:
68: \end{quotation}
69:
70: \pagebreak
71: % \doublespacing
72:
73: \section{Background}
74:
75: Bibliometric analysis of scientific publications goes back to at least
76: the 1970s \cite{REF621,Price86,REF620}; similar analysis of judicial
opinions has been done by Shepard's/LexisNexis for more than a hundred
78: years. The Institute for Scientific Information has made an
79: industry of providing citation data to libraries since the mid-1960s;
80: the products are currently available as part of Thomson/ISI (\ISI).
81: \ISI reports that they currently index 16,000 journals, books and
82: proceedings \cite{garfield98}. While far from exhaustive (ISI estimates that of the
83: 2000 new journals reviewed annually, only 10\% are selected), the
service cites ``Bradford's Law,'' which holds that a relatively small number of
sources capture the bulk of significant scientific results. All
articles appearing in selected publications have their bibliographies
manually transcribed, and ``inverted bibliographies'' pointing from a
(earlier) cited work to all (subsequent) citing publications are
generated to support users' searches. Critically, the translation of these
90: bibliographies into distinct records involves a great deal of {\em
91: manual} effort.
92:
May has reported extensive analyses of British
scientific activity in comparison with other countries, primarily
based on {\em ISI}'s data \cite{may97b,may97a}. In his words, ``The database has many shortcomings and
biases, but overall it gives a wide coverage of most fields'' \cite[p. 793]{may97a}.
His critique of shortcomings in this data is useful:
98: \begin{quotation}
99: Some problems have to do with the compilation of the database. It
100: includes citations of books and chapters in edited books, but it
101: does not include the citations in such publications. Other
102: publications, such as government and other agency reports and
103: working papers, are essentially omitted. It does not cover all
104: significant scientific journals.... Papers that describe technical
105: methods may attract thousands of reflexive citations, while
106: path-breaking papers may be cited only slightly for many years.
107: Review articles can mask the primary papers they review. Citation
108: patterns vary among fields.... Spectacular scientific errors may
109: attract many citations.... Self-citation (which accounts for at
110: least 10\% of all citations) may bias some of the results. \cite[Footnote 3]{may97a}
111: \end{quotation}
Some of these issues (e.g., having to do with the sources being
compiled) can be expected to be altered by new forms of electronic
scientific publication, but others (e.g., self-citation) are likely to
be more intrinsic to scientific authoring processes. It is for this
reason that Google's recent announcement of their Scholar.Google(beta)
({\em GoogleScholar}) service is welcome, as a second, independent source of similar
data.
119:
120: While specifics concerning Google's operation are difficult to come
by, it is reasonable to assume that the process relies on more {\em
122: automatic}, algorithmic procedures than those used by \ISI. Linkage
123: structure among Web pages is analogous in important ways to
124: scientific publication \cite{REF1162,Lawrence98}. These links are
125: captured by Web crawling algorithms as both ``citing'' pages (i.e.,
126: Web pages with HTML anchors pointing to other Web pages) and
127: ``cited'' pages are visited, a feature exploited by Google's original
128: ``PageRank'' retrieval algorithm \cite{Page98}. \GS attempts to bring
129: similar analyses to academic publication, despite the fact that these
130: source documents are often much less accessible.
131:
132: \section{Methods}
133:
Given an author's name\footnote{Translation of an author's name into
  search query string(s) can be ambiguous. In these experiments two
  forms of each author's name were used: the first initial with the
  full last name, and the first and middle initials with the full
  last name.}, both \ISI and \GS
provide search facilities that return a list of publications
putatively authored by this individual, together with the number of
times each of these publications has been cited by other publications
discovered by the service. Six academics were selected at random and
used as ``probe'' queries with both systems.\footnote{These academics
  were all drawn from a single, particularly interdisciplinary
  academic department.} Complete bibliographies of all publications
by these authors were manually reconciled against 203 references to
these publications returned by one or both systems, and then analyzed
in detail. Cumulatively, \ISI discovered 4741 citations to these
publications, and \GS found 4045.
149:
150: Because standards and format of bibliographic citations vary widely
151: across different publications, the process of reconciling citation
152: strings from different papers to the same target publication is
problematic, whether via {\em ISI}'s manual process or Google's automatic
one. It is common, therefore, to find that the same publication has been
treated as more than one record.\footnote{The alternative type of
  error, where citations to multiple, distinct publications are
  confounded as part of the citation record of a single entry, is
  more difficult to identify.}
159:
For example, manual inspection reveals that a single publication in
the ``Proceedings of the 12th Annual Conference of ACM's Special
Interest Group in Information Retrieval (SIGIR)'' is listed as twelve
separate records by {\em ISI}; these are shown in Table 1. While most
164: citations to this target publication have been conveniently collected
165: with respect to two of these records, such noisy data makes impact
166: analysis difficult. In these experiments, a publication's ``impact''
167: is defined as the number of citations found to any of the variations
168: resolved to the published work, i.e., the sum is taken across all
169: records (manually) identified as referencing the same publication.
170:
171: \begin{table}
172: \begin{center}
173: \begin{tabular}{|cl|c|} \hline
174: {\bf PubYear} & {\bf CiteString} & {\bf NCitations} \\ \hline
175: 1989 & 12 ANN INT ACM SIGIR & 1 \\
176: 1989 & 12 ANN INT C RES DEV & 1 \\
177: 1989 & 12TH P ANN INT ACM S 11 & 14 \\
178: 1989 & 12TH P INT C RES DEV & 1 \\
179: 1989 & ACM SIGIR INT C RES & 1 \\
180: 1988 & JUN P ACM SIGIR 88 G 11 & 1 \\
181: 1989 & P 11 INT ACM SIGIR C & 1 \\
182: 1989 & P 12 ANN INT ACM SIG & 2 \\
183: 1989 & P 12 ANN INT ACM SIG 11 & 16 \\
184: 1989 & SIGIR 89 11 & 2 \\
185: 1989 & SIGIR FORUM 23 11 & 1 \\
186: 1990 & SIGOIS B 11 48 & 1 \\ \hline
187: \end{tabular}
188: \caption{Citation variations for same publication}
189: \end{center}
190: \end{table}
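
The impact definition above can be written compactly. As a sketch
(the symbols $I$, $R$ and $c$ are notation introduced here for
illustration, not taken from either service), let $R(p)$ be the set of
records manually resolved to publication $p$, and let $c(r)$ be the
citation count reported for record $r$. Then
\[
  I(p) = \sum_{r \in R(p)} c(r) .
\]
For the twelve records of Table 1 this gives
\[
  I(p) = 1+1+14+1+1+1+1+2+16+2+1+1 = 42 ,
\]
of which the two largest records account for $14 + 16 = 30$
citations, or roughly 71\% of the total.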
191:
192:
193: \section{Results}
194:
Figure \pdfig{Redundant citation noise}{chatter2-ai} shows how well both
systems aggregate individual citations that in fact refer to the
same published paper. This shows the cumulative probability that one,
two, or more entries listed as distinct by each system in
fact refer to the same publication. For example, it shows that more
than 60\% of the articles are represented as unique entries within
{\em ISI}'s listing while 85\% of them are unique with {\em GoogleScholar}. None of the
articles had more than five separate listings within {\em GoogleScholar}, while 13\%
had five or more entries in {\em ISI}'s system (e.g., the example shown in
Table 1 had 12).
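
One way to read this figure, offered as a sketch in notation
introduced here for illustration: let $N_s$ be the number of distinct
records that service $s$ lists for a single publication, and assume
the plotted quantity is the cumulative probability
$F_s(k) = P(N_s \le k)$. The values quoted above then correspond to
\[
  F_{\mathit{ISI}}(1) > 0.60, \qquad F_{\mathit{GS}}(1) \approx 0.85,
\]
\[
  F_{\mathit{ISI}}(4) \approx 1 - 0.13 = 0.87, \qquad F_{\mathit{GS}}(5) = 1 .
\]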
205:
Overlap between the two sources of data was relatively small. Of the
203 publications analyzed, only 78 received at least one
cited reference from each system. However, for this subset the
general pattern of agreement was quite good. Figure
\pdfig{Correlation of \GS and \ISI citation counts}{citeCorr4-ai}
shows the number of citations reported by \GS and \ISI for the subset
of 78 publications. Note that the number of citations is plotted on a
log-log scale, reflecting the well-known power law distribution of
citation counts \cite{redner98}. Based on this sample, there seems to be
good evidence ($r^2 = 0.5023$, $t=8.872$, $p<0.005$) for a power law
relation ($\mathit{GS} = 3.1718 \times \mathit{ISI}^{0.6359}$) relating the number of
citations reported by the two services.
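
Because the fit is a straight line on log-log axes, it may help to
restate it in logarithmic form. The following is only a rearrangement
of the fitted relation quoted above, with rounded arithmetic:
\[
  \log_{10} \mathit{GS} = \log_{10} 3.1718 + 0.6359 \, \log_{10} \mathit{ISI}
  \approx 0.501 + 0.636 \, \log_{10} \mathit{ISI} .
\]
Under this fit, a publication with 10 citations in \ISI would be
expected to have about $3.1718 \times 10^{0.6359} \approx 13.7$
citations in {\em GoogleScholar}, and one with 100 citations in \ISI
about 59. The two predicted counts coincide where
$\mathit{ISI}^{1-0.6359} = 3.1718$, i.e., near 24 citations; below
that point the fitted curve has \GS reporting more citations than
{\em ISI}, and above it fewer. Given $r^2 = 0.5023$, individual
publications can be expected to scatter widely around these values.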
218:
219: Figure \pdfig{Temporal distribution of citations}{yearSumm-ai} shows the
220: cumulative number of citations reported by publication year of the
221: cited work. An alternative criterion for considering the match
222: between systems is to define a ``miss'' to be a publication for which
223: one service has identified three or more citations, but which the
224: other service does not capture whatsoever. Figure \pdfig{Temporal
225: distribution of missing citations}{yearMiss-ai} shows missing
226: citations, found by one service but not the other, again distributed
227: by publication year. \GS seems competitive in terms of coverage for
228: materials published in the last twenty years; before then \ISI seems
229: to dominate.
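
The ``miss'' criterion can be stated more formally. As a sketch, with
$c_A(p)$ and $c_B(p)$ (notation introduced here for illustration)
denoting the total citations to publication $p$ reported by services
$A$ and $B$, publication $p$ counts as a miss for service $B$ when
\[
  c_A(p) \ge 3 \quad \mbox{and} \quad c_B(p) = 0 ,
\]
and symmetrically with the roles of the two services exchanged. The
threshold of three is the one given above; a publication with only one
or two citations in a single service is not counted as a miss.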
230:
231: Coverage with respect to the two systems can also be analyzed by other
232: dimensions of the publications, including publication venue and
233: author. Figure \pdfig{Coverage by publication type}{typeSumm-ai}
234: aggregates publications into four categories: conference publications,
books (or book chapters), journal articles, and other forms of
publications (e.g., technical reports and dissertations); $\chi^2$
237: tests confirm the distributions are distinct. Publications in books
238: (as noted by May, above) and conference proceedings are much more
likely to be available via {\em GoogleScholar}; conversely, journal articles are
better indexed via {\em ISI}. If citations are summarized with respect to
the six authors analyzed, Figure \pdfig{Coverage for individual
  authors}{authSumm-ai} shows that some authors are better represented
by one service than by the other. Such variation is to be
expected, given that some authors, via the publication venues through
which they typically report, will be more or less well-covered by one
service or another. Again, $\chi^2$
tests confirm the distributions are distinct.
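
The exact form of the $\chi^2$ tests is not specified here. As a
sketch only, assuming a standard Pearson test of independence on a
table of citation counts with one row per service and one column per
category (the symbols $O_{ij}$, $E_{ij}$, $R_i$, $C_j$ and $N$ are
notation introduced for illustration), the statistic would be
\[
  \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
  \qquad E_{ij} = \frac{R_i \, C_j}{N},
\]
where $O_{ij}$ is the observed count for service $i$ and category $j$,
$R_i$ and $C_j$ are the row and column totals, and $N$ is the grand
total. Under that assumption the two services and four publication
types give $(2-1)(4-1) = 3$ degrees of freedom, and the two services
and six authors give $(2-1)(6-1) = 5$.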
248:
249: \section{Summary}
250:
251: Evaluating academics' performance, as individuals or as part of larger
252: social groups, in terms of the number of publications they produce is
253: common practice. The ability to quantify their ``impact'' in terms of
254: the number of other publications that subsequently choose to cite
255: their work arguably provides a more refined and relevant measure.
256: Such data is subject, however, to confounding factors ranging from
257: noise in the process of collating and ``inverting'' bibliographic
258: references through intrinsic features of scientific publication (e.g.,
self-citation). The results presented above are therefore reassuring
in that the evidence from \GS provides the first
independent confirmation of impact data previously available only from
{\em ISI}. However, analysis across both systems also shows significant
variations with respect to the two dimensions (authorship and
publication type) considered; other dimensions of variation are
certain to exist. This analysis also revealed some problems common to
both systems. For example, both services support only simple ASCII
encodings of author names, which are likely to lose important character
markup (available via Unicode representations); this loss can be especially
problematic for authors with foreign names.
270:
271: Critically, new services within selected disciplines \cite{acmPortal,ieeeDL},
272: changing standards regarding exchange of ``open citation'' information
273: \cite{CrossRef}, in combination with increased pressure for public access to
274: scientific publications \cite{Zerhouni04}, may soon make some
275: operational difficulties associated with impact analysis obsolete.
276: In the interim, academic deans,
277: science policy advisors and anyone else relying on citation count
278: data are cautioned that any
279: individual measurement requires more context.
280: In the longer term, the increased
281: availability of statistics like bibliographic impact makes it
282: increasingly important to understand how publication and citation
283: activities, within both scientific publication and Web publishing more
284: generally, can be included as part of more holistic evaluations of
285: intellectual contribution \cite{grant00}.
286:
287: \bibliographystyle{plain}
288: \bibliography{iq,/Users/rik/Writing/FOA/Manuscript/foa.bib,/Users/rik/Writing/Biblio/belew.bib}
289:
290: \end{document}
291: