cs0504036/iq.tex
1: \documentclass[11pt]{article}
2: \usepackage{graphicx}
3:  \usepackage{setspace}
4: \singlespacing
5: 
6: \newcommand{\pdfig}[2]{
7: 	\begin{figure}
8: 	\leavevmode	
9: 	% \centerline{\includegraphics[width=\columnwidth]{figs/#2.pdf}}
10:         % arXiv seems to need figures in same directory?
11:         %       and not like PDF figures?! - accepted, but causes truncation
12:         %       and not like PNG or JPG figures?! -rejected?!
13: 	\centerline{\includegraphics[width=\columnwidth]{#2.eps}}
14: 	\caption{#1}\label{fig:#2}
15: 	\end{figure}
16: 	\arabic{figure}}
17: 
18: 
19: \newcommand{\marg}[1]{\dag
20:        \marginpar{\small \em \flushleft
21:                {\renewcommand{\baselinestretch}{1.0} \dag #1}}}
22: 
23: \newcommand{\GS}{{\em GoogleScholar}\ }
24: \newcommand{\ISI}{{\em ISI}\ }
25: 
26: \begin{document}
27: 
28: \title{Scientific impact quantity and quality: \\
29: Analysis of two  sources of bibliographic data}
30: 
31: \author{Richard K. Belew \\
32: Cognitive Science Dept. \\
33: Univ. California -- San Diego\\
34: La Jolla CA 92093-0515 USA}
35: 
36: \date{10 April 2005}
37: 
38: \maketitle
39: 
40: \begin{center}{\bf arXiv\#: CoRR/0504036} \end{center}
41: 
42: % ACM-CR Cat: H.3.3 Information Search and Retrieval/Retrieval models; H.3.7 Digital Libraries; H.5.4 Hypertext/Hypermedia
43: 
44: \begin{quotation}
45: \begin{center}{\bf Abstract} \end{center}
46: 
47: Understanding the consequence of any individual scientist's
48: activity within the long-term trajectory of science is one of the most
49: difficult questions within the philosophy of science.  Because
50: scientific publications play such a central role in the modern
51: enterprise of science, bibliometric techniques which measure the
52: ``impact'' of an individual publication as a function of the number of
53: citations it receives from subsequent authors have provided some of
54: the most useful empirical data on this question.  Until recently,
55: Thomson/ISI has provided the only source of large-scale ``inverted''
56: bibliographic data of the sort required for impact analysis.  At the
57: end of 2004, Google introduced a new service, GoogleScholar,
58: making much of this same data available.  Here we analyze 203
59: publications, collectively cited by more than 4000 other publications.
60: We show surprisingly good agreement between the citation counts
61: provided by the two services.  Data quality across the systems is
62: analyzed, and potentially useful complementarities between them are
63: considered.  The additional robustness offered by multiple sources of
64: such data promises to increase the utility of these measurements as
65: open citation protocols and open access increase their impact on
66: electronic scientific publication practices.
67: 
68: \end{quotation}
69: 
70: \pagebreak
71: % \doublespacing
72: 
73: \section{Background}
74: 
75: Bibliometric analysis of scientific publications goes back to at least
76: the 1970s \cite{REF621,Price86,REF620}; similar analysis of judicial
77: opinions has been done by Shepards/LexisNexis for more than a hundred
78: years.  The Institute for Scientific Information has made an
79: industry of providing citation data to libraries since the mid-1960s;
80: the products are currently available as part of Thomson/ISI (\ISI).
81: \ISI reports that they currently index 16,000 journals, books and
82: proceedings \cite{garfield98}.  While far from exhaustive (ISI estimates that of the
83: 2000 new journals reviewed annually, only 10\% are selected), the
84: service cites ``Bradford's Law'' that a relatively small number of
85: sources capture the bulk of significant scientific results.  All
86: articles appearing in selected publications have their bibliographies
87: manually transcribed, and ``inverted bibliographies'' pointing from a
88: (earlier) cited work to all (subsequent) citing publications are
89: generated to support users' searches.  Critically, the translation of these
90: bibliographies into distinct records involves a great deal of {\em
91:   manual} effort.
92: 
93: May  has reported extensive analyses of British
94: scientific activity in comparison with other countries, primarily
95: based on \ISI's data \cite{may97b,may97a}.  ``The database has many shortcomings and
96: biases, but overall it gives a wide coverage of most fields.''  \cite[p. 793]{may97a}
97: His critique of shortcomings in this data is useful:
98: \begin{quotation}
99:    Some problems have to do with the compilation of the database. It
100:    includes citations of books and chapters in edited books, but it
101:    does not include the citations in such publications.  Other
102:    publications, such as government and other agency reports and
103:    working papers, are essentially omitted.  It does not cover all
104:    significant scientific journals....  Papers that describe technical
105:    methods may attract thousands of reflexive citations, while
106:    path-breaking papers may be cited only slightly for many years.
107:    Review articles can mask the primary papers they review. Citation
108:    patterns vary among fields....  Spectacular scientific errors may
109:    attract many citations....  Self-citation (which accounts for at
110:    least 10\% of all citations) may bias some of the results.  \cite[Footnote 3]{may97a}
111: \end{quotation}
112: Some of these issues (e.g., having to do with the sources being
113: compiled) can be expected to be altered by new forms of electronic
114: scientific publication, but others (e.g., self-citation) are likely to
115: be more intrinsic to scientific authoring processes.  It is for this
116: reason that Google's recent announcement of their Scholar.Google(beta)
117: (\GS) service is welcome, as a second, independent source of similar
118: data.
119: 
120: While specifics concerning Google's operation are difficult to come
121: by, it is reasonable to assume that the process relies on more {\em
122:   automatic}, algorithmic procedures than those used by \ISI.  Linkage
123: structure among Web pages  is analogous in important ways to
124: scientific publication \cite{REF1162,Lawrence98}.  These links are
125: captured by Web crawling algorithms as both ``citing'' pages (i.e.,
126: Web pages with HTML anchors pointing to other Web pages) and
127: ``cited'' pages are visited, a feature exploited by Google's original
128: ``PageRank'' retrieval algorithm \cite{Page98}.  \GS attempts to bring
129: similar analyses to academic publication, despite the fact that these
130: source documents are often much less accessible.
131: 
132: \section{Methods}
133: 
134: Given an author's name\footnote{Translation of an author's name into
135:   search query string(s) can be ambiguous.  In these experiments both
136:   the first initial alone, and the first initial with the middle initial, each combined with
137:   the full last name, were used as the author's name.}, both \ISI and \GS
138: provide search facilities that return a list of publications
139: putatively authored by this individual, together with the number of
140: times each of these publications has been cited by other publications
141: discovered by the service.  Six academics were selected at random and
142: used as ``probe'' queries with both systems. \footnote{These academics
143:   were all drawn from a single, particularly interdisciplinary
144:   academic department.}  Complete bibliographies of all publications
145: by these authors were manually reconciled against 203 references to
146: these publications returned by one or both systems, and then analyzed
147: in detail.  Cumulatively, \ISI discovered 4741 such references and \GS
148: found 4045.
149: 
150: Because standards and format of bibliographic citations vary widely
151: across different publications, the process of reconciling citation
152: strings from different papers to the same target publication is
153: problematic, whether via \ISI's manual process or Google's automatic
154: one.  It is common, therefore, to find the same publication has been
155: treated as more than one record.\footnote{The alternative type of
156:   error, where citations to multiple, distinct publications are
157:   confounded as part of the citation record of a single entry, is 
158:   more difficult to identify.}
159: 
160: For example, manual inspection reveals that a single publication in
161: the ``Proceedings of the 12th Annual Conference of ACM's Special
162: Interest Group in Information Retrieval (SIGIR)'' is listed as twelve
163: separate records by \ISI; these are shown in Table 1.  While most
164: citations to this target publication have been conveniently collected
165: with respect to two of these records, such noisy data makes impact
166: analysis difficult.  In these experiments, a publication's ``impact''
167: is defined as the number of citations found to any of the variations
168: resolved to the published work, i.e., the sum is taken across all
169: records (manually) identified as referencing the same publication.
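
Stated as a formula, and using notation introduced here only for
illustration ($R(p)$ for the set of records manually resolved to
publication $p$, and $c_r$ for the citation count a service reports
for record $r$), this definition amounts to
\[
  \mathrm{impact}(p) = \sum_{r \in R(p)} c_r .
\]
For the SIGIR publication of Table 1, summing the NCitations column
over its twelve \ISI records gives
\[
  1+1+14+1+1+1+1+2+16+2+1+1 = 42 ,
\]
of which $14 + 16 = 30$ come from the two best-populated records.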
170: 
171: \begin{table}
172: \begin{center}
173: \begin{tabular}{|cl|c|}  \hline
174: {\bf PubYear} & {\bf CiteString} & {\bf NCitations} \\ \hline
175: 1989	& 12 ANN INT ACM SIGIR    	& 1 \\
176: 1989	& 12 ANN INT C RES DEV  	&   1 \\
177: 1989	& 12TH P ANN INT ACM S   11	& 14 \\
178: 1989	& 12TH P INT C RES DEV    	& 1 \\
179: 1989	& ACM SIGIR INT C RES    	& 1 \\
180: 1988	& JUN P ACM SIGIR 88 G   11	& 1 \\
181: 1989	& P 11 INT ACM SIGIR C    	& 1 \\
182: 1989	& P 12 ANN INT ACM SIG    	& 2 \\
183: 1989	& P 12 ANN INT ACM SIG   11	& 16 \\
184: 1989	& SIGIR 89   11	& 2 \\
185: 1989	& SIGIR FORUM 23 11	& 1 \\
186: 1990	& SIGOIS B 11 48	& 1 \\ \hline
187: \end{tabular}
188: \caption{Citation variations for same publication}
189: \end{center}
190: \end{table}
191: 
192: 
193: \section{Results}
194: 
195: Figure \pdfig{Redundant citation noise}{chatter2-ai} shows how well both
196: systems aggregate individual citations that in fact refer to the
197: same published paper.  This shows the cumulative probability that one,
198: two, or more entries listed as distinct by each system in
199: fact refer to the same publication.  For example, it shows that more
200: than 60\% of the articles are represented as unique entries within
201: \ISI's listing while 85\% of them are unique with \GS.  None of the
202: articles had more than five separate listings within \GS, while 13\%
203: had five or more entries in \ISI's system (e.g., the example shown in
204: Table 1 had 12).
205: 
206: Overlap between the two sources of data was relatively small.  Of the
207: 203 publications analyzed, only 78 received at least one
208: cited reference from each system.  However, for this subset the
209: general pattern of agreement was quite good.  Figure
210: \pdfig{Correlation of \GS and \ISI citation counts}{citeCorr4-ai}
211: shows the number of citations reported by \GS and \ISI for the subset
212: of 78 publications.  Note that the number of citations is plotted on a
213: log-log scale, reflecting the well-known power law distribution of
214: citation references \cite{redner98}. Based on this sample, there seems to be
215: good evidence ($r^2 = 0.5023$, $t=8.872$, $p<0.005$) for a power law
216: relation ($\mathit{GS} = 3.1718 \times \mathit{ISI}^{0.6359}$) between the numbers of
217: citations reported by the two services.
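
On the logarithmic axes of the plot this fitted relation is a straight
line.  As a sketch of what the reported coefficients imply (the sample
\ISI counts of 10 and 100 are chosen here purely for illustration and
are not taken from the data):
\begin{eqnarray*}
\log \mathit{GS} &=& \log 3.1718 + 0.6359 \log \mathit{ISI} \\
\mathit{ISI} = 10 &\Rightarrow& \mathit{GS} \approx 3.1718 \times 4.32 \approx 13.7 \\
\mathit{ISI} = 100 &\Rightarrow& \mathit{GS} \approx 3.1718 \times 18.7 \approx 59.3
\end{eqnarray*}
Because the fitted exponent is less than one, the fit predicts that
\GS reports somewhat more citations than \ISI for lightly cited
publications and fewer for heavily cited ones.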
218: 
219: Figure \pdfig{Temporal distribution of citations}{yearSumm-ai} shows the
220: cumulative number of citations reported by publication year of the
221: cited work.  An alternative criterion for considering the match
222: between systems is to define a ``miss'' to be a publication for which
223: one service has identified three or more citations, but which the
224: other service does not capture whatsoever.  Figure \pdfig{Temporal
225:   distribution of missing citations}{yearMiss-ai} shows missing
226: citations, found by one service but not the other, again distributed
227: by publication year.  \GS seems competitive in terms of coverage for
228: materials published in the last twenty years; before then \ISI seems
229: to dominate.
230: 
231: Coverage with respect to the two systems can also be analyzed by other
232: dimensions of the publications, including publication venue and
233: author.  Figure \pdfig{Coverage by publication type}{typeSumm-ai}
234: aggregates publications into four categories: conference publications,
235: books (or book chapters), journal articles, and other forms of
236: publications (e.g., technical reports and dissertations); $\chi^2$
237: tests confirm the distributions are distinct.  Publications in books
238: (as noted by May, above) and conference proceedings are much more
239: likely to be available via \GS; conversely, journal articles are
240: better indexed via \ISI.  If citations are summarized with respect to
241: the six authors analyzed, Figure \pdfig{Coverage for individual
242:   authors}{authSumm-ai} shows that some authors are better represented
243: by one service than by the other.  Such variation is to be
244: expected, given that some authors, via the publication venues through
245: which they typically report, will be more or less well-covered by one
246: service or another.  Again,  $\chi^2$
247: tests confirm the distributions are distinct.  
248: 
249: \section{Summary}
250: 
251: Evaluating academics' performance, as individuals or as part of larger
252: social groups, in terms of the number of publications they produce is
253: common practice.  The ability to quantify their ``impact'' in terms of
254: the number of other publications that subsequently choose to cite
255: their work arguably provides a more refined and relevant measure.
256: Such data is subject, however, to confounding factors ranging from
257: noise in the process of collating and ``inverting'' bibliographic
258: references through intrinsic features of scientific publication (e.g.,
259: self-citation).  The results presented above are therefore reassuring
260: in that the new evidence from \GS provides the first
261: independent confirmation of impact data previously available only from
262: \ISI.  However, analysis across both systems also shows significant
263: variations with respect to the two dimensions (authorship and
264: publication type) considered; other dimensions of variation are
265: certain to exist.  This analysis also revealed some problems common to
266: both systems.  For example, both services support only simple ASCII
267: encodings of author names, which are likely to lose important character
268: markup (available via Unicode representations); this can be especially
269: problematic for authors with foreign names.
270: 
271: Critically, new services within selected disciplines \cite{acmPortal,ieeeDL},
272: changing standards regarding exchange of ``open citation'' information
273: \cite{CrossRef}, in combination with increased pressure for public access to
274: scientific publications \cite{Zerhouni04}, may soon make some
275: operational difficulties associated with impact analysis obsolete. 
276: In the interim, academic deans,
277: science policy advisors and anyone else relying on citation count
278: data  are cautioned that any
279: individual measurement requires more context.  
280: In the longer term, the increased
281: availability of statistics like bibliographic impact makes it
282: increasingly important to understand how publication and citation
283: activities, within both scientific publication and Web publishing more
284: generally, can be included as part of more holistic evaluations of
285: intellectual contribution \cite{grant00}.
286: 
287: \bibliographystyle{plain}
288: \bibliography{iq,/Users/rik/Writing/FOA/Manuscript/foa.bib,/Users/rik/Writing/Biblio/belew.bib}
289: 
290: \end{document}
291: