0504:cs0504036/iq.tex

1: \documentclass[11pt]{article}

2: \usepackage{graphicx}

3:  \usepackage{setspace}

4: \singlespacing

5:

6: \newcommand{\pdfig}[2]{

7: 	\begin{figure}

8: 	\leavevmode

9: 	% \centerline{\includegraphics[width=\columnwidth]{figs/#2.pdf}}

10:         % arXiv seems to need figures in same directory?

11:         %       and not like PDF figures?! - accepted, but causes truncation

12:         %       and not like PNG or JPG figures?! -rejected?!

13: 	\centerline{\includegraphics[width=\columnwidth]{#2.eps}}

14: 	\caption{#1}\label{fig:#2}

15: 	\end{figure}

16: 	\arabic{figure}}

17:

18:

19: \newcommand{\marg}[1]{\dag

20:        \marginpar{\small \em \flushleft

21:                {\renewcommand{\baselinestretch}{1.0} \dag #1}}}

22: \usepackage{xspace}

23: \newcommand{\GS}{{\em GoogleScholar}\xspace}

24: \newcommand{\ISI}{{\em ISI}\xspace}

25:

26: \begin{document}

27:

28: \title{Scientific impact quantity and quality: \\

29: Analysis of two  sources of bibliographic data}

30:

31: \author{Richard K. Belew \\

32: Cognitive Science Dept. \\

33: Univ. California -- San Diego\\

34: La Jolla CA 92093-0515 USA}

35:

36: \date{10 April 2005}

37:

38: \maketitle

39:

40: \begin{center}{\bf arXiv\#: CoRR/0504036} \end{center}

41:

42: % ACM-CR Cat: H.3.3 Information Search and Retrieval/Retrieval models; H.3.7 Digital Libraries; H.5.4 Hypertext/Hypermedia

43:

44: \begin{quotation}

45: \begin{center}{\bf Abstract} \end{center}

46:

47: Understanding the consequence of any individual scientist's

48: activity within the long-term trajectory of science is one of the most

49: difficult questions within the philosophy of science.  Because

50: scientific publications play such a central role in the modern

51: enterprise of science, bibliometric techniques which measure the

52: ``impact'' of an individual publication as a function of the number of

53: citations it receives from subsequent authors have provided some of

54: the most useful empirical data on this question.  Until recently,

55: Thomson/ISI has provided the only source of large-scale ``inverted''

56: bibliographic data of the sort required for impact analysis.  At the

57: end of 2004, Google introduced a new service, GoogleScholar,

58: making much of this same data available.  Here we analyze 203

59: publications, collectively cited by more than 4000 other publications.

60: We show surprisingly good agreement between the citation counts

61: provided by the two services.  Data quality across the systems is

62: analyzed, and potentially useful complementarities between them are

63: considered.  The additional robustness offered by multiple sources of

64: such data promises to increase the utility of these measurements as

65: open citation protocols and open access increase their impact on

66: electronic scientific publication practices.

67:

68: \end{quotation}

69:

70: \pagebreak

71: % \doublespacing

72:

73: \section{Background}

74:

75: Bibliometric analysis of scientific publications goes back to at least

76: the 1970s \cite{REF621,Price86,REF620}; similar analysis of judicial

77: opinions has been done by Shepard's/LexisNexis for more than a hundred

78: years.  The Institute for Scientific Information has made an

79: industry of providing citation data to libraries since the mid-1960s;

80: the products are currently available as part of Thomson/ISI (\ISI).

81: \ISI reports that they currently index 16,000 journals, books and

82: proceedings \cite{garfield98}.  While far from exhaustive (ISI estimates that of the

83: 2000 new journals reviewed annually, only 10\% are selected), the

84: service cites ``Bradford's Law'' that a relatively small number of

85: sources capture the bulk of significant scientific results.  All

86: articles appearing in selected publications have their bibliographies

87: manually transcribed, and ``inverted bibliographies'' pointing from a

88: (earlier) cited work to all (subsequent) citing publications are

89: generated to support users' searches.  Critically, the translation of these

90: bibliographies into distinct records involves a great deal of {\em

91:   manual} effort.

92:

93: May has reported extensive analyses of British

94: scientific activity in comparison with other countries, primarily

95: based on \ISI's data \cite{may97b,may97a}.  As he puts it, ``The database has many shortcomings and

96: biases, but overall it gives a wide coverage of most fields'' \cite[p. 793]{may97a}.

97: His critique of shortcomings in this data is useful:

98: \begin{quotation}

99:    Some problems have to do with the compilation of the database. It

100:    includes citations of books and chapters in edited books, but it

101:    does not include the citations in such publications.  Other

102:    publications, such as government and other agency reports and

103:    working papers, are essentially omitted.  It does not cover all

104:    significant scientific journals....  Papers that describe technical

105:    methods may attract thousands of reflexive citations, while

106:    path-breaking papers may be cited only slightly for many years.

107:    Review articles can mask the primary papers they review. Citation

108:    patterns vary among fields....  Spectacular scientific errors may

109:    attract many citations....  Self-citation (which accounts for at

110:    least 10\% of all citations) may bias some of the results.  \cite[Footnote 3]{may97a}

111: \end{quotation}

112: Some of these issues (e.g., having to do with the sources being

113: compiled) can be expected to be altered by new forms of electronic

114: scientific publication, but others (e.g., self-citation) are likely to

115: be more intrinsic to scientific authoring processes.  It is for this

116: reason that Google's recent announcement of their Scholar.Google(beta)

117: (\GS) service is welcome, as a second, independent source of similar

118: data.

119:

120: While specifics concerning Google's operation are difficult to come

121: by, it is reasonable to assume that the process relies on more {\em

122:   automatic}, algorithmic procedures than those used by \ISI.  Linkage

123: structure among Web pages  is analogous in important ways to

124: scientific publication \cite{REF1162,Lawrence98}.  These links are

125: captured by Web crawling algorithms as both ``citing'' pages (i.e.,

126: Web pages with HTML anchors pointing to other Web pages) and

127: ``cited'' pages are visited, a feature exploited by Google's original

128: ``PageRank'' retrieval algorithm \cite{Page98}.  \GS attempts to bring

129: similar analyses to academic publication, despite the fact that these

130: source documents are often much less accessible.

131:

132: \section{Methods}

133:

134: Given an author's name\footnote{Translation of an author's name into

135:   search query string(s) can be ambiguous.  In these experiments both

136:   the first initial alone and the first initial plus middle initial, each with

137:   the full last name, were used as the author's name.}, both \ISI and \GS

138: provide search facilities that return a list of publications

139: putatively authored by this individual, together with the number of

140: times each of these publications has been cited by other publications

141: discovered by the service.  Six academics were selected at random and

142: used as ``probe'' queries with both systems.\footnote{These academics

143:   were all drawn from a single, particularly interdisciplinary

144:   academic department.}  Complete bibliographies of all publications

145: by these authors were manually reconciled against 203 references to

146: these publications returned by one or both systems, and then analyzed

147: in detail.  Cumulatively, \ISI reported 4741 citations to these publications and \GS

148: reported 4045.

149:

150: Because standards and format of bibliographic citations vary widely

151: across different publications, the process of reconciling citation

152: strings from different papers to the same target publication is

153: problematic, whether via \ISI's manual process or Google's automatic

154: one.  It is common, therefore, to find the same publication has been

155: treated as more than one record.\footnote{The alternative type of

156:   error, where citations to multiple, distinct publications are

157:   confounded as part of the citation record of a single entry, is

158:   more difficult to identify.}

159:

160: For example, manual inspection reveals that a single publication in

161: the ``Proceedings of the 12th Annual Conference of ACM's Special

162: Interest Group in Information Retrieval (SIGIR)'' is listed as twelve

163: separate records by \ISI; these are shown in Table 1.  While most

164: citations to this target publication have been conveniently collected

165: with respect to two of these records, such noisy data makes impact

166: analysis difficult.  In these experiments, a publication's ``impact''

167: is defined as the number of citations found to any of the variations

168: resolved to the published work, i.e., the sum is taken across all

169: records (manually) identified as referencing the same publication.

170:

171: \begin{table}

172: \begin{center}

173: \begin{tabular}{|cl|c|}  \hline

174: {\bf PubYear} & {\bf CiteString} & {\bf NCitations} \\ \hline

175: 1989	& 12 ANN INT ACM SIGIR    	& 1 \\

176: 1989	& 12 ANN INT C RES DEV  	&   1 \\

177: 1989	& 12TH P ANN INT ACM S   11	& 14 \\

178: 1989	& 12TH P INT C RES DEV    	& 1 \\

179: 1989	& ACM SIGIR INT C RES    	& 1 \\

180: 1988	& JUN P ACM SIGIR 88 G   11	& 1 \\

181: 1989	& P 11 INT ACM SIGIR C    	& 1 \\

182: 1989	& P 12 ANN INT ACM SIG    	& 2 \\

183: 1989	& P 12 ANN INT ACM SIG   11	& 16 \\

184: 1989	& SIGIR 89   11	& 2 \\

185: 1989	& SIGIR FORUM 23 11	& 1 \\

186: 1990	& SIGOIS B 11 48	& 1 \\ \hline

187: \end{tabular}

188: \caption{Citation variations for same publication}

189: \end{center}

190: \end{table}
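
The definition of impact used in these experiments can be written compactly.
The notation below is introduced only for illustration and is not used by
either service: let $R(p)$ denote the set of records manually identified as
referring to publication $p$, and let $c(r)$ be the citation count reported
for record $r$.  Then
\[
  \mathrm{impact}(p) = \sum_{r \in R(p)} c(r).
\]
For the twelve records listed in Table 1 this sum is
\[
  1+1+14+1+1+1+1+2+16+2+1+1 = 42,
\]
of which the two largest records contribute $14+16 = 30$.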

191:

192:

193: \section{Results}

194:

195: Figure \pdfig{Redundant citation noise}{chatter2-ai} shows how well both

196: systems aggregate individual citations that in fact refer to the

197: same published paper.  It shows the cumulative probability that one,

198: two, or more records listed as distinct by each system in

199: fact refer to the same publication.  For example, it shows that more

200: than 60\% of the articles are represented as unique entries within

201: \ISI's listing while 85\% of them are unique with \GS.  None of the

202: articles had more than five separate listings within \GS, while 13\%

203: had five or more entries in \ISI's system (e.g., the example shown in

204: Table 1 had 12).
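
The figures quoted above can be restated in terms of a cumulative
distribution.  The symbols are assumptions made here for illustration:
let $n_s(p)$ be the number of separate records that service $s$ lists for
publication $p$, and let $F_s(k) = \Pr[n_s(p) \le k]$ be the fraction of
publications with at most $k$ records, which is presumably the quantity
plotted.  The reported values then read
\[
  F_{\mathit{ISI}}(1) > 0.60, \qquad
  F_{\mathit{GS}}(1) = 0.85, \qquad
  F_{\mathit{GS}}(5) = 1, \qquad
  1 - F_{\mathit{ISI}}(4) = 0.13 .
\]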

205:

206: Overlap between the two sources of data was relatively small.  Of the

207: 203 publications analyzed, only 78 received at least one

208: cited reference from each system.  However, for this subset the

209: general pattern of agreement was quite good.  Figure

210: \pdfig{Correlation of \GS and \ISI citation counts}{citeCorr4-ai}

211: shows the number of citations reported by \GS and \ISI for the subset

212: of 78 publications.  Note that the number of citations is plotted on a

213: log-log scale, reflecting the well-known power law distribution of

214: citation counts \cite{redner98}. Based on this sample, there seems

215: to be good evidence ($r^2 = 0.5023$, $t=8.872$, $p<0.005$) for a power law

216: relation ($\mathit{GS} = 3.1718 \times \mathit{ISI}^{0.6359}$) relating the number of

217: citations reported by the two services.
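
Taking the fitted coefficients at face value, the relation is linear in
logarithms, which is why it appears as a straight line on a log-log plot:
\[
  \log_{10} \mathit{GS} = \log_{10} 3.1718 + 0.6359 \, \log_{10} \mathit{ISI}
  \approx 0.5013 + 0.6359 \, \log_{10} \mathit{ISI} .
\]
As an illustrative calculation (not a value reported by either service), a
publication with 100 \ISI citations would be predicted to have
\[
  \mathit{GS} \approx 3.1718 \times 100^{0.6359} \approx 59
\]
citations in the other service.  Because the exponent is less than one, the
two counts are predicted to coincide at
\[
  \mathit{ISI} = 3.1718^{1/(1-0.6359)} \approx 24 ;
\]
below this point the fit predicts the larger count from \GS and above it
the larger count from \ISI sources.  With $r^2 \approx 0.5$, individual
publications can deviate substantially from these predicted values.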

218:

219: Figure \pdfig{Temporal distribution of citations}{yearSumm-ai} shows the

220: cumulative number of citations reported by publication year of the

221: cited work.  An alternative criterion for considering the match

222: between systems is to define a ``miss'' to be a publication for which

223: one service has identified three or more citations, but which the

224: other service does not capture whatsoever.  Figure \pdfig{Temporal

225:   distribution of missing citations}{yearMiss-ai} shows missing

226: citations, found by one service but not the other, again distributed

227: by publication year.  \GS seems competitive in terms of coverage for

228: materials published in the last twenty years; before then \ISI seems

229: to dominate.
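
The ``miss'' criterion can be stated symbolically.  The notation is an
assumption introduced here for illustration: write $c_A(p)$ and $c_B(p)$ for
the citation counts that services $A$ and $B$ report for publication $p$,
counting a publication that a service does not list at all as having zero
citations.  Then $p$ is a miss for service $B$ when
\[
  c_A(p) \ge 3 \quad \mbox{and} \quad c_B(p) = 0 ,
\]
and symmetrically a miss for service $A$ when the roles are exchanged.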

230:

231: Coverage with respect to the two systems can also be analyzed by other

232: dimensions of the publications, including publication venue and

233: author.  Figure \pdfig{Coverage by publication type}{typeSumm-ai}

234: aggregates publications into four categories: conference publications,

235: books (or book chapters), journal articles, and other forms of

236: publications (e.g., technical reports and dissertations); $\chi^2$

237: tests confirm the distributions are distinct.  Publications in books

238: (as noted by May, above) and conference proceedings are much more

239: likely to be available via \GS; conversely, journal articles are

240: better indexed via \ISI.  If citations are summarized with respect to

241: the six authors analyzed, Figure \pdfig{Coverage for individual

242:   authors}{authSumm-ai} shows that some authors are better represented

243: by one service than by the other.  Such variation is to be

244: expected, given that some authors, via the publication venues through

245: which they typically report, will be more or less well-covered by one

246: service or another.  Again,  $\chi^2$

247: tests confirm the distributions are distinct.
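
The text does not state which $\chi^2$ test was applied.  One natural
reading, offered here only as an assumption, is a test of homogeneity on a
contingency table whose rows are the two services and whose columns are the
categories (four publication types, or six authors).  With $O_{ij}$ the
observed count in cell $(i,j)$, $R_i$ and $C_j$ the row and column totals,
$N$ the grand total, and $E_{ij} = R_i C_j / N$ the count expected if both
services had the same distribution, the statistic would be
\[
  \chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} ,
\]
with $(2-1)(4-1) = 3$ degrees of freedom for publication type and
$(2-1)(6-1) = 5$ for authors.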

248:

249: \section{Summary}

250:

251: Evaluating academics' performance, as individuals or as part of larger

252: social groups, in terms of the number of publications they produce is

253: common practice.  The ability to quantify their ``impact'' in terms of

254: the number of other publications that subsequently choose to cite

255: their work arguably provides a more refined and relevant measure.

256: Such data is subject, however, to confounding factors ranging from

257: noise in the process of collating and ``inverting'' bibliographic

258: references through intrinsic features of scientific publication (e.g.,

259: self-citation).  The results presented above are therefore reassuring

260: in that \GS provides the first

261: independent confirmation of impact data previously available only from

262: \ISI.  However, analysis across both systems also shows significant

263: variations with respect to the two dimensions (authorship and

264: publication type) considered; other dimensions of variation are

265: certain to exist.  This analysis also revealed some problems common to

266: both systems.  For example, both services support only simple ASCII

267: encodings of author names, which are likely to lose important character

268: markup (available via Unicode representations); this can be especially

269: problematic for authors whose names contain non-ASCII characters.

270:

271: Critically, new services within selected disciplines \cite{acmPortal,ieeeDL},

272: changing standards regarding exchange of ``open citation'' information

273: \cite{CrossRef}, in combination with increased pressure for public access to

274: scientific publications \cite{Zerhouni04}, may soon make some

275: operational difficulties associated with impact analysis obsolete.

276: In the interim, academic deans,

277: science policy advisors and anyone else relying on citation count

278: data  are cautioned that any

279: individual measurement requires more context.

280: In the longer term, the increased

281: availability of statistics like bibliographic impact makes it

282: increasingly important to understand how publication and citation

283: activities, within both scientific publication and Web publishing more

284: generally, can be included as part of more holistic evaluations of

285: intellectual contribution \cite{grant00}.

286:

287: \bibliographystyle{plain}

288: \bibliography{iq,/Users/rik/Writing/FOA/Manuscript/foa.bib,/Users/rik/Writing/Biblio/belew.bib}

289:

290: \end{document}

291: