%APN3_PROCEEDINGS_FORM%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
% TEMPLATE.TEX -- APN3 (2003) ASP Conference Proceedings template.
%
% Derived from ADASS VIII (98) ASP Conference Proceedings template
% Updated by N. Manset for ADASS IX (99), F. Primini for ADASS 2000,
% D.Bohlender for ADASS 2001, and H. Payne for ADASS XII and LaTeX2e.

\documentclass[11pt,twoside]{article}  % Leave intact
\usepackage{adassconf}

% If you have the old LaTeX 2.09, and not the current LaTeX2e, comment
% out the \documentclass and \usepackage lines above and uncomment
% the following:

%\documentstyle[11pt,twoside,adassconf]{article}

\begin{document}   % Leave intact

\paperID{P.122}

\title{Bibliographic Classification using the ADS Databases}
%\titlemark{ }

\author{Alberto Accomazzi,
Michael J. Kurtz,
G\"unther Eichhorn,
Edwin Henneken,
Carolyn S. Grant,
Markus Demleitner,
Stephen S. Murray}
\affil{Harvard-Smithsonian Center for Astrophysics, 60 Garden Street, Cambridge, MA 02138}

\contact{Alberto Accomazzi}
\email{aaccomazzi@cfa.harvard.edu}

\paindex{Accomazzi, A.}
\aindex{Kurtz, M. J.}
\aindex{Eichhorn, G.}
\aindex{Henneken, E.}
\aindex{Grant, C. S.}
\aindex{Demleitner, M.}
\aindex{Murray, S. S.}

\authormark{Accomazzi et al.}

\keywords{Classification, Bibliographies}

\begin{abstract}          % Leave intact
We discuss two techniques used to characterize bibliographic records based
on their similarity to and relationship with the contents of the
NASA Astrophysics Data System (ADS) databases.
The first method has been used to classify input text as
being relevant to one or more subject areas based on an analysis of
the frequency distribution of its individual words.
The second method has been used to classify existing records as
being relevant to one or more databases based on the distribution
of the papers citing them. Both techniques have proven to be valuable
tools in assigning new and existing bibliographic records to different
disciplines within the ADS databases.
\end{abstract}

\section{Overview}

The NASA Astrophysics Data System (ADS; Kurtz et al.\ 2000)
maintains three main databases
of scientific bibliographies: Astronomy, Physics,
and the arXiv e-prints.
Over the past few years the ADS has also created and maintained a separate
``general'' database containing records which do not readily fit in
the three main databases.  The general database serves two purposes:
it is a staging area for bibliographic records which may
later be incorporated into one of the other databases, and it provides
a placeholder for records which, while not directly related
to physics or astronomy, may cite or be cited by records in those fields.
For instance,
it is not unusual for physics papers to cite articles in chemistry
or computer science and vice versa.  The typical
use of such a database is to store all records from interdisciplinary
journals such as {\it Science} and {\it Nature}.  While some of the
articles published in these journals will be entered in the Astronomy
and Physics databases, their full tables of contents will always be
available in the general database.

When new records are provided to the ADS without any metadata enabling
them to be reliably labeled as belonging to either astronomy
or physics, a decision has to be made about which
database they should be assigned to.  Given the sheer amount of
bibliographic data handled by the ADS project
(Grant et al.\ 2000), this decision
has to be made automatically most of the time.  This paper describes how
we have used two different tools to help with the automatic
classification of bibliographic records.  The first tool is a
text classifier which analyzes textual data using
a well-known Bayesian probabilistic model (McCallum \& Nigam 1998).
A document is classified by estimating
the likelihood of its membership in a given database from the
relative frequency of its words in that database.
The second tool is a citation classifier which assigns existing ADS
records to one or more databases based on how frequently they have
been cited by the records in those databases.
The underlying assumption of the citation classifier is that any paper
which has been frequently cited by papers in a particular subject area
should be considered relevant to that subject area.

Both classifiers have been trained on a set of 400 articles
published in the journal {\it Nature} during 1987.  In this sample, 39 records
were picked as being relevant to astronomy by a librarian.  The classifiers
were tested against the full set of articles published by {\it Nature} in 1997
(4033 records, 434 of which had citation data).  These records consist
of scientific research articles as well as short news, editorials,
book reviews, and obituaries.

\section{The Text Classifier}

The problem of text classification can be summarized as follows:
given a string of words from a document, to which of a finite
set of categories can the document best be assigned?
Following a probabilistic approach, we chose to implement a
Multinomial Naive Bayesian Classifier,
which allows a straightforward
computation of the maximum-likelihood category from
the frequencies of the document's words within each category.
In our application, each category represents the set of documents
in a particular database.  Since the frequencies of the words in
each database are readily available from the database-specific indexes
that the ADS maintains, the probabilities
can be computed in real time from the index data.
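
As an illustration, the standard multinomial naive Bayes decision rule
described by McCallum \& Nigam (1998) can be written as follows.  This is
a sketch of the textbook form rather than the exact scoring used by the
ADS; the symbols $n_w$, $f_{w,c}$, and $V$, as well as the add-one (Laplace)
smoothing, are assumptions introduced here for illustration.  For a document
in which word $w$ occurs $n_w$ times, the chosen category is
\[
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w} n_w \log P(w \mid c) \right],
\qquad
P(w \mid c) = \frac{1 + f_{w,c}}{|V| + \sum_{w' \in V} f_{w',c}},
\]
where $f_{w,c}$ is the number of occurrences of word $w$ in category
(database) $c$, $V$ is the vocabulary, and $P(c)$ is the prior probability
of category $c$.  Working with logarithms avoids numerical underflow when
many small probabilities are multiplied.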

Testing of the classifier
showed that it performed well on documents for which
at least 20 text words were available from either the title or the
abstract.
The challenge we faced was classifying records for which only a title was
available.  In order to improve the classifier, a number of pre- and
post-scoring steps were taken:

\begin{itemize}
\item The input text was pre-processed using the standard parsing
rules used by the ADS search engine (Accomazzi et al.\ 2000).

\item All words consisting solely of digits were removed, as well
as title words and phrases which had no relevance for classification
(e.g.,\ ``obituary'').

\item The likelihood score generated by the Bayesian classifier was
normalized in order to limit the contribution of records consisting of
few words, for which the highest rate of misclassification was found.

\item To compensate for the previous step, a set of
database-specific ``trigger'' keywords was defined which, when found,
boosted the classification score of the input text.
\end{itemize}

The resulting classifier was implemented as a function of two parameters:
$N_t$, the minimum number of words required for
a document to be considered classifiable, and $S_t$, the minimum
classification score necessary for a document to be considered as belonging
to a particular database.
The results of the classification are displayed in Figure~\ref{fig1}, where
the performance of the classifier can be judged by looking at
the Precision ($P$) and Recall ($R$) of the classification for each
combination of cutoff score and minimum number of words.
As the plot shows, the classifier has been designed to yield a
high precision regardless of the number of input words.  This is
crucial for our application since we do not want to mistakenly assign
non-relevant records to any of the ADS databases.
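
For reference, precision and recall are used here in their usual
information-retrieval sense.  Writing $TP$, $FP$, and $FN$ for the numbers
of true positives, false positives, and false negatives (notation assumed
here for illustration), the standard definitions are
\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.
\]
A high $P$ means that few non-relevant records are assigned to a database,
while a high $R$ means that few relevant records are missed.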

\section{The Citation Classifier}

The citation classifier was implemented to assign existing ADS
records to one or more databases based on how frequently they have
been cited by the records in those databases.
The underlying assumption of the citation classifier is that any paper
which has been frequently cited by papers in a particular subject area
should be considered relevant to that subject area (Kurtz et al.\ 2002).
The scope and usefulness of this classifier are obviously limited by
the availability of citation data for the records being considered:
an article which has not been cited in any of the astronomy or
physics journals for which the ADS has reference data cannot be
categorized by the classifier.  However, since the coverage of
journal references from the core astronomy literature is
quite thorough in the ADS, we can expect that important research
articles will be cited with some frequency within astronomy.

Based on this premise, we implemented a citation classifier
by considering, for each record for which citations are available,
the ratio between the number of
citations coming from a particular database and the total number of citations.
If the ratio is high enough, a significant
portion of the papers citing the record come from a single
database, and we conclude that the record is relevant to that database.
The citation classifier was implemented as a function of
two parameters: $N_c$, the minimum number of citations required
for a record to be considered classifiable, and $R_c$, the minimum ratio
between the number of citations from a particular database and the total
number of citations.
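
The decision rule described above can be written compactly as follows;
the symbols $c_d$ and $c_{\rm tot}$ are introduced here for illustration,
and both thresholds are assumed to be inclusive.  If a record has received
$c_{\rm tot}$ citations in total, of which $c_d$ come from records in
database $d$, the record is assigned to $d$ when
\[
c_{\rm tot} \ge N_c \quad \mbox{and} \quad \frac{c_d}{c_{\rm tot}} \ge R_c .
\]
Taking, for illustration, $N_c = 4$ and $R_c = 0.5$, a hypothetical record
cited 10 times, 6 of them by astronomy papers, has a ratio of $0.6$ and
would be assigned to the Astronomy database, whereas a record cited only
3 times would be left unclassified regardless of its ratio.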

The performance of the citation classifier was tested against
the set of 434 articles in the {\it Nature} sample which
have citation data available.  The results of the classifier are
summarized in Figure~\ref{fig1}.
Once again we notice little variation in the performance of the classifier
as a function of the total number of citations for a given paper, which
is a desirable feature.  On the other hand, given the limited number
of citations for some of the records available, the recall is
much lower than what was achieved with the text classifier.

\begin{figure}
\plotone{P.122_1.eps}
\caption{Results for the text and citation classifiers.} \label{fig1}
\end{figure}

\section{Discussion}

The text and citation classifiers described here have proven to
be valuable tools in categorizing records from scientific journals
such as {\it Nature} and {\it Science} for the purpose of introducing them into
the Astronomy or Physics databases.
Further inspection of the results showed that
misclassified papers are often borderline cases involving subjects
such as Geophysics and Planetary Science, which overlap the different
databases.  Additionally, a small number of records which were originally not
selected as belonging to Astronomy by the librarian were later
found to be relevant upon a subsequent review of the
classifiers' results.

Because the text and citation classifiers use
different data when assigning articles to a database, we find that
the best overall results can be achieved by combining
the output from both classifiers.  By choosing
conservative settings for the parameters controlling the classifiers
($S_t = 0.25$, $N_t = 5$, $R_c = 0.5$, $N_c = 4$),
we were able to achieve a precision of 0.94 with a
recall of 0.89 when classifying the sample against
the astronomy database.
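
As a derived illustration (this figure is computed here from the values
quoted above and is not reported in the original analysis), the balanced
F-measure, i.e.\ the harmonic mean of precision and recall, corresponding
to these values is
\[
F_1 = \frac{2 P R}{P + R}
    = \frac{2 \times 0.94 \times 0.89}{0.94 + 0.89} \approx 0.91 .
\]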

\acknowledgments
The ADS is funded by NASA Grant NCC5-189 and is available online at
\htmladdURL{http://ads.harvard.edu}.

\begin{references}

\reference Accomazzi, A., Eichhorn, G., Kurtz, M.~J.,
Grant, C.~S., \& Murray, S.~S.\ 2000, \aaps, 143, 85

\reference Grant, C.~S., Accomazzi,
A., Eichhorn, G., Kurtz, M.~J., \& Murray, S.~S.\ 2000, \aaps, 143, 111

\reference Kurtz, M.~J., Eichhorn,
G., Accomazzi, A., Grant, C.~S., Murray, S.~S., \& Watson, J.~M.\ 2000,
\aaps, 143, 41

\reference Kurtz, M.~J., Eichhorn,
G., Accomazzi, A., Grant, C.~S., \& Murray, S.~S.\ 2002, Proc.\ SPIE, 4847,
238

\reference McCallum, A., \& Nigam, K.\ 1998,
AAAI-98 Workshop on Learning for Text Categorization,
\htmladdURL{http://www.cs.cmu.edu/\%7Ecmccallum}

\end{references}

% Do not place any material after the references section

\end{document}  % Leave intact