1: %APN3_PROCEEDINGS_FORM%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %
3: % TEMPLATE.TEX -- APN3 (2003) ASP Conference Proceedings template.
4: %
5: % Derived from ADASS VIII (98) ASP Conference Proceedings template
6: % Updated by N. Manset for ADASS IX (99), F. Primini for ADASS 2000,
7: % D.Bohlender for ADASS 2001, and H. Payne for ADASS XII and LaTeX2e.
8: %
9: % Use this template to create your proceedings paper in LaTeX format
10: % by following the instructions given below. Much of the input will
11: % be enclosed by braces (i.e., { }). The percent sign, "%", denotes
12: % the start of a comment; text after it will be ignored by LaTeX.
13: % You might also notice in some of the examples below the use of "\ "
14: % after a period; this prevents LaTeX from interpreting the period as
15: % the end of a sentence and putting extra space after it.
16: %
17: % You should check your paper by processing it with LaTeX. For
18: % details about how to run LaTeX as well as how to print out the User
19: % Guide, consult the README file. You should also consult the sample
20: % LaTeX papers, sample1.tex and sample2.tex, for examples of including
21: % figures, html links, special symbols, and other advanced features.
22: %
23: % If you do not have access to the LaTeX software or a laser printer
24: % at your site, you can still prepare your paper following the
25: % instructions in the User Guide. In such cases, the editors will
26: % process the file and make any necessary editorial adjustments.
27: %
28: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
29: %
30: \documentclass[11pt,twoside]{article} % Leave intact
31: \usepackage{adassconf}
32:
33: % If you have the old LaTeX 2.09, and not the current LaTeX2e, comment
34: % out the \documentclass and \usepackage lines above and uncomment
35: % the following:
36:
37: %\documentstyle[11pt,twoside,adassconf]{article}
38:
39: \begin{document} % Leave intact
40:
41: %-----------------------------------------------------------------------
42: % Paper ID Code
43: %-----------------------------------------------------------------------
44: % Enter the proper paper identification code. The ID code for your
45: % paper is the session number associated with your presentation as
% published in the official conference proceedings. You can
% find this number by locating your abstract in the printed proceedings
% that you received at the meeting or on-line at the conference web
% site; the ID code is the letter/number sequence preceding the title
% of your presentation.
51: %
52: % This will not appear in your paper; however, it allows different
53: % papers in the proceedings to cross-reference each other. Note that
54: % you should only have one \paperID, and it should not include a
55: % trailing period.
56: %
57: % EXAMPLE: \paperID{O4-1}
58: % EXAMPLE: \paperID{P7-7}
59: %
60:
61: \paperID{P.122}
62:
63: %-----------------------------------------------------------------------
64: % Paper Title
65: %-----------------------------------------------------------------------
66: % Enter the title of the paper.
67: %
68: % EXAMPLE: \title{A Breakthrough in Astronomical Software Development}
69: %
70: % If your title is so long as to fill the page header when you print it,
71: % then please supply a short form as a \titlemark.
72: %
73: % EXAMPLE:
74: % \title{Rapid Development for Distributed Computing, with Implications
75: % for the Virtual Observatory}
76: % \titlemark{Rapid Development for Distributed Computing}
77: %
78:
79: \title{Bibliographic Classification using the ADS Databases}
80: %\titlemark{ }
81:
82: %-----------------------------------------------------------------------
83: % Authors of Paper
84: %-----------------------------------------------------------------------
85: % Enter the authors followed by their affiliations. The \author and
86: % \affil commands may appear multiple times as necessary (see example
87: % below). List each author by giving the first name or initials first
88: % followed by the last name. Authors with the same affiliations
% should be grouped together.
90: %
91: % EXAMPLE: \author{Raymond Plante, Doug Roberts,
92: % R.\ M.\ Crutcher\altaffilmark{1}}
93: % \affil{National Center for Supercomputing Applications,
94: % University of Illinois Urbana-Champaign, Urbana, IL
95: % 61801}
96: % \author{Tom Troland}
97: % \affil{University of Kentucky}
98: %
99: % \altaffiltext{1}{Astronomy Department, UIUC}
100: %
101: % In this example, the first three authors, "Plante", "Roberts", and
102: % "Crutcher" are affiliated with "NCSA". "Crutcher" has an alternate
103: % affiliation with the "Astronomy Department". The fourth author,
104: % "Troland", is affiliated with "University of Kentucky"
105:
106: \author{Alberto Accomazzi,
107: Michael J. Kurtz,
108: G\"unther Eichhorn,
109: Edwin Henneken,
110: Carolyn S. Grant,
111: Markus Demleitner,
112: Stephen S. Murray}
113: \affil{Harvard-Smithsonian Center for Astrophysics, 60 Garden Street, Cambridge, MA 02138}
114:
115: %-----------------------------------------------------------------------
116: % Contact Information
117: %-----------------------------------------------------------------------
118: % This information will not appear in the paper but will be used by
119: % the editors in case you need to be contacted concerning your
120: % submission. Enter your name as the contact along with your email
121: % address.
122: %
123: % EXAMPLE: \contact{Dennis Crabtree}
124: % \email{crabtree@cfht.hawaii.edu}
125: %
126:
\contact{Alberto Accomazzi}
\email{aaccomazzi@cfa.harvard.edu}
129:
130: %-----------------------------------------------------------------------
131: % Author Index Specification
132: %-----------------------------------------------------------------------
133: % Specify how each author name should appear in the author index. The
134: % \paindex{ } should be used to indicate the primary author, and the
135: % \aindex for all other co-authors. You MUST use the following
136: % syntax:
137: %
138: % SYNTAX: \aindex{Lastname, F. M.}
139: %
140: % where F is the first initial and M is the second initial (if
141: % used). This guarantees that authors that appear in multiple papers
142: % will appear only once in the author index.
143: %
144: % EXAMPLE: \paindex{Crabtree, D.}
145: % \aindex{Manset, N.}
146: % \aindex{Veillet, C.}
147: %
148: % NOTE: this information is also used to build the author list that
149: % appears in the table of contents. Authors will be listed in the order
% of the \paindex and \aindex commands.
151: %
152:
\paindex{Accomazzi, A.}
154: \aindex{Kurtz, M. J.}
155: \aindex{Eichhorn, G.}
156: \aindex{Henneken, E.}
157: \aindex{Grant, C. S.}
158: \aindex{Demleitner, M.}
159: \aindex{Murray, S. S.}
160:
161: %-----------------------------------------------------------------------
162: % Author list for page header
163: %-----------------------------------------------------------------------
% Please supply a list of author last names for the page header in
% one of these formats:
166: %
167: % EXAMPLES:
168: % \authormark{Lastname}
169: % \authormark{Lastname1 \& Lastname2}
170: % \authormark{Lastname1, Lastname2, ... \& LastnameN}
171: % \authormark{Lastname et al.}
172: %
173: % Use the "et al." form in the case of seven or more authors, or if
174: % the preferred form is too long to fit in the header.
175:
176: \authormark{Accomazzi et al.}
177:
178: %-----------------------------------------------------------------------
179: % Subject Index keywords
180: %-----------------------------------------------------------------------
181: % Enter a comma separated list of up to 6 keywords describing your
182: % paper. These will NOT be printed as part of your paper; however,
183: % they will be used to generate the subject index for the proceedings.
184: % There is no standard list; however, you can consult the indices
185: % for past proceedings (http://adass.org/adass/proceedings/).
186: %
187: % EXAMPLE: \keywords{visualization, astronomy: radio, parallel
188: % computing, AIPS++, Galactic Center}
189: %
190: % In this example, the author noticed that "radio astronomy" appeared
191: % in the ADASS VII Index as "astronomy" being the major keyword and
192: % "radio" as the minor keyword. The colon is used to introduce another
193: % level into the index.
194:
195: \keywords{Classification, Bibliographies}
196:
197: %-----------------------------------------------------------------------
198: % Abstract
199: %-----------------------------------------------------------------------
200: % Type abstract in the space below. Consult the User Guide and Latex
201: % Information file for a list of supported macros (e.g. for typesetting
202: % special symbols). Do not leave a blank line between \begin{abstract}
203: % and the start of your text.
204:
205: \begin{abstract} % Leave intact
We discuss two techniques used to characterize bibliographic records based
208: on their similarity to and relationship with the contents of the
209: NASA Astrophysics Data System (ADS) databases.
210: The first method has been used to classify input text as
211: being relevant to one or more subject areas based on an analysis of
212: the frequency distribution of its individual words.
213: The second method has been used to classify existing records as
214: being relevant to one or more databases based on the distribution
215: of the papers citing them. Both techniques have proven to be valuable
216: tools in assigning new and existing bibliographic records to different
217: disciplines within the ADS databases.
218:
219: \end{abstract}
220:
221: %-----------------------------------------------------------------------
222: % Main Body
223: %-----------------------------------------------------------------------
224: % Place the text for the main body of the paper here. You should use
225: % the \section command to label the various sections; use of
226: % \subsection is optional. Significant words in section titles should
227: % be capitalized. Sections and subsections will be numbered
228: % automatically.
229: %
230: % EXAMPLE: \section{Introduction}
231: % ...
232: % \subsection{Our View of the World}
233: % ...
234: % \section{A New Approach}
235: %
236: % It is recommended that you look at the sample papers, sample1.tex
237: % and sample2.tex, for examples for formatting references, footnotes,
238: % figures, equations, html links, lists, and other special features.
239:
240: \section{Overview}
241:
242:
The NASA Astrophysics Data System (ADS; Kurtz et al.\ 2000)
244: maintains three main databases
245: of scientific bibliographies: Astronomy, Physics,
246: and the ArXiv e-prints.
247: Over the past few years the ADS has created and maintained a separate
248: ``general'' database containing records which do not readily fit in
the three main databases. The general database serves two purposes:
it acts as a staging area for bibliographic records which may
later be incorporated into one of the other databases, and it provides
a placeholder for those records which, while not directly related
to physics or astronomy, may cite or be cited by papers in those fields.
For instance, it is not unusual for physics papers to cite articles in chemistry
or computer science and vice versa. The typical
256: use of such a database is to store all records from inter-disciplinary
257: journals such as {\it Science} and {\it Nature}. While some of the
258: articles published in these journals will be entered in the Astronomy
259: and Physics databases, their full table of contents will always be
260: available in the general database.
261:
When new records are provided to the ADS without any metadata enabling
them to be reliably labeled as belonging to either physics or astronomy,
a decision has to be made as to which
database they should be assigned to. Given the sheer amount of
bibliographic data being handled by the ADS project
(Grant et al.\ 2000), this decision
has to be made automatically most of the time. This paper describes how
269: we have made use of two different tools to help us with the automatic
270: classification of bibliographic records. The first tool is a
271: text classifier which performs an analysis of textual data based
on a well-known Bayesian probabilistic model (McCallum \& Nigam 1998).
273: Classification of a document is performed by estimating
274: the likelihood of its membership in a certain database based on the
275: relative frequency of the words from the text in that database.
276: The second tool is a citation classifier which assigns existing ADS
277: records to one or more databases based on how frequently they have
278: been cited by the records in those databases.
279: The underlying assumption of the citation classifier is that any papers
280: which have been frequently cited by papers in a particular subject area
281: should be considered relevant to such subject area.
282:
Both classifiers have been trained on a set of 400 articles
published in the journal {\it Nature} during 1987. In this sample, 39 records
were picked as being relevant to astronomy by a librarian. The classifiers
were tested against the full set of articles published by {\it Nature} in 1997
(4033 records, 434 of which had citation data). These records consist
288: of scientific research articles as well as short news, editorials,
289: book reviews, and obituaries.
290:
291:
\section{The Text Classifier}
293:
294: The problem of text classification can be summarized as follows:
295: given a certain string of words from a document, which of a finite
296: set of categories can this document be best assigned to?
297: Following a probabilistic approach, we chose to implement a
298: Multinomial Naive Bayesian Classifier
299: which allows a straightforward
300: computation of the category with the maximum likelihood based on
301: the frequencies of the document's words within each category.
302: In our application, each category represents the set of documents
303: in a particular database. Since the frequencies of the words in
304: each database are readily available from the database-specific indexes
305: that the ADS maintains, the computation of the probabilities
306: can be carried out in real time from the index data.
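
In its textbook form, which we give here only as an illustrative sketch,
the multinomial model assigns a document $d$ to the database $\hat{c}$
that maximizes the log-likelihood
\begin{equation}
\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{w \in d} n_w \log P(w|c) \right],
\qquad
P(w|c) = \frac{f_{w,c} + 1}{F_c + |V|},
\end{equation}
where $n_w$ is the number of times word $w$ occurs in $d$, $f_{w,c}$ is the
frequency of $w$ in the index of database $c$, $F_c = \sum_w f_{w,c}$ is the
total word count of that index, and $|V|$ is the size of the vocabulary.
The symbols and the add-one smoothing of $P(w|c)$ are assumptions made for
this example; McCallum \& Nigam (1998) describe the general model.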
307:
In practice, the classifier performed well on documents for which
at least 20 text words were available from either the title or the
abstract.
The challenge we faced was classifying records for which only a title was
available. In order to improve the classifier for such records, a number of pre- and
post-scoring steps were taken:
316:
317: \begin{itemize}
318: \item The input text was pre-processed using the standard parsing
rules used by the ADS search engine (Accomazzi et al.\ 2000).
320:
321: \item All words consisting solely of digits were removed, as well
322: as title words and phrases which had no relevance for classification
(e.g., ``obituary'').
324:
325: \item The likelihood score generated by the Bayesian classifier was
326: normalized in order to limit the contribution of records consisting of
327: few words, for which the highest rate of misclassification was found.
328:
\item To compensate for the previous step, a set of
database-specific ``trigger'' keywords was defined which, when found,
boosted the classification score of the input text.
332: \end{itemize}
333:
The resulting classifier was implemented as a function of two parameters:
$N_t$, the minimum number of words required for
a document to be considered classifiable, and $S_t$, the minimum
classification score necessary for a document to be considered as belonging
to a particular database.
The results of the classification are displayed in Figure~\ref{fig1}, where
the performance of the classifier can be judged by looking at
the Precision ($P$) and Recall ($R$) of the classification for each
combination of cutoff score and minimum number of words.
As the plot shows, the classifier has been designed to yield a
high precision regardless of the number of input words. This is crucial
for our application since we do not want to mistakenly assign
non-relevant records to any of the ADS databases.
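
For reference, the two measures follow the usual definitions. Writing $T_p$
for the number of records correctly assigned to a database, $F_p$ for the
number incorrectly assigned to it, and $F_n$ for the number of relevant
records that were not assigned (symbols introduced here only for
illustration), we have
\begin{equation}
P = \frac{T_p}{T_p + F_p}, \qquad R = \frac{T_p}{T_p + F_n}.
\end{equation}
A high $P$ therefore means that few non-relevant records enter a database,
while a high $R$ means that few relevant records are missed.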
347:
348:
349: \section{The Citation Classifier}
350:
351: The citation classifier was implemented to assign existing ADS
352: records to one or more databases based on how frequently they have
353: been cited by the records in those databases.
354: The underlying assumption of the citation classifier is that any papers
355: which have been frequently cited by papers in a particular subject area
should be considered relevant to such subject area (Kurtz et al.\ 2002).
The scope and usefulness of this classifier are limited by
358: the availability of citation data for the records being considered:
359: an article which has not been cited in any of the astronomy or
360: physics journals for which the ADS has reference data cannot be
361: categorized by the classifier. However, since the coverage of
362: journal references from the core astronomy literature is
363: quite thorough in the ADS, we can expect that important research
364: articles will be cited with some frequency within astronomy.
365:
366: Based on this premise, we implemented a citation classifier
367: by considering, for each record for which citations are available,
368: the ratio between the number of
369: citations belonging to a particular database and the total number of citations.
If the ratio is high enough, we can conclude that, since a significant
portion of the papers citing the record come from a single
database, the record is relevant to that database.
The citation classifier was implemented as a function of
two parameters: $N_c$, the minimum number of citations required
for a record to be considered classifiable, and $R_c$, the minimum ratio
between the number of citations in a particular database and the total
number of citations required for the record to be assigned to that database.
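
Stated compactly, and reading $R_c$ as a threshold on the citation ratio,
a record is assigned to database $c$ when
\begin{equation}
n \ge N_c \qquad \mbox{and} \qquad \frac{n_c}{n} \ge R_c ,
\end{equation}
where $n$ is the total number of citations to the record and $n_c$ is the
number coming from database $c$ (notation introduced here for illustration;
whether the comparisons are strict is an assumption of this sketch).
As a hypothetical example, with $N_c = 4$ and $R_c = 0.5$ a record with
$n = 10$ citations, $n_c = 6$ of which come from the astronomy database,
has a ratio of $0.6$ and would be assigned to astronomy, whereas a record
with only three citations would be left unclassified.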
378:
379: The performance of the citation classifier was tested against
the set of 434 articles in the {\it Nature} sample which
381: have citation data available. The results of the classifier are
382: summarized in Figure~\ref{fig1}.
383: Once again we notice little variation in the performance of the classifier
384: as a function of the total number of citations for a given paper, which
385: is a desirable feature. On the other hand, given the limited number
386: of citations for some of the records available, the recall is
387: much lower than what was achieved with the text classifier.
388:
389:
390: \begin{figure}
391: \plotone{P.122_1.eps}
392: \caption{Results for the text and citation classifiers} \label{fig1}
393: \end{figure}
394:
395: \section{Discussion}
396:
397:
The text and citation classifiers described here have proven to
be valuable tools in categorizing records from scientific journals
such as {\it Nature} and {\it Science} for the purpose of introducing them into
the Astronomy or Physics databases.
Further inspection of the results showed that
misclassified papers are often borderline cases involving subjects
such as Geophysics and Planetary Science, which overlap the different
databases. Additionally, a small number of records which were originally not
selected as belonging to Astronomy by the librarian were later
found to be relevant upon a subsequent review of the classifiers' results.
409:
Because the text and citation classifiers use
different data when assigning articles to a database, we find that
the best overall results can be achieved by combining
the output from both classifiers. By choosing
conservative settings for the parameters controlling the classifiers
($S_t = 0.25, N_t = 5, R_c = 0.5, N_c = 4$),
we were able to achieve a precision of 0.94 with a
recall of 0.89 when classifying the sample against
the astronomy database.
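
As a single summary figure, these values can be combined into the balanced
$F$-measure, the harmonic mean of precision and recall (a standard measure
that we quote here only for illustration):
\begin{equation}
F = \frac{2PR}{P+R} = \frac{2 \times 0.94 \times 0.89}{0.94 + 0.89} \approx 0.91 .
\end{equation}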
419:
420: \acknowledgments
421: The ADS is funded by NASA Grant NCC5-189 and is available online at
\htmladdURL{http://ads.harvard.edu}.
423:
424: %-----------------------------------------------------------------------
425: % References
426: %-----------------------------------------------------------------------
427: % List your references below within the reference environment
428: % (i.e. between the \begin{references} and \end{references} tags).
429: % Each new reference should begin with a \reference command which sets
430: % up the proper indentation. Observe the following order when listing
431: % bibliographical information for each reference: author name(s),
432: % publication year, journal name, volume, and page number for
433: % articles. Note that many journal names are available as macros; see
434: % the User Guide listing "macro-ized" journals.
435: %
436: % EXAMPLE: \reference Hagiwara, K., \& Zeppenfeld, D.\ 1986,
437: % Nucl.Phys., 274, 1
438: % \reference H\'enon, M.\ 1961, Ann.d'Ap., 24, 369
439: % \reference King, I.\ R.\ 1966, \aj, 71, 276
440: % \reference King, I.\ R.\ 1975, in Dynamics of Stellar
441: % Systems, ed.\ A.\ Hayli (Dordrecht: Reidel), 99
442: % \reference Tody, D.\ 1998, \adassvii, 146
443: % \reference Zacharias, N.\ \& Zacharias, M.\ 2003,
444: % \adassxii, \paperref{P7.6}
445: %
446: % Note the following tricks used in the example above:
447: %
448: % o \& is used to format an ampersand symbol (&).
% o \'e puts an acute accent over the letter e. See the User Guide
450: % and the sample files for details on formatting special
451: % characters.
452: % o "\ " after a period prevents LaTeX from interpreting the period
453: % as an end of a sentence.
454: % o \aj is a macro that expands to "Astron. J." See the User Guide
455: % for a full list of journal macros
456: % o \adassvii is a macro that expands to the full title, editor,
457: % and publishing information for the ADASS VII conference
458: % proceedings. Such macros are defined for ADASS conferences I
459: % through XI.
460: % o When referencing a paper in the current volume, use the
461: % \adassxii and \paperref macros. The argument to \paperref is
462: % the paper ID code for the paper you are referencing. See the
463: % note in the "Paper ID Code" section above for details on how to
464: % determine the paper ID code for the paper you reference.
465: %
466: \begin{references}
467:
468: \reference Accomazzi, A., Eichhorn, G., Kurtz, M.~J.,
469: Grant, C.~S., \& Murray, S.~S.\ 2000, \aaps, 143, 85
470:
471: \reference Grant, C.~S., Accomazzi,
472: A., Eichhorn, G., Kurtz, M.~J., \& Murray, S.~S.\ 2000, \aaps, 143, 111
473:
474: \reference Kurtz, M.~J., Eichhorn,
475: G., Accomazzi, A., Grant, C.~S., Murray, S.~S., \& Watson, J.~M.\ 2000,
476: \aaps, 143, 41
477:
478: \reference Kurtz, M.~J., Eichhorn,
G., Accomazzi, A., Grant, C.~S., \& Murray, S.~S.\ 2002, Proc.\ SPIE, 4847,
480: 238
481:
\reference McCallum, A., \& Nigam, K.\ 1998,
483: AAAI-98 Workshop on Learning for Text Categorization,
484: \htmladdURL{http://www.cs.cmu.edu/\%7Ecmccallum}
485:
486: \end{references}
487:
488: % Do not place any material after the references section
489:
490: \end{document} % Leave intact
491: