1: %APN3_PROCEEDINGS_FORM%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: %
3: % TEMPLATE.TEX -- APN3 (2003) ASP Conference Proceedings template.
4: %
5: % Derived from ADASS VIII (98) ASP Conference Proceedings template
6: % Updated by N. Manset for ADASS IX (99), F. Primini for ADASS 2000,
7: % D.Bohlender for ADASS 2001, and H. Payne for ADASS XII and LaTeX2e.
8: %
9: % Use this template to create your proceedings paper in LaTeX format
10: % by following the instructions given below.  Much of the input will
11: % be enclosed by braces (i.e., { }).  The percent sign, "%", denotes
12: % the start of a comment; text after it will be ignored by LaTeX.  
13: % You might also notice in some of the examples below the use of "\ "
14: % after a period; this prevents LaTeX from interpreting the period as
15: % the end of a sentence and putting extra space after it.  
16: % 
17: % You should check your paper by processing it with LaTeX.  For
18: % details about how to run LaTeX as well as how to print out the User
19: % Guide, consult the README file.  You should also consult the sample
20: % LaTeX papers, sample1.tex and sample2.tex, for examples of including
21: % figures, html links, special symbols, and other advanced features.
22: %
23: % If you do not have access to the LaTeX software or a laser printer
24: % at your site, you can still prepare your paper following the
25: % instructions in the User Guide.  In such cases, the editors will
26: % process the file and make any necessary editorial adjustments.
27: % 
28: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
29: % 
30: \documentclass[11pt,twoside]{article}  % Leave intact
31: \usepackage{adassconf}
32: 
33: % If you have the old LaTeX 2.09, and not the current LaTeX2e, comment
34: % out the \documentclass and \usepackage lines above and uncomment
35: % the following:
36: 
37: %\documentstyle[11pt,twoside,adassconf]{article}
38: 
39: \begin{document}   % Leave intact
40: 
41: %-----------------------------------------------------------------------
42: %			    Paper ID Code
43: %-----------------------------------------------------------------------
44: % Enter the proper paper identification code.  The ID code for your
45: % paper is the session number associated with your presentation as
46: % published in the official conference proceedings.  You can           
47: % find this number locating your abstract in the printed proceedings
48: % that you received at the meeting or on-line at the conference web
49: % site; the ID code is the letter/number sequence proceeding the title 
50: % of your presentation. 
51: %
52: % This will not appear in your paper; however, it allows different
53: % papers in the proceedings to cross-reference each other.  Note that
54: % you should only have one \paperID, and it should not include a
55: % trailing period.
56: %
57: % EXAMPLE: \paperID{O4-1}
58: % EXAMPLE: \paperID{P7-7}
59: %
60: 
61: \paperID{P.122}
62: 
63: %-----------------------------------------------------------------------
64: %		            Paper Title 
65: %-----------------------------------------------------------------------
66: % Enter the title of the paper.
67: %
68: % EXAMPLE: \title{A Breakthrough in Astronomical Software Development}
69: % 
70: % If your title is so long as to fill the page header when you print it,
71: % then please supply a short form as a \titlemark.
72: %
73: % EXAMPLE: 
74: %  \title{Rapid Development for Distributed Computing, with Implications
75: %         for the Virtual Observatory}
76: %  \titlemark{Rapid Development for Distributed Computing}
77: %
78: 
79: \title{Bibliographic Classification using the ADS Databases}
80: %\titlemark{ }
81: 
82: %-----------------------------------------------------------------------
83: %		          Authors of Paper
84: %-----------------------------------------------------------------------
85: % Enter the authors followed by their affiliations.  The \author and
86: % \affil commands may appear multiple times as necessary (see example
87: % below).  List each author by giving the first name or initials first
88: % followed by the last name.  Authors with the same affiliations
89: % should grouped together. 
90: %
91: % EXAMPLE: \author{Raymond Plante, Doug Roberts, 
92: %                  R.\ M.\ Crutcher\altaffilmark{1}}
93: %          \affil{National Center for Supercomputing Applications, 
94: %                 University of Illinois Urbana-Champaign, Urbana, IL
95: %                 61801}
96: %          \author{Tom Troland}
97: %          \affil{University of Kentucky}
98: %
99: %          \altaffiltext{1}{Astronomy Department, UIUC}
100: %
101: % In this example, the first three authors, "Plante", "Roberts", and
102: % "Crutcher" are affiliated with "NCSA".  "Crutcher" has an alternate 
103: % affiliation with the "Astronomy Department".  The fourth author,
104: % "Troland", is affiliated with "University of Kentucky"
105: 
106: \author{Alberto Accomazzi,
107: Michael J. Kurtz,
108: G\"unther Eichhorn,
109: Edwin Henneken,
110: Carolyn S. Grant,
111: Markus Demleitner,
112: Stephen S. Murray}
113: \affil{Harvard-Smithsonian Center for Astrophysics, 60 Garden Street, Cambridge, MA 02138}
114: 
115: %-----------------------------------------------------------------------
116: %			 Contact Information
117: %-----------------------------------------------------------------------
118: % This information will not appear in the paper but will be used by
119: % the editors in case you need to be contacted concerning your
120: % submission.  Enter your name as the contact along with your email
121: % address.
122: % 
123: % EXAMPLE:  \contact{Dennis Crabtree}
124: %           \email{crabtree@cfht.hawaii.edu}
125: %
126: 
\contact{Alberto Accomazzi}
\email{aaccomazzi@cfa.harvard.edu}
129: 
130: %-----------------------------------------------------------------------
131: %		      Author Index Specification
132: %-----------------------------------------------------------------------
133: % Specify how each author name should appear in the author index.  The 
134: % \paindex{ } should be used to indicate the primary author, and the
135: % \aindex for all other co-authors.  You MUST use the following
136: % syntax: 
137: %
138: % SYNTAX:  \aindex{Lastname, F. M.}
139: % 
140: % where F is the first initial and M is the second initial (if
141: % used).  This guarantees that authors that appear in multiple papers
142: % will appear only once in the author index.  
143: %
144: % EXAMPLE: \paindex{Crabtree, D.}
145: %          \aindex{Manset, N.}        
146: %          \aindex{Veillet, C.}        
147: %
148: % NOTE: this information is also used to build the author list that
149: % appears in the table of contents.  Authors will be listed in the order
150: % of the \paindex and \aindex commmands.
151: %
152: 
\paindex{Accomazzi, A.}
154: \aindex{Kurtz, M. J.}
155: \aindex{Eichhorn, G.}
156: \aindex{Henneken, E.}
157: \aindex{Grant, C. S.}
158: \aindex{Demleitner, M.}
159: \aindex{Murray, S. S.}
160: 
161: %-----------------------------------------------------------------------
162: %		      Author list for page header	
163: %-----------------------------------------------------------------------
164: % Please supply a list of author last names for the page header. in
165: % one of these formats:
166: %
167: % EXAMPLES:
168: % \authormark{Lastname}
169: % \authormark{Lastname1 \& Lastname2}
170: % \authormark{Lastname1, Lastname2, ... \& LastnameN}
171: % \authormark{Lastname et al.}
172: %
173: % Use the "et al." form in the case of seven or more authors, or if
174: % the preferred form is too long to fit in the header.
175: 
176: \authormark{Accomazzi et al.}
177: 
178: %-----------------------------------------------------------------------
179: %			Subject Index keywords
180: %-----------------------------------------------------------------------
181: % Enter a comma separated list of up to 6 keywords describing your
182: % paper.  These will NOT be printed as part of your paper; however,
183: % they will be used to generate the subject index for the proceedings.
184: % There is no standard list; however, you can consult the indices
185: % for past proceedings (http://adass.org/adass/proceedings/).
186: %
187: % EXAMPLE:  \keywords{visualization, astronomy: radio, parallel
188: %                     computing, AIPS++, Galactic Center}
189: %
190: % In this example, the author noticed that "radio astronomy" appeared
191: % in the ADASS VII Index as "astronomy" being the major keyword and
192: % "radio" as the minor keyword.  The colon is used to introduce another
193: % level into the index.
194: 
195: \keywords{Classification, Bibliographies}
196: 
197: %-----------------------------------------------------------------------
198: %			       Abstract
199: %-----------------------------------------------------------------------
200: % Type abstract in the space below.  Consult the User Guide and Latex
201: % Information file for a list of supported macros (e.g. for typesetting 
202: % special symbols). Do not leave a blank line between \begin{abstract} 
203: % and the start of your text.
204: 
\begin{abstract}          % Leave intact
We discuss two techniques used to characterize bibliographic records based
208: on their similarity to and relationship with the contents of the
209: NASA Astrophysics Data System (ADS) databases.
210: The first method has been used to classify input text as
211: being relevant to one or more subject areas based on an analysis of
212: the frequency distribution of its individual words.
213: The second method has been used to classify existing records as
214: being relevant to one or more databases based on the distribution
215: of the papers citing them. Both techniques have proven to be valuable
216: tools in assigning new and existing bibliographic records to different
217: disciplines within the ADS databases.
218: 
219: \end{abstract}
220: 
221: %-----------------------------------------------------------------------
222: %			      Main Body
223: %-----------------------------------------------------------------------
224: % Place the text for the main body of the paper here.  You should use
225: % the \section command to label the various sections; use of
226: % \subsection is optional.  Significant words in section titles should
227: % be capitalized.  Sections and subsections will be numbered
228: % automatically. 
229: %
230: % EXAMPLE:  \section{Introduction}
231: %           ...
232: %           \subsection{Our View of the World}
233: %           ...
234: %           \section{A New Approach}
235: %
236: % It is recommended that you look at the sample papers, sample1.tex
237: % and sample2.tex, for examples for formatting references, footnotes,
238: % figures, equations, html links, lists, and other special features.  
239: 
240: \section{Overview}
241: 
242: 
The NASA Astrophysics Data System (ADS; Kurtz et al.\ 2000)
maintains three main databases
of scientific bibliographies: Astronomy, Physics,
and the ArXiv e-prints.
Over the past few years the ADS has created and maintained a separate
``general'' database containing records which do not readily fit in
the three main databases.  The general database serves two purposes:
it acts as a staging area for bibliographic records which may
later be incorporated into one of the other databases, and it provides
a placeholder for those records which, while not directly related
to physics or astronomy, may cite or be cited by papers in those fields.
For instance, it is not unusual for physics papers to cite articles in
chemistry or computer science and vice versa.  A typical
use of the general database is to store all records from interdisciplinary
journals such as {\it Science} and {\it Nature}.  While some of the
articles published in these journals will be entered in the Astronomy
and Physics databases, their full tables of contents will always be
available in the general database.
261: 
When new records are provided to the ADS without any metadata enabling
them to be reliably labeled as belonging to either astronomy
or physics, a decision has to be made as to which
database they should be assigned to.  Given the sheer amount of
bibliographic data being handled by the ADS project
(Grant et al.\ 2000), this decision
268: has to be made automatically most of the time.  This paper describes how
269: we have made use of two different tools to help us with the automatic
270: classification of bibliographic records.  The first tool is a
271: text classifier which performs an analysis of textual data based
on a well-known Bayesian probabilistic model (McCallum \& Nigam 1998).
273: Classification of a document is performed by estimating
274: the likelihood of its membership in a certain database based on the
275: relative frequency of the words from the text in that database.
276: The second tool is a citation classifier which assigns existing ADS
277: records to one or more databases based on how frequently they have
278: been cited by the records in those databases.
The underlying assumption of the citation classifier is that any papers
which have been frequently cited by papers in a particular subject area
should be considered relevant to that subject area.
282: 
Both classifiers have been trained on a set of 400 articles
published in the journal {\it Nature} during 1987.  In this sample, a
librarian identified 39 records as being relevant to astronomy.  The classifiers
were tested against the full set of articles published by {\it Nature} in 1997
(4033 records, 434 of which had citation data).  These records consist
of scientific research articles as well as short news, editorials,
book reviews, and obituaries.
290: 
291: 
\section{The Text Classifier}
293: 
294: The problem of text classification can be summarized as follows:
295: given a certain string of words from a document, which of a finite
296: set of categories can this document be best assigned to?
297: Following a probabilistic approach, we chose to implement a
298: Multinomial Naive Bayesian Classifier
299: which allows a straightforward
300: computation of the category with the maximum likelihood based on 
301: the frequencies of the document's words within each category.
302: In our application, each category represents the set of documents
303: in a particular database.  Since the frequencies of the words in
304: each database are readily available from the database-specific indexes
305: that the ADS maintains, the computation of the probabilities
306: can be carried out in real time from the index data.
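One standard way of writing this computation, given here as an
illustrative sketch rather than as the exact expression used in the ADS
implementation, is the following.  Let $d$ denote a database,
$w_1, \ldots, w_n$ the words of the input text, and $f_d(w)$ the
frequency of word $w$ in the index of database $d$ (these symbols are
introduced here for illustration only).  The most likely database is then
\begin{equation}
\hat{d} = \arg\max_d \left[ \log P(d) + \sum_{i=1}^{n} \log P(w_i \mid d) \right],
\end{equation}
with the word probabilities estimated from the index frequencies as
\begin{equation}
P(w \mid d) = \frac{f_d(w) + 1}{\sum_{w'} f_d(w') + |V|},
\end{equation}
where $P(d)$ is the prior probability of database $d$ and $|V|$ is the
size of the vocabulary.  The added counts (Laplace smoothing, which is an
assumption of this sketch) keep a word that never occurs in a database
from driving the whole product to zero.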
307: 
In practice, the classifier performed well on documents for which
at least 20 text words were available from either the title or the
abstract.
The challenge we faced was classifying records for which only a title was
available.  In order to improve the classifier, a number of pre- and
post-scoring steps were taken:
316: 
317: \begin{itemize}
318: \item The input text was pre-processed using the standard parsing
rules used by the ADS search engine (Accomazzi et al.\ 2000).
320: 
321: \item All words consisting solely of digits were removed, as well
322: as title words and phrases which had no relevance for classification
(e.g.\ ``obituary'').
324: 
325: \item The likelihood score generated by the Bayesian classifier was
326: normalized in order to limit the contribution of records consisting of
327: few words, for which the highest rate of misclassification was found.
328: 
\item To compensate for the previous step, a set of
database-specific ``trigger'' keywords was defined which, when found,
boosted the classification score of the input text.
332: \end{itemize}
333: 
The resulting classifier was implemented as a function of two parameters:
335: $N_t$, the minimum number of words required for
336: a document to be considered classifiable, and $S_t$, the minimum
337: classification score necessary for a document to be considered belonging
338: to a particular database.
339: The results of the classification are displayed in Figure~\ref{fig1}, where
340: the performance of the classifier can be judged by looking at
341: the Precision ($P$) and Recall ($R$) of the classification for each input set
342: of cutoff scores and minimum number of words.  
As one can see from the plot, the classifier has been designed to yield
high precision regardless of the number of input words.  This is a crucial
345: issue for our application since we do not want to mistakenly assign
346: non-relevant records to any of the ADS databases. 
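In terms of these parameters, the decision rule can be summarized as
follows (the symbols $n$, $S_d$, $TP$, $FP$, and $FN$ are introduced here
for illustration only): a document with $n$ usable words and
classification score $S_d$ for database $d$ is assigned to $d$ when
\begin{equation}
n \geq N_t \quad \mbox{and} \quad S_d \geq S_t .
\end{equation}
Precision and recall follow their usual definitions,
\begin{equation}
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},
\end{equation}
where $TP$ is the number of records correctly assigned to the database,
$FP$ the number of records incorrectly assigned to it, and $FN$ the
number of relevant records that were not assigned.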
347: 
348: 
349: \section{The Citation Classifier}
350: 
351: The citation classifier was implemented to assign existing ADS
352: records to one or more databases based on how frequently they have
353: been cited by the records in those databases.
354: The underlying assumption of the citation classifier is that any papers
355: which have been frequently cited by papers in a particular subject area
should be considered relevant to that subject area (Kurtz et al.\ 2002).
The scope and usefulness of this classifier are obviously limited by
358: the availability of citation data for the records being considered:
359: an article which has not been cited in any of the astronomy or
360: physics journals for which the ADS has reference data cannot be
361: categorized by the classifier.  However, since the coverage of
362: journal references from the core astronomy literature is
363: quite thorough in the ADS, we can expect that important research
364: articles will be cited with some frequency within astronomy.
365: 
366: Based on this premise, we implemented a citation classifier
367: by considering, for each record for which citations are available,
368: the ratio between the number of
369: citations belonging to a particular database and the total number of citations.
If the ratio is high enough, a significant
portion of the papers citing the record in question come from a single
database, and we conclude that the record is relevant to that database.
The citation classifier was implemented as a function taking as
input two parameters: $N_c$, the minimum number of citations required
for a record to be considered classifiable, and $R_c$, the minimum ratio
between the number of citations in a particular database and the total
number of citations.
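Written out, with $c$ and $c_d$ as illustrative symbols that are not used
elsewhere in this paper: if a record has $c$ citations in total, of which
$c_d$ come from records in database $d$, it is assigned to $d$ when
\begin{equation}
c \geq N_c \quad \mbox{and} \quad \frac{c_d}{c} \geq R_c .
\end{equation}
As a hypothetical example, with $N_c = 4$ and $R_c = 0.5$ a record cited
10 times, 6 of them by astronomy papers, has a ratio of $0.6$ and would be
assigned to the astronomy database, whereas a record cited only 3 times
would be left unclassified whatever the origin of its citations.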
378: 
379: The performance of the citation classifier was tested against
the set of 434 articles in the {\it Nature} sample which
381: have citation data available.  The results of the classifier are
382: summarized in Figure~\ref{fig1}.
383: Once again we notice little variation in the performance of the classifier
384: as a function of the total number of citations for a given paper, which
385: is a desirable feature.  On the other hand, given the limited number
386: of citations for some of the records available, the recall is
387: much lower than what was achieved with the text classifier.
388: 
389: 
390: \begin{figure}
391: \plotone{P.122_1.eps}
392: \caption{Results for the text and citation classifiers} \label{fig1}
393: \end{figure}
394: 
395: \section{Discussion}
396: 
397: 
The text and citation classifiers described here have proven to
be valuable tools in categorizing records from scientific journals
such as {\it Nature} and {\it Science} for the purpose of introducing them into
the Astronomy or Physics databases.
Further inspection of the results showed that
misclassified papers are often borderline cases involving subjects
such as Geophysics and Planetary Science which overlap the different
databases.  Additionally, a small number of records which the librarian
had originally not selected as belonging to Astronomy were later
found to be relevant when the classifiers' results were reviewed.
409: 
Because the text and citation classifiers use
different data when assigning articles to a database, we find that
the best overall results can be achieved by combining
the output from both classifiers.  By choosing
conservative settings for the parameters controlling the classifiers
($S_t = 0.25$, $N_t = 5$, $R_c = 0.5$, $N_c = 4$),
we were able to achieve a precision of 0.94 with a
recall of 0.89 when classifying the sample against
the astronomy database.
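For reference, these two figures can be combined into a single number
using the balanced $F$-measure, a standard summary statistic that is not
otherwise used in this paper:
\begin{equation}
F_1 = \frac{2 P R}{P + R}
    = \frac{2 \times 0.94 \times 0.89}{0.94 + 0.89} \approx 0.91 .
\end{equation}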
419: 
420: \acknowledgments
421: The ADS is funded by NASA Grant NCC5-189 and is available online at
\htmladdURL{http://ads.harvard.edu}.
423: 
424: %-----------------------------------------------------------------------
425: %			      References
426: %-----------------------------------------------------------------------
427: % List your references below within the reference environment
428: % (i.e. between the \begin{references} and \end{references} tags).
429: % Each new reference should begin with a \reference command which sets
430: % up the proper indentation.  Observe the following order when listing
431: % bibliographical information for each reference:  author name(s),
432: % publication year, journal name, volume, and page number for
433: % articles.  Note that many journal names are available as macros; see
434: % the User Guide listing "macro-ized" journals.   
435: %
436: % EXAMPLE:  \reference Hagiwara, K., \& Zeppenfeld, D.\  1986, 
437: %                Nucl.Phys., 274, 1
438: %           \reference H\'enon, M.\  1961, Ann.d'Ap., 24, 369
439: %           \reference King, I.\ R.\  1966, \aj, 71, 276
440: %           \reference King, I.\ R.\  1975, in Dynamics of Stellar 
441: %                Systems, ed.\ A.\ Hayli (Dordrecht: Reidel), 99
442: %           \reference Tody, D.\  1998, \adassvii, 146
443: %           \reference Zacharias, N.\ \& Zacharias, M.\ 2003,
444: %                \adassxii, \paperref{P7.6}
445: % 
446: % Note the following tricks used in the example above:
447: %
448: %   o  \& is used to format an ampersand symbol (&).
449: %   o  \'e puts an accent agu over the letter e.  See the User Guide
450: %      and the sample files for details on formatting special
451: %      characters.  
452: %   o  "\ " after a period prevents LaTeX from interpreting the period 
453: %      as an end of a sentence.
454: %   o  \aj is a macro that expands to "Astron. J."  See the User Guide
455: %      for a full list of journal macros
456: %   o  \adassvii is a macro that expands to the full title, editor,
457: %      and publishing information for the ADASS VII conference
458: %      proceedings.  Such macros are defined for ADASS conferences I
459: %      through XI.
460: %   o  When referencing a paper in the current volume, use the
461: %      \adassxii and \paperref macros.  The argument to \paperref is
462: %      the paper ID code for the paper you are referencing.  See the 
463: %      note in the "Paper ID Code" section above for details on how to 
464: %      determine the paper ID code for the paper you reference.  
465: %
466: \begin{references}
467: 
468: \reference Accomazzi, A., Eichhorn, G., Kurtz, M.~J., 
469: Grant, C.~S., \& Murray, S.~S.\ 2000, \aaps, 143, 85
470: 
471: \reference Grant, C.~S., Accomazzi, 
472: A., Eichhorn, G., Kurtz, M.~J., \& Murray, S.~S.\ 2000, \aaps, 143, 111 
473:  
474: \reference Kurtz, M.~J., Eichhorn, 
475: G., Accomazzi, A., Grant, C.~S., Murray, S.~S., \& Watson, J.~M.\ 2000, 
476: \aaps, 143, 41 
477:  
478: \reference Kurtz, M.~J., Eichhorn, 
G., Accomazzi, A., Grant, C.~S., \& Murray, S.~S.\ 2002, Proc.\ SPIE, 4847,
480: 238 
481: 
\reference McCallum, A., \& Nigam, K.\ 1998,
483: AAAI-98 Workshop on Learning for Text Categorization,
484: \htmladdURL{http://www.cs.cmu.edu/\%7Ecmccallum}
485: 
486: \end{references}
487: 
488: % Do not place any material after the references section
489: 
490: \end{document}  % Leave intact
491: