cs0401028/kurtz.tex
1: % Submission version with embedded bibliography
2: 
3: \documentclass{svmult}
4: \usepackage{makeidx}     % allows index generation
5: \usepackage{graphicx}    % standard LaTeX graphics tool
6:                          % when including figure files
7: \usepackage{multicol}    % used for the two-column index
8: \makeindex
9: \def\under{\,|\,}
10: \def\argmax{\mathop{\rm argmax}}
11: 
12: \begin{document}
13: \title*{Automated Resolution of Noisy Bibliographic References}
14: \author{Markus Demleitner\inst{1,2}, Michael Kurtz\inst{2}, 
15: Alberto Accomazzi\inst{2},
16: G\"unther Eichhorn\inst{2},
17: Carolyn S.~Grant\inst{2},  Steven
18: S.~Murray\inst{2}}
19: 
20: \institute{Lehrstuhl f\"ur Computerlinguistik der Universit\"at Heidelberg,
21: Karlstr.~2, 69117 Heidelberg, Germany
22: \and
23: NASA Astrophysics Data System, Harvard-Smithsonian Center for
24: Astrophysics, 60 Garden Street, Cambridge, MA 02138, USA}
25: 
26: \authorrunning{Demleitner, Kurtz, et al.}
27: \maketitle
28: \begin{abstract}
29: We describe a system used by the NASA Astrophysics Data System to
30: identify bibliographic references obtained from scanned article pages by
31: OCR methods with records in a bibliographic database.  We analyze the
32: process generating the noisy references and conclude that the three-step
33: procedure of correcting the OCR results, parsing the corrected string
34: and matching it against the database provides unsatisfactory results.
35: Instead, we propose a method that allows a controlled merging of
36: correction, parsing and matching, inspired by dependency grammars.  We
37: also report on the effectiveness of various heuristics that we have
38: employed to improve recall.
39: \end{abstract}
40: 
41: \section{Introduction}
42: 
43: The importance of linking scholarly publications to each other
44: has received increasing
45: attention with the growing availability of such materials in
46: electronic form (see, e.g., van de Sompel and Hochstenbach, 1999).  Using
47: citations is probably the most straightforward
48: way to generate such links.
49: 
50: However, most publications and authors still do not give machine-readable 
51: publication identifiers like
52: DOIs in their reference sections. The automatic generation of links
53: from references is therefore a challenge even for recent literature.
54: Bergmark (2000), Lawrence et~al.~(1999) and 
55: Claivaz et~al.~(2001) investigate methods to solve this problem from
56: a record linkage point of view.
57: 
58: For historical literature, the situation is even worse in that not
59: even the ``clean'' reference strings as intended by the authors 
60: are usually available.  In 1999,
61: the NASA Astrophysics Data System (ADS, see Kurtz et~al., 2000) began to
62: gather reference
63: sections from scans of astronomical literature and subsequently 
64: processed them with OCR software.
65: This has yielded about three million references (Demleitner et~al., 1999), 
66: many of them with severe recognition errors.  We will call 
67: these references
68: \emph{noisy}, whereas references that were wrong in the original
69: publication will be denoted \emph{dangling}.  Noisy references show
70: the entire spectrum of classic OCR errors in addition to the usual
71: variations in citation style.  Consider the following examples:
72: 
73: \begin{quote}
74: Bidelman, W. P. 1951, Ap. J. "3, 304; Contr. McDonald Obs., No.
75: 199.\hfil\break
76: Eggen, 0. J. 195oa, Ap.J. III, 414; Contr. Lick Obs., Series II, No.
77: 27.\hfil\break
78: ---195ob, ibid. 112, 141; ibid., No. 30.\hfil\break
79: Huist, H. C. van de. 1950, Astrophys. J. 112,1.\hfil\break
80: 8tro\char'176mgren, B. 1956, Astron. J. 61, 45.\hfil\break
81: Morando, B. 1963, "Recherches sur les orbites de resonance, "in Proceedings of t
82: he First International Symposium on the Use of Artilicial Satellites for Geodesy
83: , Washington, D. C. (North- Holland Publishing Company, Amsterdam, 1963), p. 42.
84: \end{quote}
85: 
86: While our situation was worse than the one solved by the record
87: linkage approaches cited above, we had the advantage of being able to
88: restate the problem as a classification problem, since we were only
89: interested in resolving references to publications contained in the ADS'
90: abstract database or in deciding that the
91: target of the reference is not in the ADS.  The classifier thus has to
92: choose among (currently) 3.5~million categories.
93: 
94: A method to solve this problem was recently developed by
95: Takasu (2003) using Hidden Markov Models to both parse
96: and match noisy references (Takasu calls them
97: ``erroneous'').  While his approach is very different from ours,
98: we believe the ideas behind our resolver and our experiences with it may
99: benefit other groups that also face the problem of resolving noisy
100: references.
101: 
102: In the remainder of this paper, we will first state
103: the problem in a rather general setting, then discuss the basic ideas
104: of our approach in this framework, describe the heuristics we used to
105: improve resolution rates and their effectiveness and finally discuss
106: the performance of our system in the real world.
107: 
108: \section{Statement of the Problem}
109: 
110: \begin{figure}
111: \centering
112: \includegraphics{fig_kurtz.epsi}
113: \caption{A noisy channel model for references obtained from OCR. $F$
114: and $F'$ are tuple-valued random variables, $S$ and $S'$ are
115: string-valued random variables.}
116: \label{noisychannel}
117: \end{figure}
118: 
119: Fig.~\ref{noisychannel} shows a noisy channel model for the generation
120: of a noisy reference from an original reference that 
121: corresponds to an entry in a bibliographic database.  In principle,
122: the resolving problem is obtaining 
123: \begin{equation}
124: \argmax_{F}P_m(F'\under
125: F)\,P_p(S\under F')\,P_r(S'\under S),\quad S\in\Sigma^\ast,
126: F'\in(\Sigma^\ast)^{n_f}
127: \label{argmax-eq}
128: \end{equation}for a given noisy reference $S'$.  Here,
129: $\Sigma$ is the base alphabet (in our case, we normalize everything to
130: 7-bit ASCII) and $n_f$ is the number of fields in a bibliographic
131: record.  The domain of random variable $F$ is the database
132: plus the special value $\emptyset$ for references that are missing
133: from the database but nevertheless valid.
134: 
135: The straightforward approach of modeling each distribution
136: mentioned above separately and trying to compute (\ref{argmax-eq})
137: from back to front will not work very well.  To see why, let us briefly
138: examine each element in the channel.  
139: 
140: Under a typical model for an OCR system, $P_r(S'\under S)$ will have many likely
141: $S$ for any $S'$, since references do not follow
142: common language models\footnote{To give an
143: example, the sequence ``L1'' will have a
144: very low probability in normal text, but, depending on the reference
145: syntax employed by authors, could occur in up to 2.5\% of the
146: references in our sample (it is actually found in 1.7\% of the OCRed
147: strings).} and are hard to model in general because of
148: mixed languages and (as text goes) high entropy.
149: 
150: In contrast, the ``parsing'' distribution $P_p(S\under F')$ 
151: is sharply peaked at few values.  Although reference syntax is much less
152: uniform than one might wish, even regular grammars can cope with a large
153: portion of the references, avoiding ambiguity altogether.  Even if
154: the situation is not so simple in the presence of titles or with
155: monographs and conferences, the number of interpretations for a given
156: value of $S$ with nonvanishing likelihood will be in the tens.
157: 
158: In the matching step modeled by $P_m(F'\under F)$, we have a similar
159: situation.  For journal references, ambiguity is very low indeed, and
160: even for books this record linkage problem is harmless
161: with $P_m(F'\under F)$ sharply peaked on
162: at worst a few dozen $F$.  The main complication here is detecting the
163: case $F=\emptyset$.
164: 
165: So, while $P_m$ and $P_p$ have quite low conditional entropies,
166: the one of $P_r$ is very high.  This is unfortunate, because in
167: computing (\ref{argmax-eq}) one would generate many $S$ only to throw
168: them away when computing $P_p$ or $P_m$.
169: 
170: In this light, an attempt to resolve noisy references along the lines
171: of the suggestion by Accomazzi et~al.~(1999) for clean references -- which
172: boils down to computing $\argmax_F P_m\left(F\under \argmax_{F'}
173: P_p(F'\under S)\right)$ -- is bound to fail when extended to noisy
174: references.
175: 
176: It is clear that there have to be better ways since the
177: conditional entropy of $P(F\under S')$ is rather low, as can
178: be seen from the fact
179: that a human can usually tell very quickly what the correct
180: interpretation for even a very noisy reference is, at least when
181: equipped with a bibliographic search engine like the ADS itself.
182: 
183: Takasu (2003) describes how Dual and Variable-length
184: output Hidden Markov Models can be used to model a combined
185: conditional distribution $P_{p,r}(F'\under S')$, thus exploiting that
186: many likely values of $S$ will not parse well and therefore have a low
187: combined probability.  The idea of combining distributions is
188: instrumental to our approach as well.
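
Read against Eq.~(\ref{argmax-eq}), one way to write such a combined
distribution is to marginalize over the unobserved clean string $S$.
The following is only a sketch in the notation of this section, not
Takasu's formulation, and it assumes a prior $P(F')$ over fielded
records that is not part of the model stated above:
\begin{displaymath}
P_{p,r}(F'\under S')\;\propto\;P(F')\sum_{S\in\Sigma^\ast}
P_p(S\under F')\,P_r(S'\under S).
\end{displaymath}
Any $S$ for which $P_p(S\under F')$ is negligible contributes
practically nothing to the sum, which is why the many OCR-plausible
but unparseable strings need not be enumerated.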
189: 
190: \section{Our Approach}
191: 
192: \subsection{Core resolution}
193: 
194: One foundation of our resolver comes from 
195: dependency grammars (Heringer, 1993) in
196: natural language processing, which are based
197: on the observation that given the ``head'' of a (natural
198: language) phrase (say, a verb), certain
199: ``slots'' need to be filled (e.g., \emph{eat} will usually have to be
200: complemented with something that eats and something that is eaten).
201: 
202: In the domain of reference resolving, the equivalent of a phrase is
203: the reference.
204: As the head of this phrase, we chose the publication source, 
205: i.e., a journal or conference name, a book title, a
206: hint that a given publication is a Ph.D.~thesis or a preprint.  This
207: was done for three reasons.  Firstly, it is easy to
208: robustly extract this information from references in our domain,
209: secondly, there are relatively few possible heads (disregarding
210: monographs), and thirdly,
211: the publication source governs the grammar of the entire
212: reference.
213: 
214: For example, in addition to the publication year and
215: authors, references to most journals 
216: need a volume and a page, while a
217: Ph.D.~thesis is complemented by the name of an institution, and
218: reports or documents from the ArXiv~preprint
219: servers may just take a single number.
220: 
221: Let us for now assume that references follow the regular
222: expression \emph{Author+ Year Rest}, where
223: Rest contains a mixture of alphabetic and numeric characters, and a
224: title is not given for parts of article collections
225: -- in astronomy, almost all references
226: follow this grammar.
227: A simple regular expression can identify the year with very close to 100\% recall
228: and precision even in noisy references, yielding a robust fielding of
229: the reference.
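
The fielding step can be sketched in Python as follows.  This is a
minimal illustration, not the ADS implementation: the pattern
\texttt{YEAR\_RE}, the function \texttt{field\_reference} and the set
of tolerated misreadings (o/O for 0, l/I for 1) are assumptions made
for the example.

```python
import re

# Hypothetical pattern: a four-character year whose last two characters may be
# digit look-alikes (o/O for 0, l/I for 1), optionally followed by a suffix
# letter as in "1950a".
YEAR_RE = re.compile(r"\b(1[5-9]|20)([0-9oOlI]{2})([a-z]?)\b")
LOOKALIKES = str.maketrans("oOlI", "0011")


def field_reference(ref):
    """Split a reference into authors, year and rest; None if no year is found."""
    m = YEAR_RE.search(ref)
    if m is None:
        return None
    year = int(m.group(1) + m.group(2).translate(LOOKALIKES))
    return {
        "authors": ref[:m.start()].strip(" ,."),
        "year": year,
        "rest": ref[m.end():].strip(" ,."),
    }


fields = field_reference("Eggen, 0. J. 195oa, Ap.J. III, 414")
print(fields)
```

Taking the leftmost match reflects the assumed \emph{Author+ Year Rest}
order; a page number that happens to look like a year would only be
reached if no year precedes it.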
230: 
231: To find the head as defined above, we simply 
232: collect all alphabetic characters from the
233: Rest. The remaining numeric
234: information, i.e., all sequences of digits separated by non-digits,
235: are the fillers required by the head.  This exploits that 
236: fillers are almost always numeric and avoids dependency on syntactic
237: markers like commas that are very prone to misrecognition.  
238: Heads that have non-numeric fillers (mostly theses and monographs)
239: receive special treatment.
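
A minimal sketch of this split, under the assumption that the Rest has
already been isolated, might look as follows.  The function name
\texttt{split\_head\_and\_fillers} and the lowercasing of the head are
choices made for the example only.

```python
import re


def split_head_and_fillers(rest):
    """Return (head, fillers): all letters joined, and all digit runs in order."""
    # The head ignores punctuation and blanks entirely, so misrecognized
    # commas or periods cannot disturb it.
    head = "".join(re.findall(r"[A-Za-z]+", rest)).lower()
    # Fillers are the maximal digit sequences, separated by any non-digit.
    fillers = re.findall(r"\d+", rest)
    return head, fillers


print(split_head_and_fillers("Astrophys. J. 112,1."))
```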
240: 
241: This head is matched against an authority file that
242: maps $N_t$ full titles and common
243: abbreviations for the sources known to the ADS to a ``bibstem''
244: (cf.~Grant et~al., 2000).  We select the $n$ best matches among these, where
245: $n=5$ proved a good choice.  
246: To assess the quality of a match, a string edit distance suffices.
247: The one we use is $$1-{(\Delta(a,h)-|a|)L(a,h)\over |h|},$$
248: where $a$ and $h$ are a string from the authority file and the head,
249: respectively, $\Delta(a,h)$ denotes the number of matching trigrams
250: from $h$ that are found in $a$, $L(a,h)$ is the plain Levenshtein
251: distance (Levenshtein, 1966) and $|\,.\,|$ is the length of the string.  
252: The worst-case runtime of this procedure is
253: $O(|h|\max(|a|)N_t\log N_t)$, but since we compute trigram
254: similarities first and compute Levenshtein distances only for
255: those $a$ having at least half as many trigrams in common with $h$ as
256: the best matching $a$,
257: typical run time will be of order
258: $O(|h|^2\log|h|)$.
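
The two-stage selection (trigram prefilter, then Levenshtein distance
on the survivors) can be sketched as below.  This is an illustration
under stated assumptions: the authority file is a small hypothetical
dictionary, and the ranking uses a plain length-normalized Levenshtein
distance as a stand-in rather than the combined measure given above.

```python
def trigrams(s):
    """Set of all character trigrams in s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def levenshtein(a, b):
    """Plain Levenshtein distance with unit weights."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def n_best_sources(head, authority, n=5):
    """Return up to n bibstems, best first.

    authority maps a normalized title or abbreviation to a bibstem.
    """
    head_tri = trigrams(head)
    shared = {a: len(head_tri & trigrams(a)) for a in authority}
    best = max(shared.values(), default=0)
    # Keep entries sharing at least half as many trigrams as the best one.
    survivors = [a for a, k in shared.items() if 2 * k >= best]
    # Stand-in score: normalized Levenshtein distance (smaller is better).
    survivors.sort(key=lambda a: (levenshtein(a, head) / max(len(a), len(head), 1), a))
    result = []
    for a in survivors:
        if authority[a] not in result:
            result.append(authority[a])
    return result[:n]


# Hypothetical authority entries, for illustration only.
AUTHORITY = {"apj": "ApJ", "astrophysj": "ApJ", "astronj": "AJ",
             "aj": "AJ", "mnras": "MNRAS"}
print(n_best_sources("astrophvsj", AUTHORITY))
```

Note that heads shorter than three characters have no trigrams, so the
prefilter lets every entry through; this is one reason very short
heads need separate care.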
259: 
260: This corresponds to maximizing
261: $P_{p,r}((\ldots,{\it source},\ldots)\under S')$, i.e., we derive a
262: distribution on publication sources directly from the noisy reference.
263: The conditional entropy of this distribution is relatively low, because 
264: there are few possible sources (order $10^4$) and the edit distance
265: induces a sharply peaked distribution.
266: 
267: For each bibstem, the number of slots and their
268: interpretation is known\footnote{Actually, we have an exception list
269: and normally assume two slots, volume and page.}, and we can
270: simply match the slots with the fillers or give educated guesses on
271: insertion or deletion errors based on our knowledge of the fillers
272: expected.  In the noisy channel model, this corresponds to greedily evaluating
273: $P_{p,r}(F'\under S',(\ldots,{\it
274: source},\ldots))$.  While in principle, the distribution would have a
275: rather high conditional entropy (e.g., many readings for the numerals
276: would have to be taken into account), it turns out that most of these
277: complications can be accounted for in the matching step, alleviating
278: the need to actually produce multiple $F'$, even more so since parsing
279: errors frequently resemble errors made by authors in assembling their
280: references, which are modeled in $P_m$.
281: 
282: If filling the slots with the available fillers is not possible,
283: the next best head is tried.  Otherwise, we have a complete fielded
284: record $f'$ that can be matched against the database using a $P_m$ to
285: be discussed shortly.  If this
286: matching is successful, the resolution process stops; otherwise, the
287: next best head is tried.
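
The control flow just described can be sketched as a short loop.  The
callables \texttt{slots\_for} and \texttt{match} are hypothetical
stand-ins for the slot table and the matching step, and the sketch
simplifies slot filling to an exact count check, whereas the text
above also allows educated guesses on insertion and deletion errors.

```python
def resolve_core(heads, fillers, slots_for, match):
    """Try candidate heads in order; return the first database hit or None.

    heads     -- bibstems, best match first
    fillers   -- numeric fillers extracted from the reference
    slots_for -- callable: bibstem -> list of slot names (hypothetical)
    match     -- callable: fielded record -> database key or None (hypothetical)
    """
    for bibstem in heads:
        slots = slots_for(bibstem)
        # Simplification: require exactly one filler per slot.
        if len(fillers) != len(slots):
            continue
        record = dict(zip(slots, fillers), bibstem=bibstem)
        hit = match(record)
        if hit is not None:
            return hit
    return None


# Toy database with a single record, for illustration only.
db = {("ApJ", "113", "304"): "1951ApJ...113..304B"}
slots_for = lambda bibstem: ["volume", "page"]
match = lambda rec: db.get((rec["bibstem"], rec["volume"], rec["page"]))
print(resolve_core(["AJ", "ApJ"], ["113", "304"], slots_for, match))
```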
288: 
289: The matching has to be a fast operation since it is potentially tried
290: many times.  Fortunately, the bibliographic 
291: identifiers (bibcodes, see Grant et~al., 2000) used by the ADS are, for
292: serials, computable from the record in constant time, and thus,
293: matching requires a simple table lookup,
294: taking $O(\log N_r)$ time for $N_r$ records we have to match
295: against.
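
To illustrate why the identifier is computable from a fielded record,
the following sketch builds a bibcode in the 19-character layout
(year, bibstem, volume, qualifier, page, first letter of the first
author's last name, padded with periods).  It is a simplified
assumption for serials only and ignores letter pages, page numbers
longer than four digits and other special cases; a Python set stands
in for the lookup table.

```python
def make_bibcode(year, bibstem, volume, page, author_initial, qualifier="."):
    """Assemble a simplified serial bibcode from a fielded record."""
    return "{:04d}{:.<5}{:.>4}{}{:.>4}{}".format(
        year,
        bibstem,          # left-justified, padded to 5 characters
        str(volume),      # right-justified, padded to 4 characters
        qualifier,
        str(page),        # right-justified, padded to 4 characters
        author_initial.upper(),
    )


# Hypothetical table of known identifiers.
known = {"1951ApJ...113..304B"}
code = make_bibcode(1951, "ApJ", 113, 304, "b")
print(code, code in known)
```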
296: 
297: Due to the construction of bibcodes, the plain bibcode match only
298: checks the first character of the first author's last name.  
299: The numbers below show that the entropy of
300: references with respect to the distribution implied by our algorithm
301: is so low that this shortcoming does not impact precision noticeably
302: -- put another way, the likelihood that OCR errors conspire to produce
303: a valid reference is very small even without using most of the
304: information from the author field.
305: 
306: The core resolving process typically runs in $O(\log N_r |h|^2\log|h|)$ time.
307: On a 1400 MHz Athlon XP machine, a Python script implementing this
308: resolves about 100~references per second and already catches more than
309: 84\% of the total resolvable references in our set of 3,027,801
310: noisy references.
311: 
312: \subsection{Reference Matching}
313: 
314: For journals for which the database can be assumed complete, $P_m(F\under F')$ 
315: is nontrivial, i.e., different from $\delta_{F,F'}$.  The single most
316: important ingredient is a mapping from volume numbers to publication
317: years and vice versa, because even if one field is wrong 
318: because of either OCR or author errors, the other can be
319: reconstructed.  We also scan the surrounding page range (authors
320: surprisingly frequently use the last page of an article) and try
321: swapping adjacent digits in the page number.  
322: Finally, we try special sections of journals
323: (usually letter pages).  The definition of this matching implies that
324: $P_m(F=f\under F'=f')=0$ if $f$ and $f'$ differ in more than one field.
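
These rules amount to enumerating records that differ from $f'$ in
exactly one field.  A sketch of such an enumeration is given below;
the tables \texttt{year\_of\_volume} and \texttt{volumes\_of\_year},
the window size and the record layout are assumptions for the
example, and special journal sections are left out.

```python
def candidate_records(record, year_of_volume, volumes_of_year, window=2):
    """Yield variants of record that differ from it in exactly one field."""
    # Trust the volume and repair the year.
    year = year_of_volume.get((record["bibstem"], record["volume"]))
    if year is not None and year != record["year"]:
        yield dict(record, year=year)
    # Trust the year and repair the volume.
    for volume in volumes_of_year.get((record["bibstem"], record["year"]), ()):
        if volume != record["volume"]:
            yield dict(record, volume=volume)
    # Scan the surrounding page range.
    page = record["page"]
    for delta in range(1, window + 1):
        for p in (page - delta, page + delta):
            if p > 0:
                yield dict(record, page=p)
    # Swap adjacent digits of the page number.
    digits = str(page)
    for i in range(len(digits) - 1):
        swapped = digits[:i] + digits[i + 1] + digits[i] + digits[i + 2:]
        if swapped != digits and not swapped.startswith("0"):
            yield dict(record, page=int(swapped))


# Hypothetical mapping tables, for illustration only.
record = {"bibstem": "ApJ", "year": 1950, "volume": 113, "page": 304}
year_of_volume = {("ApJ", 113): 1951}
volumes_of_year = {("ApJ", 1950): [111, 112]}
for cand in candidate_records(record, year_of_volume, volumes_of_year):
    print(cand)
```

Each candidate would then be looked up by its identifier, so the cost
of the whole scan stays a small constant number of table lookups.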
325: 
326: While these rules are somewhat ad hoc, they are also
327: straightforward and probably would not profit from learning.
328: They alone account for 8\% of the successfully resolved references
329: without further source string manipulation.
330: 
331: When any of these rules are applied, the authors given in the
332: reference are matched against those in the database using 
333: a tailored string edit distance.  It is computed by
334: deleting initials, first names and common
335: non-author phrases (currently ``and'', ``et'', and ``al'') and
336: then evaluating $${\it fault}=\sum_{w'\in A'}\min_{w\in A} 
337: L(w',w),$$ where $L$ is the Levenshtein distance with all weights one
338: and $A$ and $A'$ the author last names for the paper in the database and
339: from the reference, respectively.  The edit distance then is $d_a=1-{\it fault}/{\it
340: limit}$, where {\it
341: limit} is given by allowing 2 errors for each word shorter than 5
342: characters, 3 errors for each word shorter than 10 characters and 4
343: errors otherwise.  This reflects that OCR
344: language models do much better on longer words than on shorter ones,
345: even if they come from non-English languages.  Unless we have reason
346: to be stricter (usually with monographs), we accept a match if
347: $d_a>0$.
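
The author comparison can be sketched as follows.  Two points are
assumptions of this example rather than statements about the
production code: initials are recognized simply as single-letter
tokens (first names spelled out are not detected), and {\it limit} is
summed over the words taken from the reference.

```python
import re

NON_AUTHOR = {"and", "et", "al"}


def levenshtein(a, b):
    """Plain Levenshtein distance with unit weights."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def last_names(author_field):
    """Lowercased words, without single-letter initials and non-author phrases."""
    words = re.findall(r"[A-Za-z]+", author_field.lower())
    return [w for w in words if len(w) > 1 and w not in NON_AUTHOR]


def allowed_errors(word):
    """2 errors below 5 characters, 3 below 10 characters, 4 otherwise."""
    if len(word) < 5:
        return 2
    if len(word) < 10:
        return 3
    return 4


def author_score(ref_authors, db_authors):
    """d_a = 1 - fault / limit; a match is accepted if the result is positive."""
    a_ref, a_db = last_names(ref_authors), last_names(db_authors)
    if not a_ref or not a_db:
        return 0.0
    fault = sum(min(levenshtein(w_ref, w) for w in a_db) for w_ref in a_ref)
    limit = sum(allowed_errors(w_ref) for w_ref in a_ref)
    return 1 - fault / limit


print(author_score("Huist, H. C. van de", "van de Hulst, H. C."))
```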
348: 
349: If all string manipulations described below have failed to yield a
350: match, we relax $P_m$ for all sources
351: and also try to match identifiers with a different
352: first author (in case the author order is wrong), scan a page range of
353: plausible mis-spellings and try identifiers with different
354: qualifiers\footnote{This is necessary if there is more than one
355: article mapping to the same bibcode on one page, for details see
356: Grant et~al.~(2000).}.  7.8\% of the total
357: resolved references were only accepted after this.  We have not
358: attempted to ascertain how many of these references were dangling in
359: the original publication.
360: 
361: \subsection{Monographs and Theses}
362: 
363: The procedures described above are useful for serials and article
364: collections of all kinds.  Two kinds of publications have to be
365: treated differently.
366: 
367: As mentioned above, theses have alphabetic fillers.  
368: Thus, we use keyword spotting (a
369: hand-tailored regular expression for possible readings of ``Thesis'')
370: to identify the head within the rest.  Together with the first
371: character of the author's last name and the publication year, 
372: we select a set of candidates and
373: match authors and granting institutions analogously to the author
374: matching procedure described above.
375: 
376: Monographs are completely outside this kind of handling. For them, a
377: set of candidates is selected based on the first character of the
378: author name and the publication year, and authors and titles are
379: matched.  Since this is a very time-consuming procedure, it is only
380: attempted if the resolving to serials failed.
381: 
382: Note that using authors as heads as is basically done with
383: monographs would probably most closely mimic the techniques of 
384: human librarians. However, given
385: the fragility of author names both in the OCR
386: process and in transliteration, we doubt that a low-entropy
387: distribution would result from doing so.
388: 
389: \section{Heuristics}
390: 
391: Takasu (2003) conjectured that the comparatively unsatisfactory
392: performance of his method could be significantly improved through the
393: use of a set of heuristics.  We find that the same is true for our
394: approach. Almost 16\% of the total resolved references only become
395: resolvable by the algorithm outlined above after some heuristic
396: manipulations are performed on the noisy reference.
397: 
398: We apply a sequence of such manipulations
399: ordered according to their ``daringness'' and re-resolve after each
400: manipulation.  These manipulations -- typically regular-expression
401: based string operations -- model a noisy channel, but of course
402: it would be very hard to write down its governing distribution.
403: Still, it may be useful to see which heuristics had which
404: payoff.
405: 
406: In a first step, we correct the most frequently
407: misrecognized abbreviations based on regular expressions for the
408: errors.  We concentrate on abbreviations because misrecognitions in
409: longer words usually do not confuse our matching algorithm.
410: While better models may have a higher payoff, our method
411: only contributes
412: 0.6\% of the total resolved references.
413: 
414: The second step is more effective at 1.7\% of the total
415: resolved references.  We code rules about common misreadings of
416: numerals in a set of regular expressions, including substituting
417: numerals at the beginning of the reference using
418: a unigram model for OCR errors, fixing numerals within the reference string using a
419: hand-crafted bigram model and joining single digits to a preceding
420: group to make up for blank insertion errors.
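
The flavor of these numeral rules can be conveyed by a few regular
expressions.  The three patterns below are hypothetical examples
written for this illustration; they are not the rule set or the
unigram and bigram models actually used.

```python
import re

LOOKALIKES = str.maketrans("oOlI", "0011")

# A token of digits and digit look-alikes that contains at least one digit.
MIXED = re.compile(r"\b(?=[0-9oOlI]*[0-9])[0-9oOlI]{2,}\b")
# A token of exactly three I/l characters, read as the number 111.
ALL_ONES = re.compile(r"\b[Il]{3}\b")
# A single digit split off from a preceding digit group by a blank.
SPLIT_DIGIT = re.compile(r"(\d{2,}) (\d)(?!\d)")


def fix_numerals(rest):
    """Apply the three illustrative numeral repairs in sequence."""
    rest = MIXED.sub(lambda m: m.group(0).translate(LOOKALIKES), rest)
    rest = ALL_ONES.sub(lambda m: "1" * len(m.group(0)), rest)
    rest = SPLIT_DIGIT.sub(r"\1\2", rest)
    return rest


print(fix_numerals("Ap.J. III, 414"))
print(fix_numerals("Ap. J. 11 2, 1l4"))
```

Rules of this kind are daring (the second one would also rewrite a
genuine roman numeral III), which is why such manipulations are only
applied after the unmodified reference has failed to resolve.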
421: 
422: Still more effective, at 4.9\% of the total resolved references, are transformations 
423: on the alphabetic part behind the year, including
424: attempts to remove additional specifications (e.g., ``English
425: Translation''), and mostly very
426: domain-specific operations with the purpose of increasing the
427: conformity of journal specifications with the authority information
428: used by the source matcher.  The most important measure here, however,
429: is handling very short
430: publication names (``AJ'') that are particularly hard for the OCR.
431: From these experiences we believe a learning system will have to have a
432: special mode for short heads.
433: 
434: The last fixing step is dissecting the source specification along
435: separators (we use commas and colons) and trying to use the part that
436: yields the best match against the authority file
437: as the new head.  This usually removes bibliographic information
438: primarily in references to conference proceedings.  0.9\% of the total resolved
439: references become resolvable after this.  Note that this step would be
440: more important if we had to frequently deal with title removal.
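
A sketch of this dissection step is given below.  The scoring function
is an assumption: it uses \texttt{difflib.SequenceMatcher} from the
Python standard library against a tiny hypothetical authority list,
standing in for the source matcher described earlier.

```python
import re
from difflib import SequenceMatcher

# Hypothetical authority entries, for illustration only.
AUTHORITY = ["iau symposium", "asp conference series"]


def normalize(part):
    """Lowercase and keep only letters and blanks."""
    return re.sub(r"[^a-z ]", "", part.lower()).strip()


def authority_score(part):
    """Best similarity of the part to any authority entry (stand-in measure)."""
    text = normalize(part)
    return max(SequenceMatcher(None, text, entry).ratio() for entry in AUTHORITY)


def best_part_as_head(source, score=authority_score):
    """Split the source at commas and colons; return the best-scoring part."""
    parts = [p.strip() for p in re.split(r"[,:]", source) if p.strip()]
    if not parts:
        return None
    return max(parts, key=score)


src = "in Stellar Populations, IAU Symposium 164, ed. P. C. van der Kruit"
print(best_part_as_head(src))
```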
441: 
442: Further, less interesting, heuristics are applied to bring references
443: into the format required by the resolver, including title removal
444: (for astronomy references,
445: this is rarely needed), reconstructing references that refer to parts
446: of other references, and splitting reference lines containing two or more
447: references.  This last task
448: only applies to the rare entries consisting of two separate references
449: listed together by the author.  The resolver makes no attempt to discover
450: errors in line joining that were made earlier in the processing chain.
451: 
452: \section{Application}
453: 
454: Our dataset from OCR currently contains 3,027,801 references (some
455: $10^4$ of which actually consist of non-reference material
456: misclassified by the reference cutting engine).  Of these, 2,552,229
457: (or about 84\%) could be resolved to records in the database.
458: 
459: In order to assess recall and precision of the system described here,
460: we created a subset of 852 references
461: by selecting each reference with a probability of 0.00025, which yielded
462: 118 references that were not resolved and 734 that were resolved. 
463: We then manually resolved each selected reference, correcting dangling
464: references as best we could.  Thus, the following numbers compare the
465: resolver's $P(F\under S')$ with a human's $P(F\under S')$.
466: 
467: The result was that two of the 734 resolved records were incorrectly
468: resolved.  In both cases, the correct record was not in the ADS, which
469: illustrates that the $F=\emptyset$ problem dominates the issue of
470: precision.
471: Of the non-resolved records, 94 were not in
472: our database, while 23 were, though six of these were marked doubtful
473: by the human resolvers.  Counting doubtful cases as errors, we
474: thus have a precision of more than 99\% and a recall of about 97\%.  
475: Of the 17 definite
476: false negatives, 7 are severely dangling or excessively noisy references to journals, while 6 are
477: references to conference proceedings and the rest monographs.
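
One reading of the counts above that reproduces the quoted figures is
the following; the exact convention behind the quoted numbers is not
spelled out, so this is an assumption.  With 734 resolved references,
2 of them incorrect, and 23 unresolved references that are in the
database (doubtful cases included),
\begin{displaymath}
\mbox{precision}\approx{734-2\over 734}\approx 0.997,\qquad
\mbox{recall}\approx{734\over 734+23}\approx 0.970.
\end{displaymath}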
478: 
479: Note that it is highly unlikely that any of the drawn references were
480: ever inspected during the development of the heuristics.  Still, one
481: might question if evaluating the resolver with data that at least
482: might have been used to ``train'' it is justified.
483: Since during development we mainly inspected resolving 
484: failures rather than possibly incorrectly resolved references, we
485: would expect the fact that we did not hold back pristine reference data
486: for evaluation purposes to impact recall more than precision.
487: 
488: For journal literature between 1981 and 1998, we also compared 
489: the resolver result with data purchased
490: from ISI's Science Citation Index\footnote{See http://www.isinet.com/}.
491: Randomly selecting 1\% of the articles covered by
492: ISI and removing references to sources outside the 
493: ISI sample, we had 10832 citing-cited
494: pairs, of which 311 were missing in the OCR sample and 1151 were
495: missing from ISI.
496: 
497: A manual examination of the citing-cited pairs missing from the OCR
498: sample revealed that 112
499: were really attributable to the resolver, 107 were due to incorrect
500: reconstructions of reference lines, and 86 references were missed because
501: the references were not found by the reference zone identification.
502: 
503: Of the references apparently missing from ISI, 2 were due to 
504: resolver errors\footnote{Actually, in one case the OCR conspired to
505: produce an almost valid reference to a wrong paper, in the second
506: case, incorrect line joining resulted in two references that were
507: mangled into a valid one.}, and less than 20\% were dangling references that
508: ISI did not correct, but were clearly identifiable nevertheless.  
509: We have not attempted to identify why the other (correct) pairs
510: were missing from our ISI sample; most problems probably
511: were introduced during the necessarily conservative matchup between
512: records from ISI and the ADS, and possibly in the selection of our data set
513: from ISI's database.
514: 
515: For journal articles (others are, for the most part, 
516: not available from ISI), we can thus state a recall of 99\% and a
517: precision of 99.9\% for our resolver and a recall of about 97\% for
518: the complete system.
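
These figures are consistent with the counts given above if the
10832 citing-cited pairs are taken as the reference set, which is an
assumption about how the numbers were derived.  Of the 311 pairs
missing from the OCR sample, 112 are attributable to the resolver, so
\begin{displaymath}
\mbox{recall}_{\rm resolver}\approx 1-{112\over 10832}\approx 0.990,
\qquad
\mbox{recall}_{\rm system}\approx 1-{311\over 10832}\approx 0.971.
\end{displaymath}
Likewise, 2 resolver errors among the $10832-311=10521$ pairs present
in the OCR sample correspond to a precision above 0.999.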
519: 
520: \section{Discussion}
521: 
522: In this paper we contend that robust interpretation of bibliographic 
523: references, as
524: required when resolving references obtained by current OCR techniques,
525: should integrate as much information obtainable from a set of known
526: publications as possible even in parsing and not delay incorporating
527: this information to a ``matching'' or linkage phase.
528: 
529: Our approach has been inspired by dependency grammars, in which a head
530: of a phrase governs the interpretation of the remaining elements.  For
531: (noisy) references, it is advantageous to use the name or type
532: of the publication as head.
533: The existence of
534: bibliographic identifiers that are for most references easily computable from
535: fielded records has been instrumental for the performance of our
536: system.
537: 
538: While we believe some of the rather ad hoc string manipulations and
539: edit distances employed by our current system can and should be
540: substituted by sound and learning algorithms, it seems evident to us
541: that a certain degree of domain-specific knowledge (most notably, a
542: mapping between publication dates and volumes) is very important for
543: robust resolving.
544: 
545: The system discussed here has been in continuous use at the ADS for
546: the past four years, for noisy references from OCR as well as for
547: references from digital sources.  The ADS
548: in turn is arguably the most important bibliographic tool in astronomy and
549: astrophysics.  The fact that the ADS has received very few complaints
550: concerning the accuracy of its citations backs the estimates 
551: of recall and precision given above.
552: 
553: 
554: \begin{acknowledgement}
555: We wish to thank Regina Weineck for help in the generation of validation
556: data.
557: 
558: The NASA Astrophysics Data System is funded
559: by NASA Grant NCC5-189.
560: \end{acknowledgement}
561: 
562: \begin{thebibliography}{03}
563: 
564: \bibitem{accomazzi1999}
565: Accomazzi, A., Eichhorn, G., Kurtz, M., Grant, C., and Murray, S.
566: (1999). ``The ADS Bibliographic Reference Resolver.''  
567: In   {\em
568:   Astronomical Data Analysis Software and Systems VIII}, 
569:   R.~L. Plante, and D.~A. Roberts (eds.), Vol. 172 of {\em ASP
570:   Conference Series}, pp.~291--294
571: 
572: \bibitem{Bergmark2000}
573: Bergmark, D. (2000).
574: {\em Automatic Extraction of Reference Linking Information from
575:   Online Documents}.
576: Technical Report TR 2000-1821, Computer Science Department, Cornell
577:   University
578: 
579: \bibitem{claivaz2001cern}
580: Claivaz, J.-B., Meur, J.-Y.~L., and Robinson, N. (2001).
581: ``From Fulltext Documents to Structured Citations: CERN's Automated
582: Solution,'' {\em HEP Libraries Webzine} 5
583: (http://doc.cern.ch/heplw/5/papers/2/)
584: 
585: \bibitem{Demleitner1999}
586: {{Demleitner}, M. and {Accomazzi}, A. and {Eichhorn}, G. and {Grant}, C.~S. and
587:   {Kurtz}, M.~J. and {Murray}, S.~S.} (1999). ``Looking at 3,000,000
588:   References Without Growing Grey Hair,''
589: {\em {Bulletin of the American Astronomical Society}} 31, 1496
590: 
591: \bibitem{Grant2000}
592: {Grant}, C.~S., {Accomazzi}, A., {Eichhorn}, G., {Kurtz}, M.~J., and {Murray},
593:   S.~S. (2000). ``The NASA Astrophysics Data System: Data holdings,''
594:  {\em Astronomy and Astrophysics Supplement} {\bf 143}, 111--135
595: 
596: \bibitem{heringer1993dependency}
597: Heringer, H.~J. (1993). ``Dependency syntax -- basic ideas and the
598: classical model.''
599: In {\em Syntax -- An International Handbook of Contemporary Research, volume 1}, J. Jacobs, A. von Stechow, W. Sternefeld, and T. Vennemann (eds.),
600:   Walter de Gruyter, Berlin, New York, pp.~298--316. 
601: 
602: \bibitem{Kurtz2000ADS}
603: {Kurtz}, M.~J., {Eichhorn}, G., {Accomazzi}, A., {Grant}, C.~S., {Murray},
604:   S.~S., and {Watson}, J.~M. (2000). ``The NASA Astrophysics Data
605:   System: Overview,'' {\em Astronomy and Astrophysics Supplement} {\bf
606:   143}, 41--59
607: 
608: \bibitem{lawrence99digital}
609: Lawrence, S., Giles, C.~L., and Bollacker, K. (1999). ``Digital
610: Libraries and {Autonomous Citation Indexing},''
611: {\em IEEE Computer} {\bf 32(6)}, 67--71
612: 
613: \bibitem{levenshtein}
614: Levenshtein, V.~I. (1966). ``Binary codes capable of correcting
615: deletions, insertions and reversals,''
616: {\em Soviet Physics Doklady} {\bf 10}, 707--710
617: 
618: \bibitem{takasu2003erroneous}
619: Takasu, A. (2003). ``Bibliographic attribute extraction from erroneous
620: references based on a statistical model.'' In {\em Proceedings of the third ACM/IEEE-CS joint conference on
621:   Digital libraries}, pp.~49--60
622: 
623: \bibitem{vandesompel1999linking}
624: van de~Sompel, H.~V. and Hochstenbach, P. (1999). ``Reference Linking
625: in a Hybrid Library Environment,''
626: {\em D-Lib Magazine} 5(4)
627: 
628: \end{thebibliography}
629: \end{document}
630: