0401:cs0401028/kurtz.tex

1: % Submission version with embedded bibliography

2:

3: \documentclass{svmult}

4: \usepackage{makeidx}     % allows index generation

5: \usepackage{graphicx}    % standard LaTeX graphics tool

6:                          % when including figure files

7: \usepackage{multicol}    % used for the two-column index

8: \makeindex

9: \def\under{\,|\,}

10: \def\argmax{\mathop{\rm argmax}}

11:

12: \begin{document}

13: \title*{Automated Resolution of Noisy Bibliographic References}

14: \author{Markus Demleitner\inst{1,2}, Michael Kurtz\inst{2},

15: Alberto Accomazzi\inst{2},

16: G\"unther Eichhorn\inst{2},

17: Carolyn S.~Grant\inst{2},  Steven

18: S.~Murray\inst{2}}

19:

20: \institute{Lehrstuhl f\"ur Computerlinguistik der Universit\"at Heidelberg,

21: Karlstr.~2, 69117 Heidelberg, Germany

22: \and

23: NASA Astrophysics Data System, Harvard-Smithsonian Center for

24: Astrophysics, 60 Garden Street, Cambridge, MA 02138, USA}

25:

26: \authorrunning{Demleitner, Kurtz, et al.}

27: \maketitle

28: \begin{abstract}

29: We describe a system used by the NASA Astrophysics Data System to

30: identify bibliographic references obtained from scanned article pages by

31: OCR methods with records in a bibliographic database.  We analyze the

32: process generating the noisy references and conclude that the three-step

33: procedure of correcting the OCR results, parsing the corrected string

34: and matching it against the database provides unsatisfactory results.

35: Instead, we propose a method that allows a controlled merging of

36: correction, parsing and matching, inspired by dependency grammars.  We

37: also report on the effectiveness of various heuristics that we have

38: employed to improve recall.

39: \end{abstract}

40:

41: \section{Introduction}

42:

43: The importance of linking scholarly publications to each other

44: has received increasing

45: attention with the growing availability of such materials in

46: electronic form (see, e.g., van de Sompel, 1999).  The use of

47: citations is probably the most straightforward

48: approach to generate such links.

49:

50: However, most publications and authors still do not give machine-readable

51: publication identifiers like

52: DOIs in their reference sections. The automatic generation of links

53: from references therefore is a challenge even for recent literature.

54: Bergmark (2000), Lawrence et~al.~(1999) and

55: Claivaz et~al.~(2001) investigate methods to solve this problem from

56: a record linkage point of view.

57:

58: For historical literature, the situation is even worse in that not

59: even the ``clean'' reference strings as intended by the authors

60: are usually available.  In 1999,

61: the NASA Astrophysics Data System (ADS, see Kurtz et~al., 2000) began to

62: gather reference

63: sections from scans of astronomical literature and subsequently

64: processed them with OCR software.

65: This has yielded about three million references (Demleitner et~al., 1999),

66: many of them with severe recognition errors.  We will call

67: these references

68: \emph{noisy}, whereas references that were wrong in the original

69: publication will be denoted \emph{dangling}.  Noisy references show

70: the entire spectrum of classic OCR errors in addition to the usual

71: variations in citation style.  Consider the following examples:

72:

73: \begin{quote}

74: Bidelman, W. P. 1951, Ap. J. "3, 304; Contr. McDonald Obs., No.

75: 199.\hfil\break

76: Eggen, 0. J. 195oa, Ap.J. III, 414; Contr. Lick Obs., Series II, No.

77: 27.\hfil\break

78: ---195ob, ibid. 112, 141; ibid., No. 30.\hfil\break

79: Huist, H. C. van de. 1950, Astrophys. J. 112,1.\hfil\break

80: 8tro\char'176mgren, B. 1956, Astron. J. 61, 45.\hfil\break

81: Morando, B. 1963, "Recherches sur les orbites de resonance, "in Proceedings of t

82: he First International Symposium on the Use of Artilicial Satellites for Geodesy

83: , Washington, D. C. (North- Holland Publishing Company, Amsterdam, 1963), p. 42.

84: \end{quote}

85:

86: While our situation was worse than the one solved by the record

87: linkage approaches cited above, we had the advantage of being able to

88: restate the problem as a classification problem, since we were only

89: interested in resolving references to publications contained in the ADS'

90: abstract database -- or in deciding that the

91: target of the reference is not in the ADS.  This is basically a

92: classification problem in which there are (currently) 3.5~million categories.

93:

94: A method to solve this problem was recently developed by

95: Takasu (2003) using Hidden Markov Models to both parse

96: and match noisy references (Takasu calls them

97: ``erroneous'').  While his approach is very different from ours,

98: we believe the ideas behind and our experiences with our resolver may

99: benefit other groups also facing the problem of resolution of noisy

100: references.

101:

102: In the remainder of this paper, we will first state

103: the problem in a rather general setting, then discuss the basic ideas

104: of our approach in this framework, describe the heuristics we used to

105: improve resolution rates and their effectiveness and finally discuss

106: the performance of our system in the real world.

107:

108: \section{Statement of the Problem}

109:

110: \begin{figure}

111: \centering

112: \includegraphics{fig_kurtz.epsi}

113: \caption{A noisy channel model for references obtained from OCR. $F$

114: and $F'$ are tuple-valued random variables, $S$ and $S'$ are

115: string-valued random variables.}

116: \label{noisychannel}

117: \end{figure}

118:

119: Fig.~\ref{noisychannel} shows a noisy channel model for the generation

120: of a noisy reference from an original reference that

121: corresponds to an entry in a bibliographic database.  In principle,

122: the resolving problem is obtaining

123: \begin{equation}

124: \argmax_{F}P_m(F'\under

125: F)\,P_p(S\under F')\,P_r(S'\under S),\quad S\in\Sigma^\ast,

126: F'\in(\Sigma^\ast)^{n_f}

127: \label{argmax-eq}

128: \end{equation}for a given noisy reference $S'$.  Here,

129: $\Sigma$ is the base alphabet (in our case, we normalize everything to

130: 7-bit ASCII) and $n_f$ is the number of fields in a bibliographic

131: record.  The domain of random variable $F$ is the database

132: plus the special value $\emptyset$ for references that are missing

133: from the database but nevertheless valid.

134:

135: The straightforward approach of modeling each distribution

136: mentioned above separately and trying to compute (\ref{argmax-eq})

137: from back to front will not work very well.  To see why, let us briefly

138: examine each element in the channel.

139:

140: Under a typical model for an OCR system, $P_r(S'\under S)$ will have many likely

141: $S$ for any $S'$, since references do not follow

142: common language models\footnote{To give an

143: example, the sequence ``L1'' will have a

144: very low probability in normal text, but, depending on the reference

145: syntax employed by authors, could occur in up to 2.5\% of the

146: references in our sample (it is actually found in 1.7\% of the OCRed

147: strings).} and are hard to model in general because of

148: mixed languages and (as text goes) high entropy.

149:

150: In contrast, the ``parsing'' distribution $P_p(S\under F')$

151: is sharply peaked at few values.  Although reference syntax is much less

152: uniform than one might wish, even regular grammars can cope with a large

153: portion of the references, avoiding ambiguity altogether.  Even if

154: the situation is not so simple in the presence of titles or with

155: monographs and conferences, the number of interpretations for a given

156: value of $S$ with nonvanishing likelihood will be in the tens.

157:

158: In the matching step modeled by $P_m(F'\under F)$, we have a similar

159: situation.  For journal references, ambiguity is very low indeed, and

160: even for books this record linkage problem is harmless

161: with $P_m(F'\under F)$ sharply peaked on

162: at worst a few dozen $F$.  The main complication here is detecting the

163: case $F=\emptyset$.

164:

165: So, while $P_m$ and $P_p$ have quite low conditional entropies,

166: that of $P_r$ is very high.  This is unfortunate, because in

167: computing (\ref{argmax-eq}) one would generate many $S$ only to throw

168: them away when computing $P_p$ or $P_m$.

169:

170: In this light, an attempt to resolve noisy references along the lines

171: of the suggestion by Accomazzi et~al.~(1999) for clean references -- which

172: boils down to computing $\argmax_F P_m\left(F\under \argmax_{F'}

173: P_p(F'\under S)\right)$ -- is bound to fail when extended to noisy

174: references.

175:

176: It is clear that there have to be better ways since the

177: conditional entropy of $P(F\under S')$ is rather low, as can

178: be seen from the fact

179: that a human can usually tell very quickly what the correct

180: interpretation for even a very noisy reference is, at least when

181: equipped with a bibliographic search engine like the ADS itself.

182:

183: Takasu (2003) describes how Dual and Variable-length

184: output Hidden Markov Models can be used to model a combined

185: conditional distribution $P_{p,r}(F'\under S')$, thus exploiting that

186: many likely values of $S$ will not parse well and therefore have a low

187: combined probability.  The idea of combining distributions is

188: instrumental to our approach as well.

189:

190: \section{Our Approach}

191:

192: \subsection{Core resolution}

193:

194: One foundation of our resolver comes from

195: dependency grammars (Heringer, 1993) in

196: natural language processing, which are based

197: on the observation that given the ``head'' of a (natural

198: language) phrase (say, a verb), certain

199: ``slots'' need to be filled (e.g., eat will usually have to be

200: complemented with something that eats and something that is eaten).

201:

202: In the domain of reference resolving, the equivalent of a phrase is

203: the reference.

204: As the head of this phrase, we chose the publication source,

205: i.e., a journal or conference name, a book title, a

206: hint that a given publication is a Ph.D.~thesis or a preprint.  This

207: was done for three reasons.  Firstly, it is easy to

208: robustly extract this information from references in our domain,

209: secondly, there are relatively few possible heads (disregarding

210: monographs), and thirdly,

211: the publication source governs the grammar of the entire

212: reference.

213:

214: For example, in addition to the publication year and

215: authors, references to most journals

216: need a volume and a page, while a

217: Ph.D.~thesis is complemented by a name of an institution, and

218: reports or documents from the ArXiv~preprint

219: servers may just take a single number.

220:

221: Let us for now assume that references follow the regular

222: expression \emph{Author+ Year Rest}, where

223: Rest contains a mixture of alphabetic and numeric characters, and a

224: title is not given for parts of article collections

225: -- in astronomy, almost all references

226: follow this grammar.

227: A simple regular expression can identify the year with very close to 100\% recall

228: and precision even in noisy references, yielding a robust fielding of

229: the reference.
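
The fielding step can be illustrated with a minimal sketch.  The pattern and the function name \texttt{field\_reference} are assumptions made for this illustration and are not the expression used in the production resolver; in particular, the sketch only accepts years whose digits were recognized correctly.

```python
import re

# Hypothetical year pattern (an assumption, not the authors' expression):
# four digits in the range 1500-2099, optionally followed by a letter
# suffix such as the "a" in "1950a".
YEAR_RE = re.compile(r"(?<![0-9])(1[5-9][0-9]{2}|20[0-9]{2})([a-z]?)(?![0-9])")


def field_reference(ref):
    """Split a reference into authors, year and rest; None if no year is found."""
    m = YEAR_RE.search(ref)
    if m is None:
        return None
    return {
        "authors": ref[:m.start()].rstrip(" ,("),
        "year": m.group(1),
        "suffix": m.group(2),
        "rest": ref[m.end():].lstrip(" ,.)"),
    }


fields = field_reference("Bidelman, W. P. 1951, Ap. J. 113, 304")
```

Because the year is the first four-digit group in an \emph{Author+ Year Rest} reference, taking the first match is sufficient in this sketch.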

230:

231: To find the head as defined above, we simply

232: collect all alphabetic characters from the

233: Rest. The remaining numeric

234: information, i.e., all sequences of digits separated by non-digits,

235: are the fillers required by the head.  This exploits that

236: fillers are almost always numeric and avoids dependency on syntactic

237: markers like commas that are very prone to misrecognition.

238: Heads that have non-numeric fillers (mostly theses and monographs)

239: receive special treatment.
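
A minimal sketch of this separation follows; the function name \texttt{split\_head\_and\_fillers} is chosen for this illustration only, and the special treatment of non-numeric fillers is not covered.

```python
import re


def split_head_and_fillers(rest):
    """Collect all alphabetic characters as the head, all digit runs as fillers."""
    head = "".join(ch for ch in rest if ch.isalpha())
    fillers = re.findall(r"[0-9]+", rest)
    return head, fillers


head, fillers = split_head_and_fillers("Astrophys. J. 112,1.")
```

Note that punctuation and blanks play no role here, so a comma misread as a period or a dropped blank does not change the result.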

240:

241: This head is matched against an authority file that

242: maps $N_t$ full titles and common

243: abbreviations for the sources known to the ADS to a ``bibstem''

244: (cf.~Grant et~al., 2000).  We select the $n$ best matches among these, where

245: $n=5$ proved a good choice.

246: To assess the quality of a match, a string edit distance suffices.

247: The one we use is $$1-{(\Delta(a,h)-|a|)L(a,h)\over |h|},$$

248: where $a$ and $h$ are a string from the authority file and the head,

249: respectively, $\Delta(a,h)$ denotes the number of matching trigrams

250: from $h$ that are found in $a$, $L(a,h)$ is the plain Levenshtein

251: distance (Levenshtein, 1966) and $|\,.\,|$ is the length of the string.

252: The worst-case runtime of this procedure is

253: $O(|h|\max(|a|)N_t\log N_t)$, but since we compute trigram

254: similarities first and compute Levenshtein distances only for

255: those $a$ having at least half as many trigrams in common with $h$ as

256: the best matching $a$,

257: typical run time will be of order

258: $O(|h|^2\log|h|)$.
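
The two-stage selection can be sketched as follows.  This is an illustration under stated assumptions: the toy authority file, the use of distinct (set-valued) trigrams, and the ranking by length-normalized Levenshtein distance are all simplifications introduced here; the ranking in particular stands in for the similarity measure given above and does not reproduce it.

```python
def trigrams(s):
    """Return the set of distinct character trigrams of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def levenshtein(a, b):
    """Plain Levenshtein distance with unit weights."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def best_heads(head, authority, n=5):
    """Return up to n (authority string, bibstem) pairs, best match first."""
    h = head.lower()
    h_grams = trigrams(h)
    # Stage 1: cheap trigram overlap against every authority string.
    overlap = {a: len(h_grams & trigrams(a.lower())) for a in authority}
    best = max(overlap.values(), default=0)
    # Keep strings with at least half as many common trigrams as the best one.
    shortlist = [a for a, c in overlap.items() if 2 * c >= best]

    # Stage 2: Levenshtein distance only for the shortlist (stand-in ranking).
    def distance(a):
        return levenshtein(h, a.lower()) / max(len(h), len(a), 1)

    return [(a, authority[a]) for a in sorted(shortlist, key=distance)[:n]]


# Toy authority file, hypothetical entries for illustration only.
authority = {"ApJ": "ApJ", "AstrophysJ": "ApJ", "AstronJ": "AJ", "MNRAS": "MNRAS"}
candidates = best_heads("AstrophvsJ", authority)
```

In this toy run only two of the four authority strings survive the trigram filter, so the more expensive distance is computed twice rather than four times.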

259:

260: This corresponds to maximizing

261: $P_{p,r}((\ldots,{\it source},\ldots)\under S')$, i.e., we derive a

262: distribution on publication sources directly from the noisy reference.

263: The conditional entropy of this distribution is relatively low, because

264: there are few possible sources (order $10^4$) and the edit distance

265: induces a sharply peaked distribution.

266:

267: For each bibstem, the number of slots and their

268: interpretation is known\footnote{Actually, we have an exception list

269: and normally assume two slots, volume and page.}, and we can

270: simply match the slots with the fillers or give educated guesses on

271: insertion or deletion errors based on our knowledge of the fillers

272: expected.  In the noisy channel model, this corresponds to greedily evaluating

273: $P_{p,r}(F'\under S',(\ldots,{\it

274: source},\ldots))$.  While in principle, the distribution would have a

275: rather high conditional entropy (e.g., many readings for the numerals

276: would have to be taken into account), it turns out that most of these

277: complications can be accounted for in the matching step, alleviating

278: the need to actually produce multiple $F'$, even more so since parsing

279: errors frequently resemble errors made by authors in assembling their

280: references, which are modeled in $P_a$.

281:

282: If filling the slots with the available fillers is not possible,

283: the next best head is tried, otherwise, we have a complete fielded

284: record $f'$ that can be matched against the database using a $P_m$ to

285: be discussed shortly.  If this

286: matching is successful, the resolution process stops, otherwise, the

287: next best head is tried.
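
The control flow of this loop can be sketched as below.  The callables \texttt{slots\_for} and \texttt{lookup} and the toy database are hypothetical placeholders; the sketch is also stricter than the procedure described above, since it makes no educated guesses about inserted or deleted fillers.

```python
def resolve(candidate_heads, fillers, slots_for, lookup):
    """Try each candidate head in order; return the first database match or None."""
    for bibstem in candidate_heads:
        slots = slots_for(bibstem)
        if len(fillers) != len(slots):
            continue  # slots cannot be filled: try the next best head
        record = dict(zip(slots, fillers), source=bibstem)
        match = lookup(record)
        if match is not None:
            return match  # matching succeeded: stop here
    return None


# Toy setup, for illustration only.
database = {("ApJ", "112", "1"): "1950ApJ...112....1V"}


def slots_for(bibstem):
    return ["volume", "page"]


def lookup(record):
    return database.get((record["source"], record["volume"], record["page"]))


result = resolve(["AJ", "ApJ"], ["112", "1"], slots_for, lookup)
```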

288:

289: The matching has to be a fast operation since it is potentially tried

290: many times.  Fortunately, the bibliographic

291: identifiers (bibcodes, see Grant et~al., 2000) used by the ADS are, for

292: serials, computable from the record in constant time, and thus,

293: matching requires a simple table lookup,

294: taking $O(\log N_r)$ time for $N_r$ records we have to match

295: against.
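
As an illustration, the following sketch assembles a bibcode for a serial and looks it up in a sorted table.  It assumes the common 19-character layout (year, five-character bibstem, four-character volume, one-character qualifier, four-character page, first letter of the first author's last name); Grant et~al.~(2000) remain the authoritative description, and the helper names are chosen for this example only.

```python
import bisect


def make_bibcode(year, bibstem, volume, page, first_author, qualifier="."):
    """Assemble a bibcode for a serial from a fielded record (assumed layout)."""
    return (
        "%04d" % year
        + bibstem.ljust(5, ".")        # bibstem, left-justified, dot-padded
        + str(volume).rjust(4, ".")    # volume, right-justified, dot-padded
        + qualifier
        + str(page).rjust(4, ".")      # page, right-justified, dot-padded
        + first_author[0].upper()
    )


def contains(sorted_bibcodes, bibcode):
    """Binary search in a sorted list of bibcodes, O(log N_r) comparisons."""
    i = bisect.bisect_left(sorted_bibcodes, bibcode)
    return i < len(sorted_bibcodes) and sorted_bibcodes[i] == bibcode


table = sorted(["2000A&AS..143..111G", "2000A&AS..143...41K"])
code = make_bibcode(2000, "A&AS", 143, 111, "Grant")
```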

296:

297: Due to the construction of bibcodes, the plain bibcode match only

298: checks the first character of the first author's last name.

299: The numbers below show that the entropy of

300: references with respect to the distribution implied by our algorithm

301: is so low that this shortcoming does not impact precision noticeably

302: -- put another way, the likelihood that OCR errors conspire to produce

303: a valid reference is very small even without using most of the

304: information from the author field.

305:

306: The core resolving process typically runs in $O(\log N_r |h|^2\log|h|)$ time.

307: On a 1400 MHz Athlon XP machine, a Python script implementing this

308: resolves about 100~references per second and already catches more than

309: 84\% of the total resolvable references in our set of 3,027,801

310: noisy references.

311:

312: \subsection{Reference Matching}

313:

314: For journals for which the database can be assumed complete, $P_m(F\under F')$

315: is nontrivial, i.e., different from $\delta_{F,F'}$.  The single most

316: important ingredient is a mapping from volume numbers to publication

317: years and vice versa, because even if one field is wrong

318: because of either OCR or author errors, the other can be

319: reconstructed.  We also scan the surrounding page range (authors

320: surprisingly frequently use the last page of an article) and try

321: swapping adjacent digits in the page number.

322: Finally, we try special sections of journals

323: (usually letter pages).  The definition of this matching implies that

324: $P_m(F=f\under F'=f')=0$ if $f$ and $f'$ differ in more than one field.
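
The candidate generation behind these rules can be sketched as follows.  The window size, the function names and the toy volume table are assumptions made for this illustration; special journal sections are left out.

```python
def page_candidates(page, window=3):
    """Yield the page, its neighbours, and variants with adjacent digits swapped."""
    candidates = [page]
    for d in range(1, window + 1):
        if page - d > 0:
            candidates.append(page - d)
        candidates.append(page + d)
    digits = str(page)
    for i in range(len(digits) - 1):
        swapped = digits[:i] + digits[i + 1] + digits[i] + digits[i + 2:]
        candidates.append(int(swapped))
    seen = set()
    for c in candidates:
        if c not in seen:  # drop duplicates, keep the order
            seen.add(c)
            yield c


def year_volume_candidates(year, volume, volume_to_year):
    """Return (year, volume) pairs that are consistent with the journal's mapping."""
    pairs = []
    if volume in volume_to_year:  # trust the volume, reconstruct the year
        pairs.append((volume_to_year[volume], volume))
    for v, y in volume_to_year.items():  # trust the year, reconstruct the volume
        if y == year and (year, v) not in pairs:
            pairs.append((year, v))
    return pairs


# Toy mapping for a single journal, for illustration only.
toy_volumes = {111: 1950, 112: 1950, 113: 1951}
pages = list(page_candidates(414, window=1))
pairs = year_volume_candidates(1950, 113, toy_volumes)
```

Since only one field is varied at a time, every candidate differs from the parsed record in at most one field, in line with the restriction on $P_m$ stated above.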

325:

326: While these rules are somewhat ad hoc, they are also

327: straightforward and probably would not profit from learning.

328: They alone account for 8\% of the successfully resolved references

329: without further source string manipulation.

330:

331: When any of these rules are applied, the authors given in the

332: reference are matched against those in the database using

333: a tailored string edit distance.  It is computed by

334: deleting initials, first names and common

335: non-author phrases (currently ``and'', ``et'', and ``al'') and

336: then evaluating $${\it fault}=\sum_{w'\in A'}\min_{w\in A}

337: L(w',w),$$ where $L$ is the Levenshtein distance with all weights one

338: and $A$ and $A'$ the author last names for the paper in the database and

339: from the reference, respectively.  The edit distance then is $d_a=1-{\it fault}/{\it

340: limit}$, where {\it

341: limit} is given by allowing 2 errors for each word shorter than 5

342: characters, 3 errors for each word shorter than 10 characters and 4

343: errors otherwise.  This reflects that OCR

344: language models do much better on longer words than on shorter ones,

345: even if they come from non-English languages.  Unless we have reason

346: to be stricter (usually with monographs), we accept a match if

347: $d_a>0$.
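
The author comparison can be sketched as below.  Two points are assumptions of this illustration: initials are recognized simply as single-letter words (first names are not handled), and {\it limit} is summed over the last names found in the reference.

```python
import re

STOP_WORDS = {"and", "et", "al"}


def levenshtein(a, b):
    """Levenshtein distance with all weights one."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def last_names(text):
    """Drop single-letter initials and non-author phrases, keep the other words."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w.lower() for w in words if len(w) > 1 and w.lower() not in STOP_WORDS]


def allowance(word):
    """Errors allowed per word: 2 below 5 characters, 3 below 10, 4 otherwise."""
    if len(word) < 5:
        return 2
    if len(word) < 10:
        return 3
    return 4


def author_score(reference_authors, database_authors):
    """Return d_a = 1 - fault/limit; a match is accepted if the result is > 0."""
    a_ref = last_names(reference_authors)
    a_db = last_names(database_authors)
    if not a_ref or not a_db:
        return 0.0
    fault = sum(min(levenshtein(w_ref, w) for w in a_db) for w_ref in a_ref)
    limit = sum(allowance(w) for w in a_ref)
    return 1.0 - fault / limit


noisy = author_score("8tromgren, B.", "Stromgren, B.")
clean = author_score("Eggen, 0. J.", "Eggen, O. J.")
wrong = author_score("Smith, A.", "Jones, B.")
```

In the first call the misread initial letter costs one edit out of three allowed, so the match is still accepted.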

348:

349: If all string manipulations described below have not yielded a

350: match, we relax $P_m$ for all sources

351: and also try to match identifiers with a different

352: first author (in case the author order is wrong), scan a page range of

353: plausible mis-spellings and try identifiers with different

354: qualifiers\footnote{This is necessary if there is more than one

355: article mapping to the same bibcode on one page, for details see

356: Grant et~al.~(2000).}.  7.8\% of the total

357: resolved references were only accepted after this.  We have not

358: attempted to ascertain how many of these references were dangling in

359: the original publication.

360:

361: \subsection{Monographs and Theses}

362:

363: The procedures described above are useful for serials and article

364: collections of all kinds.  Two kinds of publications have to be

365: treated differently.

366:

367: As mentioned above, theses have alphabetic fillers.

368: Thus, we use keyword spotting (a

369: hand-tailored regular expression for possible readings of ``Thesis'')

370: to identify the head within the rest.  Together with the first

371: character of the author's last name and the publication year,

372: we select a set of candidates and

373: match authors and granting institutions analogously to the author

374: matching procedure described above.

375:

376: Monographs are completely outside this kind of handling. For them, a

377: set of candidates is selected based on the first character of the

378: author name and the publication year, and authors and titles are

379: matched.  Since this is a very time-consuming procedure, it is only

380: attempted if the resolving to serials failed.

381:

382: Note that using authors as heads as is basically done with

383: monographs would probably most closely mimic the techniques of

384: human librarians. However, given

385: the fragility of author names both in the OCR

386: process and in transliteration, we doubt that a low-entropy

387: distribution would result from doing so.

388:

389: \section{Heuristics}

390:

391: Takasu (2003) conjectured that the comparatively unsatisfactory

392: performance of his method could be significantly improved through the

393: use of a set of heuristics.  We find that the same is true for our

394: approach. Almost 16\% of the total resolved papers only become

395: resolvable by the algorithm outlined above after some heuristic

396: manipulations are performed on the noisy reference.

397:

398: We apply a sequence of such manipulations

399: ordered according to their ``daringness'' and re-resolve after each

400: manipulation.  These manipulations -- typically regular-expression

401: based string operations -- model a noisy channel, but of course

402: it would be very hard to write down its governing distribution.

403: Still, it may be useful to see what heuristics had what

404: payoff.
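
The overall control flow can be sketched in a few lines.  The sketch assumes that manipulations accumulate, i.e., each one operates on the output of the previous one; the callables passed in are placeholders.

```python
def resolve_with_heuristics(ref, resolve, manipulations):
    """Re-resolve after each manipulation, least daring manipulation first."""
    result = resolve(ref)
    for manipulate in manipulations:
        if result is not None:
            break  # stop as soon as the reference resolves
        ref = manipulate(ref)
        result = resolve(ref)
    return result


# Toy resolver and manipulation, for illustration only.
def toy_resolve(ref):
    return "resolved" if ref == "1950a" else None


def toy_fix(ref):
    return ref.replace("o", "0")


outcome = resolve_with_heuristics("195oa", toy_resolve, [toy_fix])
```

Ordering by daringness means that a reference is never subjected to a more aggressive rewrite than it needs to resolve.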

405:

406: In a first step, we correct the most frequently

407: misrecognized abbreviations based on regular expressions for the

408: errors.  We concentrate on abbreviations because misrecognitions in

409: longer words usually do not confuse our matching algorithm.

410: While better models may have a higher payoff, our method

411: only contributes

412: 0.6\% of the total resolved references.

413:

414: The second step is more effective at 1.7\% of the total

415: resolved references.  We code rules about common misreadings of

416: numerals in a set of regular expressions, including substituting

417: numerals at the beginning of the reference using

418: a unigram model for OCR errors, fixing numerals within the reference string using a

419: hand-crafted bigram model and joining single digits to a preceding

420: group to make up for blank insertion errors.
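
Rules of this kind might look as follows.  The confusion tables and patterns are illustrative guesses that reflect one reading of the three rule types; they are not the hand-crafted models used in the resolver.

```python
import re

# Hypothetical confusion tables, assumed for this sketch.
LEADING_DIGIT_TO_LETTER = {"8": "S", "0": "O", "1": "I", "5": "S"}
DIGIT_LOOKALIKES = str.maketrans({"o": "0", "O": "0", "l": "1", "I": "1"})
YEAR_LIKE = re.compile(r"\b(1[0-9oOlI]{3})([a-z]?)\b")


def fix_leading_digit(ref):
    """Replace a digit at the very start of a reference by a look-alike letter."""
    if len(ref) > 1 and ref[0] in LEADING_DIGIT_TO_LETTER and ref[1].isalpha():
        return LEADING_DIGIT_TO_LETTER[ref[0]] + ref[1:]
    return ref


def fix_year_numerals(ref):
    """Map letters misread inside a year-like token back to digits."""
    return YEAR_LIKE.sub(
        lambda m: m.group(1).translate(DIGIT_LOOKALIKES) + m.group(2), ref)


def join_split_digits(ref):
    """Join a single digit to the preceding digit group (blank insertion)."""
    return re.sub(r"(?<=[0-9]) (?=[0-9]\b)", "", ref)


a = fix_leading_digit("8tromgren, B. 1956, Astron. J. 61, 45.")
b = fix_year_numerals("Eggen, 0. J. 195oa, Ap.J. III, 414")
c = join_split_digits("Ap. J. 11 2, 141")
```

The year rule deliberately leaves ``III'' untouched, since it only fires on tokens that start with the digit 1; recovering a volume such as 111 from ``III'' needs the context-dependent treatment described above.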

421:

422: At 4.9\% of the total still more effective are transformations

423: on the alphabetic part behind the year, including

424: attempts to remove additional specifications (e.g., ``English

425: Translation''), and mostly very

426: domain-specific operations with the purpose of increasing the

427: conformity of journal specifications with the authority information

428: used by the source matcher.  The most important measure here, however,

429: is handling very short

430: publication names (``AJ'') that are particularly hard for the OCR.

431: From these experiences we believe a learning system will have to have a

432: special mode for short heads.

433:

434: The last fixing step is dissecting the source specification along

435: separators (we use commas and colons) and trying to use the part that

436: yields the best match against the authority file

437: as the new head.  This usually removes bibliographic information

438: primarily in references to conference proceedings.  0.9\% of the total resolved

439: references become resolvable after this.  Note that this step would be

440: more important if we had to frequently deal with title removal.

441:

442: Further, less interesting, heuristics are applied to bring references

443: into the format required by the resolver including title removal

444: -- for astronomy references,

445: this is rarely needed --, to reconstruct references that refer to parts of other

446: references, and to split reference lines containing two or more

447: references.  This last task

448: only applies to the rare entries consisting of two separate references

449: listed together by the author.  The resolver makes no attempt to discover

450: errors in line joining that were made earlier in the processing chain.

451:

452: \section{Application}

453:

454: Our dataset from OCR currently contains 3,027,801 references (some

455: $10^4$ of which actually consist of non-reference material

456: misclassified by the reference cutting engine).  Of these, 2,552,229

457: (or about 84\%) could be resolved to records in the database.

458:

459: In order to assess recall and precision of the system described here,

460: we created a subset of 852 references

461: by selecting each reference with a probability of 0.00025, which yielded

462: 118 references that were not resolved and 734 that were resolved.

463: We then manually resolved each selected reference, correcting dangling

464: references as best we could.  Thus, the following numbers compare the

465: resolver's $P(F\under S')$ with a human's $P(F\under S')$.

466:

467: The result was that two of the 734 resolved records were incorrectly

468: resolved.  In both cases, the correct record was not in the ADS, which

469: illustrates that the $F=\emptyset$ problem dominates the issue of

470: precision.

471: Of the non-resolved records, 94 were not in

472: our database, while 23 were, though six of these were marked doubtful

473: by the human resolvers.  Counting doubtful cases as errors, we

474: thus have a precision of more than 99\% and a recall of about 97\%.

475: Of the 17 definite

476: false negatives, 7 are severely dangling or excessively noisy references to journals, while 6 are

477: references to conference proceedings and the rest monographs.
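
The stated precision and recall follow directly from the counts given above, as this short calculation shows; doubtful cases are counted as errors, as in the text.

```python
resolved = 734          # sampled references the resolver resolved
wrongly_resolved = 2    # of these, resolved to the wrong record
missed = 23             # unresolved although present in the database

correct = resolved - wrongly_resolved
precision = correct / resolved
recall = correct / (correct + missed)

print("precision = %.4f, recall = %.4f" % (precision, recall))
```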

478:

479: Note that it is highly unlikely that any of the drawn references were

480: ever inspected during the development of the heuristics.  Still, one

481: might question if evaluating the resolver with data that at least

482: might have been used to ``train'' it is justified.

483: Since during development we mainly inspected resolving

484: failures rather than possibly incorrectly resolved references, we

485: would expect the fact that we did not hold back pristine reference data

486: for evaluation purposes to impact recall more than precision.

487:

488: For journal literature between 1981 and 1998, we also compared

489: the resolver result with data purchased

490: from ISI's Science Citation Index\footnote{See http://www.isinet.com/}.

491: Randomly selecting 1\% of the articles covered by

492: ISI and removing references to sources outside the

493: ISI sample, we had 10832 citing-cited

494: pairs, of which 311 were missing in the OCR sample and 1151 were

495: missing from ISI.

496:

497: A manual examination of the citing-cited pairs missing from the OCR

498: sample revealed that 112

499: were really attributable to the resolver, 107 were due to incorrect

500: reconstructions of reference lines, and 86 references were missed because

501: the references were not found by the reference zone identification.

502:

503: Of the references apparently missing from ISI, 2 were due to

504: resolver errors\footnote{Actually, in one case the OCR conspired to

505: produce an almost valid reference to a wrong paper, in the second

506: case, incorrect line joining resulted in two references that were

507: mangled into a valid one.}, and fewer than 20\% were dangling references that

508: ISI did not correct, but were clearly identifiable nevertheless.

509: We have not attempted to identify why the other (correct) pairs

510: were missing from our ISI sample; most problems probably

511: were introduced during the necessarily conservative matchup between

512: records from ISI and the ADS, and possibly in the selection of our data set

513: from ISI's database.

514:

515: For journal articles (others are, for the most part,

516: not available from ISI), we can thus state a recall of 99\% and a

517: precision of 99.9\% for our resolver and a recall of about 97\% for

518: the complete system.
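
As a plausibility check on the recall figures, and assuming that the 10832 citing-cited pairs serve as the reference set, the 112 pairs lost to the resolver and the 311 pairs missing from the OCR sample as a whole give
\[
1-\frac{112}{10832}\approx 0.990
\qquad\mbox{and}\qquad
1-\frac{311}{10832}\approx 0.971,
\]
in agreement with the values of 99\% and about 97\% stated above.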

519:

520: \section{Discussion}

521:

522: In this paper we contend that robust interpretation of bibliographic

523: references, as

524: required when resolving references obtained by current OCR techniques,

525: should integrate as much information obtainable from a set of known

526: publications as possible even in parsing and not delay incorporating

527: this information to a ``matching'' or linkage phase.

528:

529: Our approach has been inspired by dependency grammars, in which a head

530: of a phrase governs the interpretation of the remaining elements.  For

531: (noisy) references, it is advantageous to use the name or type

532: of the publication as head.

533: The existence of

534: bibliographic identifiers that are for most references easily computable from

535: fielded records has been instrumental for the performance of our

536: system.

537:

538: While we believe some of the rather ad hoc string manipulations and

539: edit distances employed by our current system can and should be

540: substituted by sound and learning algorithms, it seems evident to us

541: that a certain degree of domain-specific knowledge (most notably, a

542: mapping between publication dates and volumes) is very important for

543: robust resolving.

544:

545: The system discussed here has been in continuous use at the ADS for

546: the past four years, for noisy references from OCR as well as for

547: references from digital sources.  The ADS

548: in turn is arguably the most important bibliographic tool in astronomy and

549: astrophysics.  The fact that the ADS has received very few complaints

550: concerning the accuracy of its citations backs the estimates

551: of recall and precision given above.

552:

553:

554: \begin{acknowledgement}

555: We wish to thank Regina Weineck for help in the generation of validation

556: data.

557:

558: The NASA Astrophysics Data System is funded

559: by NASA Grant NCC5-189.

560: \end{acknowledgement}

561:

562: \begin{thebibliography}{03}

563:

564: \bibitem{accomazzi1999}

565: Accomazzi, A., Eichhorn, G., Kurtz, M., Grant, C., and Murray, S.

566: (1999). ``The ADS Bibliographic Reference Resolver.''

567: In   {\em

568:   Astronomical Data Analysis Software and Systems VIII},

569:   R.~L. Plante, and D.~A. Roberts (eds.), Vol. 172 of {\em ASP

570:   Conference Series}, pp.~291--294

571:

572: \bibitem{Bergmark2000}

573: Bergmark, D. (2000).

574: {\em Automatic Extraction of Reference Linking Information from

575:   Online Documents}.

576: Technical Report TR 2000-1821, Computer Science Department, Cornell

577:   University

578:

579: \bibitem{claivaz2001cern}

580: Claivaz, J.-B., Meur, J.-Y.~L., and Robinson, N. (2001).

581: ``From Fulltext Documents to Structured Citations: CERN's Automated

582: Solution,'' {\em HEP Libraries Webzine} 5

583: (http://doc.cern.ch/heplw/5/papers/2/)

584:

585: \bibitem{Demleitner1999}

586: {{Demleitner}, M. and {Accomazzi}, A. and {Eichhorn}, G. and {Grant}, C.~S. and

587:   {Kurtz}, M.~J. and {Murray}, S.~S.} (1999). ``Looking at 3,000,000

588:   References Without Growing Grey Hair,''

589: {\em {Bulletin of the American Astronomical Society}} 31, 1496

590:

591: \bibitem{Grant2000}

592: {Grant}, C.~S., {Accomazzi}, A., {Eichhorn}, G., {Kurtz}, M.~J., and {Murray},

593:   S.~S. (2000). ``The NASA Astrophysics Data System: Data holdings,''

594:  {\em Astronomy and Astrophysics Supplement} {\bf 143}, 111--135

595:

596: \bibitem{heringer1993dependency}

597: Heringer, H.~J. (1993). ``Dependency syntax -- basic ideas and the

598: classical model.''

599: In {\em Syntax - An International Handbook of Contemporary Research, volume 1}, J. Jacobs, A. von Stechow, W. Sternefeld, and T. Venneman (eds.),

600:   Walter de Gruyter, Berlin, New York, pp 298--316.

601:

602: \bibitem{Kurtz2000ADS}

603: {Kurtz}, M.~J., {Eichhorn}, G., {Accomazzi}, A., {Grant}, C.~S., {Murray},

604:   S.~S., and {Watson}, J.~M. (2000). ``The NASA Astrophysics Data

605:   System: Overview,'' {\em Astronomy and Astrophysics Supplement} {\bf

606:   143}, 41--59

607:

608: \bibitem{lawrence99digital}

609: Lawrence, S., Giles, C.~L., and Bollacker, K. (1999), ``Digital

610: Libraries and {Autonomous Citation Indexing},''

611: {\em IEEE Computer} {\bf 32(6)}, 67--71

612:

613: \bibitem{levenshtein}

614: Levenshtein, V.~I. (1966). ``Binary codes capable of correcting

615: deletions, insertions and reversals,''

616: {\em Soviet Physics Doklady} {\bf 10}, 707--710

617:

618: \bibitem{takasu2003erroneous}

619: Takasu, A. (2003). ``Bibliographic attribute extraction from erroneous

620: references based on a statistical model.'' In {\em Proceedings of the third ACM/IEEE-CS joint conference on

621:   Digital libraries}, pp 49--60

622:

623: \bibitem{vandesompel1999linking}

624: van de~Sompel, H.~V. and Hochstenbach, P. (1999). ``Reference Linking

625: in a Hybrid Library Environment,''

626: {\em D-Lib Magazine} 5(4)

627:

628: \end{thebibliography}

629: \end{document}

630: