
1: \documentclass[colacl]{article}%LaTeX 2e

2: \usepackage{colacl}                              % LaTeX 2e

3: \usepackage{times}                               % LaTeX 2e

4:

5:

6: %-------------------------------------------------------------------------

7: % take the % away on next line to produce the final camera-ready version

8: %\pagestyle{empty}

9: %-------------------------------------------------------------------------

10:

11: \title{An XML-based document suite}

12:

13: \author{Dietmar R\"osner \and Manuela Kunze\\

14: Otto-von-Guericke-Universit\"at Magdeburg\\Institut f\"ur Wissens-

15: und Sprachverarbeitung \\

16: P.O.box 4120, 39016 Magdeburg,

17: Germany\\(roesner,makunze)@iws.cs.uni-magdeburg.de\\}

18:

19: \begin{document}

20: \maketitle

21: \begin{abstract} We report on the current state of

22: development of a document suite and its applications.  This

23: collection of tools for the flexible and robust processing of

24: documents in German is based on the use of XML as a unifying

25: formalism for encoding input and output data as well as process

26: information. It is organized in modules with limited

27: responsibilities that can easily be combined into pipelines to

28: solve complex tasks. Strong emphasis is laid on a number of

29: techniques to deal with lexical and conceptual gaps that are

30: typical when starting a new application. \end{abstract}

31:

32:

33: %\makeidpage

34:

35: %\type{project paper} \subject{document processing, XML,

36: %resources} \contact{Dietmar R\"osner} \conference{none}

37:

38:

39:

40:

41: \newtheorem{example}{Example }

42: %-------------------------------------------------------------------------

43:

44: \section*{Introduction}

45: We have designed and implemented the XDOC document suite as a

46: workbench for the flexible processing of electronically available

47: documents in German. We have decided to exploit XML

48: \cite{xml-standard} and its accompanying formalisms (e.g. XSLT

49: \cite{XSL}) and tools (e.g. xt \cite{clarkSite} ) as a unifying

50: framework. All modules in the XDOC system expect XML documents as

51: input and deliver their results in XML format.

52:

53: XML -- and its precursor SGML -- offers a formalism to annotate

54: pieces of (natural language) texts. To be more precise: If a text

55: is (as a simple first approximation) seen as a sequence of

56: characters (alphabetic and \hyphenation{white-space} whitespace

57: characters) then XML allows arbitrary markup to be associated with

58: arbitrary subsequences of {\em contiguous} characters.  Many

59: linguistic units of interest are represented by strings of

60: contiguous characters (e.g. words, phrases, clauses etc.). To use

61: XML to encode information about such a substring of a text

62: interpreted as a meaningful linguistic unit and to associate this

63: information directly with the occurrence of the unit in the text

64: is a straightforward idea. The basic idea is further backed by

65: XML's requirement that elements be properly nested. This is

66: fully concordant with standard linguistic practice: complex

67: structures are made up from simpler structures covering substrings

68: of the full string in a nested way.

69:

70: The end users of our applications are domain experts (e.g. medical

71: doctors, engineers, ...). They are interested in getting their

72: problems solved but they are typically neither interested nor

73: trained in computational linguistics. Therefore the barrier to

74: overcome before they can use a computational linguistics or text

75: technology system should be as low as possible.

76:

77: This experience has consequences for the design of the document

78: suite. The work in the XDOC project is guided by the following

79: design principles that have been abstracted from a number of

80: experiments and applications with `realistic' documents (among others,

81: emails, abstracts of scientific papers, technical documentation,

82: ...):

83:

84: \begin{itemize}

85:   \item The tools shall be usable for `realistic' documents.

86:   \newline

87:   One aspect of `realistic' documents is that they typically contain

88: domain-specific tokens that are not directly covered by classical

89: lexical categories (like noun, verb, ...). Those tokens are

90: nevertheless often essential for the user of the document (e.g. an

91: enzyme descriptor like EC 4.1.1.17 for a biochemist).

92:   \item The tools shall be as robust as possible.

93: \\In general it cannot be expected that lexicon information is

94: available for all tokens in a document. This is not only the case

95: for most tokens from `nonlexical' types -- like telephone numbers,

96: enzyme names, material codes, ... --, even for lexical types there

97: will always be `lexical gaps'. This may either be caused by

98: \hyphenation{neo-logisms} neologisms or simply by starting to

99: process documents from a new application domain with a new

100: sublanguage. In the latter case lexical items will typically be

101: missing in the lexicon (`lexical gap') and phrasal structures may

102: not be covered by the grammar, or only inadequately.

103:   \item The tools shall be usable independently but shall allow for

104: flexible combination and interoperability.

105:   \item The tools shall be usable not only by developers but also by

106: domain experts without linguistic training.

107: \end{itemize}

108:

109:

110: Here again XML and XSLT play a major role: XSL stylesheets can be

111: exploited to allow different presentations of internal data and

112: results for different target groups; for end users the internals

113: are in many cases not helpful, whereas developers will need them

114: for debugging.
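
A stylesheet for the end-user view can be very small. The following is a
minimal sketch in XSLT 1.0, not the stylesheet used in XDOC: it assumes only
that the annotations are ordinary XML elements, drops every tag and keeps the
annotated text.

\scriptsize
\begin{verbatim}
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- wrap the document in a minimal HTML page -->
  <xsl:template match="/">
    <html><body><p>
      <xsl:apply-templates/>
    </p></body></html>
  </xsl:template>

  <!-- any annotation element: drop the tag,
       keep the text it delimits -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
\end{verbatim}
\normalsize

A developer view could instead copy all elements and attributes through
unchanged, so that rule names and feature values stay visible.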

115:

116:

117: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

118: \newpage

119: The tools in the XDOC document suite can be grouped according to

120: their function:

121:

122: \begin{itemize}

123:   \item preprocessing

124:   \item structure detection

125:   \item POS tagging

126:   \item syntactic parsing

127:   \item semantic analysis

128:   \item tools for the specific application: e.g. information extraction

129: \end{itemize}

130:

131: In all tools the result of processing is encoded with XML tags

132: delimiting the respective piece of text. The information conveyed

133: by the tag name is enriched with XML attributes and their respective

134: values.

135:

136: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

137: \section*{Preprocessing} Tools for preprocessing are used to

138: convert documents from a number of formats into the XML format

139: amenable for further processing. As a subtask this includes

140: treatment of special characters (e.g. for umlauts, apostrophes,

141: ...).

142:

143:

144:

145: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

146: \section*{Structure detection}

147:

148: We accept raw ASCII texts without any markup as input. In such

149: cases structure detection tries to uncover linguistic units (e.g.

150: sentences, titles, ...) as candidates for further analysis. A

151: major subtask is to identify the role of punctuation characters.

152:

153: If we have the structures in a text explicitly available this may

154: be exploited by subsequent linguistic processing. An example: For

155: a unit classified as title or subtitle you will accept an NP

156: whereas within a paragraph you will expect full sentences.

157:

158: In realistic texts even the detection of possible sentence

159: boundaries needs some care. A period character may not only be

160: used as a full stop but may as well be part of an abbreviation

161: (e.g. `z.B.' -- engl.: `e.g.' -- or `Dr.'), be contained in a

162: number (3.14), be used in an email address or in domain specific

163: tokens. The resources employed are special lexica (e.g. for

164: abbreviations) and finite automata for the reliable detection of

165: tokens from specialized non-lexical categories (e.g. enzyme names,

166: material codes, ...).

167:

168: These resources are used here primarily to identify those full

169: stop characters that function as sentence delimiters (tagged as

170: IP). In addition, the information about the function of strings

171: that include a period is tagged in the result (e.g. ABBR).

172:

173: \begin{example} results of structure detection

174: \scriptsize

175: \begin{verbatim}

176: Anwesend<IP>:</IP>

177: <ABBR>Univ.-Prof.</ABBR>

178: <ABBR>Dr.</ABBR><ABBR>med.</ABBR>Dieter Krause<IP>,</IP>

179: Direktor des Institutes fuer Rechtsmedizin

180: \end{verbatim}

181: \end{example}

182: \normalsize

183:

184: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

185:

186: \section*{POS tagging}

187:

188: To try to assign part-of-speech information to a token is not only

189: a preparatory step for parsing. The information gained about a

190: document by POS tagging and evaluating its results is valuable in

191: its own right. The ratio of tokens not classifiable by the POS

192: tagger to tokens classified may e.g. serve as an indication of the

193: degree of lexical coverage.
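
This coverage indicator can be computed directly on the tagged document. The
sketch below is in XSLT 1.0 and rests on two assumptions made here for
illustration: every token is an element without element children, and
unclassifiable tokens carry the tag XXX. Punctuation tagged as IP therefore
counts as classified.

\scriptsize
\begin{verbatim}
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- tokens the tagger could not classify -->
    <xsl:variable name="unknown"
      select="count(//XXX)"/>
    <!-- all other leaf elements (assumption:
         leaf element = classified token) -->
    <xsl:variable name="known"
      select="count(//*[not(*)][not(self::XXX)])"/>
    <xsl:value-of select="$unknown"/>
    <xsl:text> unclassified, </xsl:text>
    <xsl:value-of select="$known"/>
    <xsl:text> classified, ratio </xsl:text>
    <xsl:value-of select="$unknown div $known"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
\end{verbatim}
\normalsize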

194:

195: In principle a number of approaches are usable for POS tagging

196: (e.g. \cite{brill:92}). We decided to avoid approaches based on

197: (supervised) learning from tagged corpora, since the cost of

198: creating the necessary training data is likely to be prohibitive

199: for our users (especially in specialized sublanguages).

200:

201: The approach chosen was to try to make best use of available

202: resources for German and to enhance them with additional

203: functionality. The tool chosen is not only used in POS tagging but

204: serves as a general morpho-syntactic component for German:

205: MORPHIX.

206:

207: The resources employed in XDOC's POS tagger are:

208:

209: \begin{itemize} \item the lexicon and the inflectional analysis from the

210: morphosyntactic component MORPHIX

211:

212: \item a number of heuristics (e.g. for the classification of tokens not

213: covered in the lexicon) \end{itemize}

214:

215:

216: For German the morphology component MORPHIX

217: \cite{finkler.neumann:88} has been developed in a number of

218: projects and is available in different realisations. This

219: component has the advantage that the closed class lexical items of

220: German as well as all irregular verbs are covered. The coverage of

221: open class lexical items is dependent on the amount of lexical

222: coding. The paradigms for e.g. verb conjugation and noun

223: declension are fully covered but to be able to analyze and

224: generate word forms their roots need to be included in the MORPHIX

225: lexicon.

226:

227: We exploit MORPHIX -- in addition to its role in syntactic parsing

228: -- for POS tagging as well. If a token in a German text can be

229: morphologically analysed with MORPHIX the resulting word class

230: categorisation is used as POS information.  Note that this

231: classification need not be unique. Since the tokens are analysed

232: in isolation multiple analyses are often the case. Some examples:

233: the token `der' may either be a determiner (with a number of

234: different combinations for the features case, number and gender)

235: or a relative pronoun, the token `liebe' may be either a verb or

236: an adjective (again with different feature combinations not

237: relevant for POS tagging).

238:

239: In addition, since we do not expect extensive lexicon coding at the

240: beginning of an XDOC application some tokens will not get a

241: MORPHIX analysis. We then employ two techniques: We first try to

242: make use of heuristics that are based on aspects of the tokens

243: that can easily be detected with simple string analysis (e.g.

244: upper-/lowercase, endings, ...) and/or exploitation of the token

245: position relative to sentence boundaries (detected in the

246: structure detection module). If a heuristic yields a

247: classification the resulting POS class is added together with the

248: name of the employed heuristic (marked as feature SRC, cf. example

249: 3). If no heuristics are applicable we classify the token as

250: member of the class unknown (tagged with XXX).

251:

252: To keep the POS tagger fast and simple the disambiguation between

253: multiple POS classes for a token and the derivation of a possible

254: POS class from context for an unknown token are postponed to

255: syntactic processing. This is in line with our general principle

256: to accept results with overgeneration when a module is applied in

257: isolation (here: POS tagging) and to rely on filtering ambiguous

258: results in a later stage of processing (here: exploiting the

259: syntactic context).

260:

261: \begin{example} domain-specific tagging

262:

263: \scriptsize

264: \begin{verbatim}

265:

266: <PRODUCT Method="Sandguss" Material="CC333G">

267:     <N>Gussstueck</N>

268:     <NORM>

269:          <N>EN</N>

270:          <NR>1982</NR>

271:     </NORM>

272:     <IP>-</IP>

273:     <MAT-ID>CC333G</MAT-ID>

274:     <IP>-</IP>

275:     <METHODE>GS</METHODE>

276:     <IP>-</IP>

277:     <MODELLNR>XXXX</MODELLNR>

278: </PRODUCT>

279: \end{verbatim}

280: \end{example}

281: \normalsize

282:

283: The example above is the result of tagging a domain-specific

284: identifier. The token is annotated as a {\em PRODUCT} with

285: description of the used method and material. It is a typical token

286: in the domain of casting technology.

287: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

288: \section*{Syntactic parsing}

289:

290: For syntactic parsing we apply a chart parser based on context

291: free grammar rules augmented with feature structures.

292:

293: Again robustness is achieved by allowing as input elements:

294: \begin{itemize}

295:   \item multiple POS classes,

296:   \item unknown classes of open world tokens and

297:   \item tokens with POS class, but without or only partial feature

298: information.

299: \end{itemize}

300:

301:

302: \begin{example} unknown token classified as noun with heuristics

303: \scriptsize

304: \begin{verbatim}

305: <NP TYPE="COMPLEX" RULE="NPC3" GEN="FEM"

306:         NUM="PL" CAS="_">

307:   <NP TYPE="FULL" RULE="NP1" CAS="_"

308:             NUM="PL" GEN="FEM">

309:        <N SRC="UNG">Blutanhaftungen</N>

310:   </NP>

311:   <PP CAS="DAT">

312:     <PRP CAS="DAT">an</PRP>

313:     <NP TYPE="FULL" RULE="NP2" CAS="DAT"

314:             NUM="SG" GEN="FEM">

315:       <DETD>der</DETD>

316:       <N SRC="UC1">Gekroesewurzel</N>

317:     </NP>

318:   </PP>

319: </NP>

320: \end{verbatim}

321: \end{example}

322: \normalsize

323:

324: The latter case results from some heuristics in POS

325: tagging that allow us to assume e.g. the class noun for a token but

326: do not suffice to detect its full paradigm from the token (note

327: that there are about two dozen different morphosyntactic paradigms

328: for noun declension in German).

329:

330: For a given input the parser attempts to find all complete

331: analyses that cover the input. If no such complete analysis is

332: achievable the parser attempts to combine maximal partial results into

333: structures covering the whole input.

334:

335: A successful analysis may be based on an assumption about the word

336: class of an initially unclassified token (tagged XXX). This is

337: indicated in the parsing result (feature AS) and can be exploited

338: for learning such classifications from contextual constraints.  In

339: a similar way the successful combination of known feature values

340: from closed class items (e.g. determiners, prepositions) with

341: underspecified features in agreement constraints allows the

342: determination of paradigm information from successfully processed

343: occurrences. In example 4 features of the unknown word

344: `Mundhoehle' could be derived from the features of the determiner

345: within the PP.

346:

347: \begin{example} unknown token classified as adjective

348: and features derived through contextual constraints

349: \scriptsize

350: \begin{verbatim}

351: <NP TYPE="COMPLEX" RULE="NPC3" GEN="MAS" NUM="SG"

352:     CAS="NOM">

353:   <NP TYPE="FULL" RULE="NP3" CAS="NOM" NUM="SG"

354:         GEN="MAS">

355:     <DETI>kein</DETI>

356:     <XXX AS="ADJ">ungehoeriger</XXX>

357:     <N>Inhalt</N>

358:   </NP>

359:   <PP CAS="DAT">

360:     <PRP CAS="DAT">in</PRP>

361:     <NP TYPE="FULL" RULE="NP2" CAS="DAT" NUM="SG"

362:         GEN="FEM">

363:       <DETD>der</DETD>

364:       <N SRC="UC1">Mundhoehle</N>

365:     </NP>

366:   </PP>

367: </NP>

368: \end{verbatim}

369: \end{example}
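
Collecting such classifications from parse results can be sketched as a
stylesheet. The following XSLT 1.0 fragment is an illustration only, not the
learning procedure of XDOC; the output elements LEXICON-CANDIDATES and ENTRY
are hypothetical names, and the agreement features of a heuristically
classified noun are simply read off the enclosing NP.

\scriptsize
\begin{verbatim}
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <LEXICON-CANDIDATES>
      <xsl:apply-templates
        select="//XXX[@AS] | //N[@SRC]"/>
    </LEXICON-CANDIDATES>
  </xsl:template>

  <!-- unknown token: word class assumed
       by the parser (feature AS) -->
  <xsl:template match="XXX[@AS]">
    <ENTRY POS="{@AS}">
      <xsl:value-of select="."/>
    </ENTRY>
  </xsl:template>

  <!-- noun classified by a heuristic: take
       the features of the parent element -->
  <xsl:template match="N[@SRC]">
    <ENTRY POS="N" GEN="{../@GEN}"
           NUM="{../@NUM}" CAS="{../@CAS}">
      <xsl:value-of select="."/>
    </ENTRY>
  </xsl:template>
</xsl:stylesheet>
\end{verbatim}
\normalsize

Applied to a parse result that contains the token `ungehoeriger' tagged as
XXX with AS="ADJ" and the noun `Mundhoehle' inside an NP with GEN="FEM",
NUM="SG" and CAS="DAT", this should yield one adjective entry and one noun
entry with those feature values. Underspecified values such as CAS="\_" are
copied as they are.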

370: \normalsize The grammar used in syntactic parsing is organized in

371: a modular way that allows groups of rules to be added or removed. This

372: is exploited when the sublanguage of a domain contains linguistic

373: structures that are unusual or even ungrammatical in standard

374: German.

375:

376: \begin{example}Excerpt from syntactic analysis

377: \scriptsize

378: \begin{verbatim}

379: <PP CAS="AKK">

380:   <PRP CAS="AKK">durch</PRP>

381:     <NP TYPE="COMPLEX" RULE="NPC1" GEN="NTR" NUM="SG"

382:                                  CAS="AKK">

383:       <NP TYPE="FULL" RULE="NP1" CAS="AKK" NUM="SG"

384:                                  GEN="NTR">

385:         <N>Schaffen</N>

386:       </NP>

387:       <NP TYPE="FULL" RULE="NP2" CAS="GEN" NUM="SG"

388:                                   GEN="MAS">

389:          <DETD>des</DETD>

390:          <N>Zusammenhalts</N>

391:       </NP>

392:     </NP>

393: </PP>

394: \end{verbatim}

395: \end{example}

396:

397: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

398: \newpage

399: \section*{Semantic analysis}

400:

401: At the time of writing semantic analysis uses three methods:

402:

403: \subsection*{Semantic tagging}

404:

405: For semantic tagging we apply a semantic lexicon. This lexicon

406: contains the semantic interpretation of a token and a case frame

407: combined with the syntactic valence requirements. Similar to POS

408: tagging the tokens are annotated with their meaning and a

409: classification in semantic categories like e.g. concepts and

410: relations. Again it is possible that the classification of a

411: token in isolation is not unique. Multiple classifications can be

412: resolved through the following analysis of the case frame and

413: through its combination with the syntactic structure which

414: includes the token.

415:

416: \subsection*{Analysis of case frames}

417:

418: By the case frame analysis of a token we obtain details about the

419: type of recognized concepts (resolving multiple interpretations)

420: and possible relations to other concepts. The results are tagged

421: with XML tags. The following example describes the DTD for the

422: annotation of the results of case frame analysis.

423:

424: \begin{example} DTD for the annotation by case frame analysis

425: \scriptsize

426: \begin{verbatim}

427:     <!ELEMENT CONCEPTS (CONCEPT)*>

428:

429:     <!ELEMENT CONCEPT (WORD, DESC, SLOTS?)>

430:     <!ATTLIST CONCEPT TYPE CDATA #REQUIRED>

431:

432:     <!ELEMENT WORD (#PCDATA)>

433:     <!ELEMENT DESC (#PCDATA)>

434:     <!ELEMENT SLOTS (RELATION+)>

435:

436:     <!ELEMENT RELATION (ASSIGN_TO, FORM, CONTENT)>

437:     <!ATTLIST RELATION TYPE CDATA #REQUIRED>

438:

439:     <!ELEMENT ASSIGN_TO (#PCDATA)>

440:     <!ELEMENT FORM (#PCDATA)>

441:     <!ELEMENT CONTENT (#PCDATA)>

442: \end{verbatim}

443: \end{example}

444: \normalsize

445:

446: We use attributes to show the description of the concepts and we

447: can annotate the relevant relations between the concepts through

448: nested tags (e.g. the tag \emph{SLOTS}).

449:

450: \begin{example} Excerpt from case frame analysis

451: \scriptsize

452: \begin{verbatim}

453:  <CONCEPT TYPE="Prozess">

454:     <WORD>Fertigen</WORD>

455:     <DESC>Schaffung von etwas</DESC>

456:     <SLOTS>

457:         <RELATION>

458:         <RESULT FORM="N(gen, fak) P(akk, fak, von)">

459:                 fester Koerper</RESULT>

460:         <SOURCE FORM="P(dat, fak, aus)">aus formlosem

461:                 Stoff </SOURCE>

462:         <INSTRUMENT FORM="P(akk, fak, durch)">durch

463:                 Schaffen des Zusammenhalts</INSTRUMENT>

464:         </RELATION>

465:     </SLOTS>

466: </CONCEPT>

467: \end{verbatim}

468: \end{example}

469: \normalsize The example above is part of the result of the

470: analysis of the German phrase: {\em Fertigen fester Koerper aus

471: formlosem Stoff durch Schaffen des Zusammenhalts}\footnote{In

472: English: production of solid objects from formless matter by

473: creating cohesion}. The token {\em Fertigen} is classified as {\em

474: process} with the relations {\em source, result} and {\em

475: instrument}. The following phrases (noun phrases and prepositional

476: phrases) are checked to make sure that they are assignable to the

477: relation requirements (semantic and syntactic) of the token {\em

478: Fertigen}.
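
The excerpt above uses the relation names as element names. Read strictly
against the DTD given earlier, the same content could be written as in the
following sketch. The values chosen for the TYPE attribute of RELATION and
the content of ASSIGN\_TO (here: the token to which the filler is attached)
are assumptions made for illustration; the DTD itself does not fix their
interpretation.

\scriptsize
\begin{verbatim}
<CONCEPTS>
  <CONCEPT TYPE="Prozess">
    <WORD>Fertigen</WORD>
    <DESC>Schaffung von etwas</DESC>
    <SLOTS>
      <RELATION TYPE="RESULT">
        <ASSIGN_TO>Fertigen</ASSIGN_TO>
        <FORM>N(gen, fak) P(akk, fak, von)</FORM>
        <CONTENT>fester Koerper</CONTENT>
      </RELATION>
      <RELATION TYPE="SOURCE">
        <ASSIGN_TO>Fertigen</ASSIGN_TO>
        <FORM>P(dat, fak, aus)</FORM>
        <CONTENT>aus formlosem Stoff</CONTENT>
      </RELATION>
      <RELATION TYPE="INSTRUMENT">
        <ASSIGN_TO>Fertigen</ASSIGN_TO>
        <FORM>P(akk, fak, durch)</FORM>
        <CONTENT>durch Schaffen des
          Zusammenhalts</CONTENT>
      </RELATION>
    </SLOTS>
  </CONCEPT>
</CONCEPTS>
\end{verbatim}
\normalsize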

479:

480: %\begin{figure}[hbt]

481:  %   \epsfxsize=8cm

482:   %  \epsffile{casus.eps}

483:   %  \caption{\label{xsl-result} Presentation of the Semantic Results}

484: %\end{figure}

485:

486:

487: \subsection*{Semantic interpretation of the syntactic

488: structure}

489:

490: Another step to analyze the relations between tokens can be the

491: interpretation of the syntactic structure of a phrase or

492: sentence. We exploit the syntactic structure of the

493: sublanguage to extract the relation between several tokens. For

494: example a typical phrase from an autopsy report: {\em Leber

495: dunkelrot.}\footnote{In English: Liver dark red.}

496:

497: From semantic tagging we obtain the following information:

498: \begin{example} results of semantic tagging

499: \scriptsize

500: \begin{verbatim}

501: <CONCEPT TYPE="organ">Leber</CONCEPT>

502: <PROPERTY TYPE="color">dunkelrot</PROPERTY>

503: <XXX>.</XXX>

504: \end{verbatim}

505: \end{example}

506: \normalsize

507:

508: In this example we can extract the relation `has-color' between

509: the tokens {\em Leber} and {\em dunkelrot}. This is an example of

510: a simple semantic relation. Other semantic relations can be

511: described through more complex variations. In these cases we must

512: consider linguistic structures like modifiers (e.g. \emph{etwas}),

513: negations (e.g. \emph{nicht}), coordinations (e.g.

514: \emph{Beckengeruest unversehrt und fest gefuegt}) and noun groups

515: (e.g. \emph{Bauchteil der grossen

516: \hyphenation{Koer-per-schlag-ader} Koerperschlagader}).
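
For the simple case, the step from semantic tags to a relation can be
sketched in XSLT 1.0. The fragment below is an illustration under stated
assumptions, not the XDOC implementation: the tagged tokens are taken to be
siblings under some common parent element, and the output elements RELATIONS
and REL are hypothetical names. It covers only a concept that is directly
followed by a property, and none of the more complex variations listed above.

\scriptsize
\begin{verbatim}
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <RELATIONS>
      <xsl:apply-templates select="//CONCEPT"/>
    </RELATIONS>
  </xsl:template>

  <!-- CONCEPT directly followed by PROPERTY:
       relation name built from property type -->
  <xsl:template match="CONCEPT">
    <xsl:variable name="p" select=
      "following-sibling::*[1][self::PROPERTY]"/>
    <xsl:if test="$p">
      <REL NAME="has-{$p/@TYPE}"
           FROM="{.}" TO="{$p}"/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
\end{verbatim}
\normalsize

For the tagged phrase \emph{Leber dunkelrot.} this should yield a single REL
element with NAME="has-color", FROM="Leber" and TO="dunkelrot".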

517:

518:

519: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

520: \section*{Current state and future work}

521:

522: The XDOC document workbench is currently employed in a number of

523: applications. These include:

524:

525: \begin{itemize}

526:

527: \item knowledge acquisition from technical

528: documentation about casting technology

529:

530: \item extraction of company profiles from WWW pages

531:

532: \item analysis of autopsy protocols

533:

534: \end{itemize}

535:

536: The latter application is part of a joint project with the

537: institute for forensic medicine of our university. The medical

538: doctors there are interested in tools that help them to exploit

539: their huge collection of several thousand autopsy protocols for

540: their research interests. The confrontation with this corpus has

541: stimulated experiments with `bootstrapping techniques' for lexicon

542: and ontology creation.

543:

544: The core idea is the following:

545:

546: When you are confronted with a new corpus from a new domain, try

547: to find linguistic structures in the text that are easy to detect

548: automatically and that allow unknown terms to be classified in a robust

549: manner both syntactically and on the knowledge level. Take

550: the results from a run of these simple but robust heuristics as an

551: initial version of a domain dependent lexicon and ontology.

552: Exploit these initial resources to extend the processing to more

553: complicated linguistic structures in order to detect and classify

554: more terms of interest automatically.

555:

556: An example: In the sublanguage of autopsy protocols (in German) a

557: very telegraphic style is dominant. Condensed and compact

558: structures like the following are very frequent:

559:

560: \begin{quotation}

561: \noindent

562: \emph{Harnblase leer.}\newline \emph{Harnleiter frei.}

563: \newline \emph{Nierenoberflaeche glatt.}

564: \newline \emph{Vorsteherdruese altersentsprechend.} \newline \dots

565: \end{quotation}

566:

567: These structures can be abstracted syntactically as

568: $<$Noun$>$$<$Adjective$>$$<$Fullstop$>$ and semantically as

569: reporting a finding in the form $<$Anatomic-entity$>$ has

570: $<$Attribute-value$>$ and they are easily detectable

571: \cite{roesner02}.
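
Detecting this pattern on POS-tagged text can be sketched as follows. The
XSLT 1.0 fragment is an illustration only: it assumes that all tokens are
tagged sibling elements, that adjectives carry the tag ADJ, and that the full
stop is tagged as IP; the output elements FINDINGS and FINDING are
hypothetical names.

\scriptsize
\begin{verbatim}
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <FINDINGS>
      <xsl:apply-templates select="//N"/>
    </FINDINGS>
  </xsl:template>

  <!-- noun at the start of a unit, followed by
       an adjective and a full stop -->
  <xsl:template match="N">
    <xsl:variable name="prev"
      select="preceding-sibling::*[1]"/>
    <xsl:variable name="adj" select=
      "following-sibling::*[1][self::ADJ]"/>
    <xsl:variable name="stop" select=
      "following-sibling::*[2][self::IP][.='.']"/>
    <xsl:if test="(not($prev) or $prev[self::IP])
                  and $adj and $stop">
      <FINDING ENTITY="{.}" VALUE="{$adj}"/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
\end{verbatim}
\normalsize

Each FINDING pairs an anatomic entity candidate with an attribute value
candidate and can serve as raw material for the initial lexicon and ontology.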

572:

573: In our experiments we have exploited this characteristic of the

574: corpus extensively to automatically deduce an initial lexicon

575: (with nouns and adjectives) and ontology (with concepts for

576: anatomic regions or organs and their respective features and

577: values). The feature values were further exploited to cluster the

578: concept candidates into groups according to their feature values.

579: In this way container-like entities with feature values like

580: `leer' (empty) or `gefuellt' (full) can be distinguished from e.g.

581: entities of surface type with feature values like `glatt'

582: (smooth).

583:

584:

585: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

586: \section*{Related Work}

587: The work in XDOC has been inspired by a number of precursory

588: projects:

589:

590: In GATE \cite{GATESite,GATE} the idea of piping simple modules in

591: order to achieve complex functionality was applied to NLP

592: for the first time with such a rigid architecture. The LT XML

593: project pioneered XML as a data format for linguistic

594: processing.

595:

596: Both GATE and LT XML \cite{ltxml99} were employed for processing

597: English texts. SMES \cite{Neumann97} was an attempt to

598: develop a toolbox for message extraction from German texts. A

599: disadvantage of SMES that is avoided in XDOC is the lack of a

600: uniform encoding formalism, in other words, users are confronted

601: with different encodings and formats in each module.

602:

603: \section*{System availability}

604:

605: Major components of XDOC are made publicly accessible for testing

606: and experiments under the URL:

607:

608:  {\bf http://lima.cs.uni-magdeburg.de:8000/ }

609:

610:

611: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

612: \section*{Summary}

613:

614: We have reported on the current state of the XDOC document

615: suite. This collection of tools for the flexible and robust

616: processing of documents in German is based on the use of XML as

617: a unifying formalism for encoding input and output data as well as

618: process information. It is organized in modules with limited

619: responsibilities that can easily be combined into pipelines to

620: solve complex tasks. Strong emphasis is laid on a number of

621: techniques to deal with lexical and conceptual gaps and to

622: guarantee robust system behaviour without the need for a priori

623: investment in resource creation by users. When end users are first

624: confronted with the system they typically are interested in quick

625: progress in their application but should not be forced to be

626: engaged e.g. in lexicon build-up and grammar debugging, before

627: being able to start with experiments. This is not to say that

628: creation of specialized lexicons is unnecessary. There is a strong

629: correlation between prior investment in resources and improved

630: performance and higher quality of results. Our experience shows

631: that initial results in experiments are a good motivation for

632: subsequent efforts of users and investment in extended and

633: improved linguistic resources but that a priori costs may

634: block the willingness of users to get really involved.

635:

636:

637:

638: %\nocite{ex1,ex2}

639: \bibliographystyle{acl}

640: \bibliography{coling}

641:

642: \end{document}

643: