1: \documentclass[colacl]{article}%LaTeX 2e
2: \usepackage{colacl} % LaTeX 2e
3: \usepackage{times} % LaTeX 2e
4:
5:
6: %-------------------------------------------------------------------------
7: % take the % away on next line to produce the final camera-ready version
8: %\pagestyle{empty}
9: %-------------------------------------------------------------------------
10:
11: \title{An XML-based document suite}
12:
13: \author{Dietmar R\"osner \and Manuela Kunze\\
14: Otto-von-Guericke-Universit\"at Magdeburg\\Institut f\"ur Wissens-
15: und Sprachverarbeitung \\
16: P.O.box 4120, 39016 Magdeburg,
17: Germany\\(roesner,makunze)@iws.cs.uni-magdeburg.de\\}
18:
19: \begin{document}
20: \maketitle
\begin{abstract} We report on the current state of
development of a document suite and its applications. This
collection of tools for the flexible and robust processing of
documents in German is based on the use of XML as a unifying
formalism for encoding input and output data as well as process
information. It is organized in modules with limited
responsibilities that can easily be combined into pipelines to
solve complex tasks. Strong emphasis is placed on a number of
techniques to deal with the lexical and conceptual gaps that are
typical when starting a new application. \end{abstract}
31:
32:
33: %\makeidpage
34:
35: %\type{project paper} \subject{document processing, XML,
36: %resources} \contact{Dietmar R\"osner} \conference{none}
37:
38:
39:
40:
41: \newtheorem{example}{Example }
42: %-------------------------------------------------------------------------
43:
44: \section*{Introduction}
45: We have designed and implemented the XDOC document suite as a
46: workbench for the flexible processing of electronically available
47: documents in German. We have decided to exploit XML
48: \cite{xml-standard} and its accompanying formalisms (e.g. XSLT
49: \cite{XSL}) and tools (e.g. xt \cite{clarkSite} ) as a unifying
50: framework. All modules in the XDOC system expect XML documents as
51: input and deliver their results in XML format.
52:
XML -- and its precursor SGML -- offers a formalism to annotate
pieces of (natural language) texts. To be more precise: If a text
is (as a simple first approximation) seen as a sequence of
characters (alphabetic and \hyphenation{white-space} whitespace
characters) then XML allows arbitrary markup to be associated with
arbitrary subsequences of {\em contiguous} characters. Many
linguistic units of interest are represented by strings of
contiguous characters (e.g. words, phrases, clauses etc.). It is
therefore a straightforward idea to use XML to encode information
about such a substring of a text, interpreted as a meaningful
linguistic unit, and to associate this information directly with
the occurrence of the unit in the text. The basic idea is further
backed by XML's requirement that elements be properly nested. This
is fully concordant with standard linguistic practice: complex
structures are made up from simpler structures covering substrings
of the full string in a nested way.
69:
70: The end users of our applications are domain experts (e.g. medical
71: doctors, engineers, ...). They are interested in getting their
72: problems solved but they are typically neither interested nor
73: trained in computational linguistics. Therefore the barrier to
74: overcome before they can use a computational linguistics or text
75: technology system should be as low as possible.
76:
This experience has consequences for the design of the document
suite. The work in the XDOC project is guided by the following
design principles, which have been abstracted from a number of
experiments and applications with `realistic' documents (among
others emails, abstracts of scientific papers, technical
documentation, ...):
83:
84: \begin{itemize}
85: \item The tools shall be usable for `realistic' documents.
86: \newline
87: One aspect of `realistic' documents is that they typically contain
88: domain-specific tokens that are not directly covered by classical
89: lexical categories (like noun, verb, ...). Those tokens are
90: nevertheless often essential for the user of the document (e.g. an
91: enzyme descriptor like EC 4.1.1.17 for a biochemist).
92: \item The tools shall be as robust as possible.
\\In general it cannot be expected that lexicon information is
available for all tokens in a document. This holds not only
for most tokens of `nonlexical' types -- like telephone numbers,
enzyme names, material codes, ... --; even for lexical types there
will always be `lexical gaps'. These may be caused either by
\hyphenation{neo-logisms} neologisms or simply by starting to
process documents from a new application domain with a new
sublanguage. In the latter case lexical items will typically be
missing in the lexicon (`lexical gap') and phrasal structures may
not be covered, or not adequately covered, by the grammar.
103: \item The tools shall be usable independently but shall allow for
104: flexible combination and interoperability.
\item The tools shall be usable not only by developers but also by
domain experts without linguistic training.
107: \end{itemize}
108:
109:
110: Here again XML and XSLT play a major role: XSL stylesheets can be
111: exploited to allow different presentations of internal data and
112: results for different target groups; for end users the internals
113: are in many cases not helpful, whereas developers will need them
114: for debugging.
115:
116:
117: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
118: \newpage
119: The tools in the XDOC document suite can be grouped according to
120: their function:
121:
122: \begin{itemize}
123: \item preprocessing
124: \item structure detection
125: \item POS tagging
126: \item syntactic parsing
127: \item semantic analysis
128: \item tools for the specific application: e.g. information extraction
129: \end{itemize}
130:
In all tools the results of processing are encoded with XML tags
delimiting the respective piece of text. The information conveyed
by the tag name is enriched with XML attributes and their
respective values.
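
The uniform convention of XML input and XML output is what makes the
modules easy to chain. The following Python fragment is a minimal
sketch of this idea and not the XDOC implementation: the module names
\texttt{tokenize} and \texttt{mark\_numbers}, the \texttt{DOC} and
\texttt{TOKEN} tags and the \texttt{TYPE} attribute are assumptions
made purely for illustration.

```python
import xml.etree.ElementTree as ET


def tokenize(doc):
    """Hypothetical module: wrap each whitespace-separated token in <TOKEN>."""
    text = doc.text or ""
    doc.text = None
    for word in text.split():
        ET.SubElement(doc, "TOKEN").text = word
    return doc


def mark_numbers(doc):
    """Hypothetical module: add a TYPE attribute to purely numeric tokens."""
    for token in doc.iter("TOKEN"):
        if token.text.isdigit():
            token.set("TYPE", "NR")
    return doc


def run_pipeline(xml_string, modules):
    """Parse XML, apply each module in order, and serialize back to XML."""
    doc = ET.fromstring(xml_string)
    for module in modules:
        doc = module(doc)
    return ET.tostring(doc, encoding="unicode")


result = run_pipeline("<DOC>EN 1982</DOC>", [tokenize, mark_numbers])
print(result)
```

Because every module reads and returns the same kind of object, a
module can be dropped from or added to the list without touching the
others.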
135:
136: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
137: \section*{Preprocessing} Tools for preprocessing are used to
138: convert documents from a number of formats into the XML format
139: amenable for further processing. As a subtask this includes
140: treatment of special characters (e.g. for umlauts, apostrophes,
141: ...).
142:
143:
144:
145: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
146: \section*{Structure detection}
147:
We accept raw ASCII texts without any markup as input. In such
cases structure detection tries to uncover linguistic units (e.g.
sentences, titles, ...) as candidates for further analysis. A
major subtask is to identify the role of punctuation characters.

If the structures in a text are explicitly available, this can
be exploited by subsequent linguistic processing. For example, a
unit classified as title or subtitle may consist of just an NP,
whereas within a paragraph full sentences are expected.
157:
In realistic texts even the detection of possible sentence
boundaries needs some care. A period character may not only be
used as a full stop but may also be part of an abbreviation
(e.g. `z.B.' -- engl.: `e.g.' -- or `Dr.'), be contained in a
number (3.14), or be used in an email address or in domain-specific
tokens. The resources employed are special lexica (e.g. for
abbreviations) and finite automata for the reliable detection of
tokens from specialized non-lexical categories (e.g. enzyme names,
material codes, ...).
167:
168: These resources are used here primarily to identify those full
169: stop characters that function as sentence delimiters (tagged as
170: IP). In addition, the information about the function of strings
171: that include a period is tagged in the result (e.g. ABBR).
172:
173: \begin{example} results of structure detection
174: \scriptsize
175: \begin{verbatim}
176: Anwesend<IP>:</IP>
177: <ABBR>Univ.-Prof.</ABBR>
178: <ABBR>Dr.</ABBR><ABBR>med.</ABBR>Dieter Krause<IP>,</IP>
179: Direktor des Institutes fuer Rechtsmedizin
180: \end{verbatim}
181: \end{example}
182: \normalsize
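
The treatment of period characters described above can be sketched in
a few lines of Python. This is only an illustration under simplifying
assumptions: the abbreviation lexicon is a tiny hand-made set, a
single regular expression stands in for the finite automata, and the
function name \texttt{tag\_periods} is hypothetical. Only the tag
names \texttt{ABBR} and \texttt{IP} are taken from the example above.

```python
import re

# Assumed resources: a tiny abbreviation lexicon and a pattern for numbers.
ABBREVIATIONS = {"z.B.", "Dr.", "med.", "Univ.-Prof."}
NUMBER = re.compile(r"^\d+\.\d+$")


def tag_periods(tokens):
    """Mark abbreviations as ABBR and split off a sentence-final period as IP."""
    out = []
    for tok in tokens:
        if tok in ABBREVIATIONS:
            # The period belongs to the abbreviation, not to the sentence.
            out.append(f"<ABBR>{tok}</ABBR>")
        elif NUMBER.match(tok) or not tok.endswith("."):
            # Numbers such as 3.14 and tokens without a final period stay as they are.
            out.append(tok)
        else:
            # Remaining final periods are treated as sentence delimiters.
            out.append(f"{tok[:-1]}<IP>.</IP>")
    return out


print(" ".join(tag_periods("Dr. Krause misst 3.14 cm.".split())))
# <ABBR>Dr.</ABBR> Krause misst 3.14 cm<IP>.</IP>
```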
183:
184: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
185:
186: \section*{POS tagging}
187:
Assigning part-of-speech information to a token is not only
a preparatory step for parsing. The information gained about a
document by POS tagging and evaluating its results is valuable in
its own right. The ratio of tokens not classifiable by the POS
tagger to tokens classified may e.g. serve as an indication of the
degree of lexical coverage.
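
As a small worked illustration of this coverage indicator, the ratio
can be computed as follows. The encoding of the tagger output as a
Python list in which \texttt{None} marks an unclassifiable token, and
the function name \texttt{unclassified\_ratio}, are assumptions made
for this sketch only.

```python
def unclassified_ratio(pos_classes):
    """Ratio of unclassified tokens to classified tokens.

    pos_classes holds one entry per token; None marks a token
    for which the tagger found no class (an assumed encoding).
    """
    unclassified = sum(1 for c in pos_classes if c is None)
    classified = len(pos_classes) - unclassified
    if classified == 0:
        return float("inf")
    return unclassified / classified


print(unclassified_ratio(["N", "V", None, "DETD", "N"]))  # 0.25
```

A value close to zero suggests good lexical coverage; a large value
signals that the lexicon does not yet fit the domain.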
194:
In principle a number of approaches are usable for POS tagging
(e.g. \cite{brill:92}). We decided to avoid approaches based on
(supervised) learning from tagged corpora, since the costs of
creating the necessary training data are likely to be prohibitive
for our users (especially in specialized sublanguages).
200:
201: The approach chosen was to try to make best use of available
202: resources for German and to enhance them with additional
203: functionality. The tool chosen is not only used in POS tagging but
204: serves as a general morpho-syntactic component for German:
205: MORPHIX.
206:
207: The resources employed in XDOC's POS tagger are:
208:
\begin{itemize}
\item the lexicon and the inflectional analysis from the
morphosyntactic component MORPHIX
\item a number of heuristics (e.g. for the classification of tokens
not covered in the lexicon)
\end{itemize}
214:
215:
216: For German the morphology component MORPHIX
217: \cite{finkler.neumann:88} has been developed in a number of
218: projects and is available in different realisations. This
219: component has the advantage that the closed class lexical items of
220: German as well as all irregular verbs are covered. The coverage of
221: open class lexical items is dependent on the amount of lexical
coding. The paradigms for e.g. verb conjugation and noun
declension are fully covered, but to be able to analyze and
generate word forms their roots need to be included in the MORPHIX
lexicon.
226:
We exploit MORPHIX -- in addition to its role in syntactic parsing
-- for POS tagging as well. If a token in a German text can be
morphologically analysed with MORPHIX the resulting word class
categorisation is used as POS information. Note that this
classification need not be unique. Since the tokens are analysed
in isolation, multiple analyses are frequent. Some examples:
233: the token `der' may either be a determiner (with a number of
234: different combinations for the features case, number and gender)
235: or a relative pronoun, the token `liebe' may be either a verb or
236: an adjective (again with different feature combinations not
237: relevant for POS tagging).
238:
In addition, since we do not expect extensive lexicon coding at the
beginning of an XDOC application, some tokens will not get a
MORPHIX analysis. We then employ two techniques: We first try to
make use of heuristics that are based on aspects of the tokens
that can easily be detected with simple string analysis (e.g.
upper-/lowercase, endings, ...) and/or exploitation of the token
position relative to sentence boundaries (detected in the
structure detection module). If a heuristic yields a
classification the resulting POS class is added together with the
name of the employed heuristic (marked as feature SRC, cf. example
3). If no heuristic is applicable we classify the token as a
member of the class unknown (tagged with XXX).
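
The fallback strategy just described can be sketched as follows. The
sketch is not the XDOC tagger: a small dictionary stands in for the
MORPHIX analysis, the tag \texttt{RELPRO} is invented here, and the
reading of the heuristic names \texttt{UNG} (noun ending in
\emph{-ung} or \emph{-ungen}) and \texttt{UC1} (capitalized token that
is not sentence-initial) is our guess based on the SRC values in
example 3; the paper itself does not define them.

```python
# Hypothetical stand-in for a morphological analysis (e.g. by MORPHIX).
LEXICON = {"der": ["DETD", "RELPRO"], "liebe": ["V", "ADJ"]}


def tag_token(token, sentence_initial=False):
    """Return (POS classes, SRC) for one token.

    SRC names the heuristic that produced the class and is None
    when the class comes from the lexicon or no heuristic applies.
    """
    if token in LEXICON:
        # Lexicon hit: possibly several classes, left ambiguous on purpose.
        return LEXICON[token], None
    # Assumed heuristic "UNG": typical noun endings -ung / -ungen.
    if token.endswith(("ung", "ungen")):
        return ["N"], "UNG"
    # Assumed heuristic "UC1": capitalized and not at the sentence start.
    if token[:1].isupper() and not sentence_initial:
        return ["N"], "UC1"
    # No heuristic applicable: class unknown.
    return ["XXX"], None


for word in ["der", "Blutanhaftungen", "Gekroesewurzel", "ungehoeriger"]:
    print(word, tag_token(word))
```

Note that the sketch returns all classes found for an ambiguous token
and makes no attempt to choose between them.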
251:
252: To keep the POS tagger fast and simple the disambiguation between
253: multiple POS classes for a token and the derivation of a possible
254: POS class from context for an unknown token are postponed to
255: syntactic processing. This is in line with our general principle
256: to accept results with overgeneration when a module is applied in
257: isolation (here: POS tagging) and to rely on filtering ambiguous
258: results in a later stage of processing (here: exploiting the
259: syntactic context).
260:
261: \begin{example} domain-specific tagging
262:
263: \scriptsize
264: \begin{verbatim}
265:
266: <PRODUCT Method="Sandguss" Material="CC333G">
267: <N>Gussstueck</N>
268: <NORM>
269: <N>EN</N>
270: <NR>1982</NR>
271: </NORM>
272: <IP>-</IP>
273: <MAT-ID>CC333G</MAT-ID>
274: <IP>-</IP>
275: <METHODE>GS</METHODE>
276: <IP>-</IP>
277: <MODELLNR>XXXX</MODELLNR>
278: </PRODUCT>
279: \end{verbatim}
280: \end{example}
281: \normalsize
282:
The example above is the result of tagging a domain-specific
identifier. The token is annotated as a {\em PRODUCT} with a
description of the method and material used. It is a typical token
in the domain of casting technology.
287: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
288: \section*{Syntactic parsing}
289:
290: For syntactic parsing we apply a chart parser based on context
291: free grammar rules augmented with feature structures.
292:
293: Again robustness is achieved by allowing as input elements:
294: \begin{itemize}
295: \item multiple POS classes,
296: \item unknown classes of open world tokens and
297: \item tokens with POS class, but without or only partial feature
298: information.
299: \end{itemize}
300:
301:
302: \begin{example} unknown token classified as noun with heuristics
303: \scriptsize
304: \begin{verbatim}
305: <NP TYPE="COMPLEX" RULE="NPC3" GEN="FEM"
306: NUM="PL" CAS="_">
307: <NP TYPE="FULL" RULE="NP1" CAS="_"
308: NUM="PL" GEN="FEM">
309: <N SRC="UNG">Blutanhaftungen</N>
310: </NP>
311: <PP CAS="DAT">
312: <PRP CAS="DAT">an</PRP>
313: <NP TYPE="FULL" RULE="NP2" CAS="DAT"
314: NUM="SG" GEN="FEM">
315: <DETD>der</DETD>
316: <N SRC="UC1">Gekroesewurzel</N>
317: </NP>
318: </PP>
319: </NP>
320: \end{verbatim}
321: \end{example}
322: \normalsize
323:
The latter case results from heuristics in POS
tagging that allow us to assume e.g. the class noun for a token but
do not suffice to detect its full paradigm from the token (note
that there are about two dozen different morphosyntactic paradigms
for noun declension in German).
329:
For a given input the parser attempts to find all complete
analyses that cover the input. If no such complete analysis is
achievable, the parser attempts to combine maximal partial results
into structures covering the whole input.
334:
A successful analysis may be based on an assumption about the word
class of an initially unclassified token (tagged XXX). This is
indicated in the parsing result (feature AS) and can be exploited
for learning such classifications from contextual constraints. In
a similar way the successful combination of known feature values
from closed class items (e.g. determiners, prepositions) with
underspecified features in agreement constraints allows the
determination of paradigm information from successfully processed
occurrences. In example 4 features of the unknown word
`Mundhoehle' could be derived from the features of the determiner
within the PP.
346:
347: \begin{example} unknown token classified as adjective
348: and features derived through contextual constraints
349: \scriptsize
350: \begin{verbatim}
351: <NP TYPE="COMPLEX" RULE="NPC3" GEN="MAS" NUM="SG"
352: CAS="NOM">
353: <NP TYPE="FULL" RULE="NP3" CAS="NOM" NUM="SG"
354: GEN="MAS">
355: <DETI>kein</DETI>
356: <XXX AS="ADJ">ungehoeriger</XXX>
357: <N>Inhalt</N>
358: </NP>
359: <PP CAS="DAT">
360: <PRP CAS="DAT">in</PRP>
361: <NP TYPE="FULL" RULE="NP2" CAS="DAT" NUM="SG"
362: GEN="FEM">
363: <DETD>der</DETD>
364: <N SRC="UC1">Mundhoehle</N>
365: </NP>
366: </PP>
</NP>
368: \end{verbatim}
369: \end{example}
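
The derivation of features for `Mundhoehle' in example 4 can be
pictured as a filter over the possible readings of the determiner.
The following Python sketch only illustrates the principle; the
parser itself works with feature structures, and the data layout and
the function name \texttt{derive\_noun\_features} are assumptions. The
case DAT is taken as given from the PP in the example.

```python
# Readings of the determiner "der" as (case, number, gender); "*" = any gender.
DER_READINGS = [
    ("NOM", "SG", "MAS"),
    ("GEN", "SG", "FEM"),
    ("DAT", "SG", "FEM"),
    ("GEN", "PL", "*"),
]


def derive_noun_features(det_readings, required_case):
    """Keep the determiner readings compatible with the case required in the PP."""
    return [r for r in det_readings if r[0] == required_case]


# The noun must agree with the determiner, so the surviving reading
# supplies number and gender for the otherwise underspecified noun.
print(derive_noun_features(DER_READINGS, "DAT"))  # [('DAT', 'SG', 'FEM')]
```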
\normalsize The grammar used in syntactic parsing is organized in
a modular way that allows groups of rules to be added or removed.
This is exploited when the sublanguage of a domain contains
linguistic structures that are unusual or even ungrammatical in
standard German.
375:
376: \begin{example}Excerpt from syntactic analysis
377: \scriptsize
378: \begin{verbatim}
379: <PP CAS="AKK">
380: <PRP CAS="AKK">durch</PRP>
381: <NP TYPE="COMPLEX" RULE="NPC1" GEN="NTR" NUM="SG"
382: CAS="AKK">
383: <NP TYPE="FULL" RULE="NP1" CAS="AKK" NUM="SG"
384: GEN="NTR">
385: <N>Schaffen</N>
386: </NP>
387: <NP TYPE="FULL" RULE="NP2" CAS="GEN" NUM="SG"
388: GEN="MAS">
389: <DETD>des</DETD>
390: <N>Zusammenhalts</N>
391: </NP>
392: </NP>
393: </PP>
394: \end{verbatim}
395: \end{example}
396:
397: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
398: \newpage
399: \section*{Semantic analysis}
400:
401: At the time of writing semantic analysis uses three methods:
402:
403: \subsection*{Semantic tagging}
404:
For semantic tagging we apply a semantic lexicon. This lexicon
contains the semantic interpretation of a token and a case frame
combined with the syntactic valence requirements. Similar to POS
tagging, the tokens are annotated with their meaning and a
classification into semantic categories such as concepts and
relations. Again it is possible that the classification of a
token in isolation is not unique. Multiple classifications can be
resolved through the subsequent analysis of the case frame and
through its combination with the syntactic structure that
includes the token.
415:
416: \subsection*{Analysis of case frames}
417:
The case frame analysis of a token yields details about the
type of the recognized concepts (resolving multiple interpretations)
and possible relations to other concepts. The results are tagged
with XML tags. The following example shows the DTD for the
annotation of the results of case frame analysis.
423:
424: \begin{example} DTD for the annotation by case frame analysis
425: \scriptsize
426: \begin{verbatim}
427: <!ELEMENT CONCEPTS (CONCEPT)*>
428:
429: <!ELEMENT CONCEPT (WORD, DESC, SLOTS?)>
430: <!ATTLIST CONCEPT TYPE CDATA #REQUIRED>
431:
432: <!ELEMENT WORD (#PCDATA)>
433: <!ELEMENT DESC (#PCDATA)>
434: <!ELEMENT SLOTS (RELATION+)>
435:
436: <!ELEMENT RELATION (ASSIGN_TO, FORM, CONTENT)>
437: <!ATTLIST RELATION TYPE CDATA #REQUIRED>
438:
439: <!ELEMENT ASSIGN_TO (#PCDATA)>
440: <!ELEMENT FORM (#PCDATA)>
441: <!ELEMENT CONTENT (#PCDATA)>
442: \end{verbatim}
443: \end{example}
444: \normalsize
445:
446: We use attributes to show the description of the concepts and we
447: can annotate the relevant relations between the concepts through
448: nested tags (e.g. the tag \emph{SLOTS}).
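
To make the DTD more concrete, the following Python sketch builds an
annotation with the element structure it prescribes. It is an
illustration only: the function name \texttt{make\_concept} is
hypothetical, and the value passed for \texttt{ASSIGN\_TO} is a
placeholder, since the paper does not state what this element holds.

```python
import xml.etree.ElementTree as ET


def make_concept(ctype, word, desc, relations=()):
    """Build a CONCEPT element following the DTD: WORD, DESC, optional SLOTS."""
    concept = ET.Element("CONCEPT", TYPE=ctype)
    ET.SubElement(concept, "WORD").text = word
    ET.SubElement(concept, "DESC").text = desc
    if relations:  # SLOTS must contain at least one RELATION
        slots = ET.SubElement(concept, "SLOTS")
        for rtype, assign_to, form, content in relations:
            rel = ET.SubElement(slots, "RELATION", TYPE=rtype)
            ET.SubElement(rel, "ASSIGN_TO").text = assign_to
            ET.SubElement(rel, "FORM").text = form
            ET.SubElement(rel, "CONTENT").text = content
    return concept


concept = make_concept(
    "Prozess", "Fertigen", "Schaffung von etwas",
    # "placeholder" is an assumed value, not taken from the paper.
    [("RESULT", "placeholder", "N(gen, fak) P(akk, fak, von)", "fester Koerper")],
)
print(ET.tostring(concept, encoding="unicode"))
```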
449:
450: \begin{example} Excerpt from case frame analysis
451: \scriptsize
452: \begin{verbatim}
<CONCEPT TYPE="Prozess">
454: <WORD>Fertigen</WORD>
455: <DESC>Schaffung von etwas</DESC>
456: <SLOTS>
457: <RELATION>
458: <RESULT FORM="N(gen, fak) P(akk, fak, von)">
459: fester Koerper</RESULT>
460: <SOURCE FORM="P(dat, fak, aus)">aus formlosem
461: Stoff </SOURCE>
462: <INSTRUMENT FORM="P(akk, fak, durch)">durch
463: Schaffen des Zusammenhalts</INSTRUMENT>
464: </RELATION>
465: </SLOTS>
466: </CONCEPT>
467: \end{verbatim}
468: \end{example}
\normalsize The example above is part of the result of the
analysis of the German phrase: {\em Fertigen fester Koerper aus
formlosem Stoff durch Schaffen des Zusammenhalts}\footnote{In
English: production of solid objects from formless matter by
creating cohesion}. The token {\em Fertigen} is classified as {\em
process} with the relations {\em source, result} and {\em
instrument}. The phrases that follow it (noun phrases and
prepositional phrases) are checked to make sure that they are
assignable to the relation requirements (semantic and syntactic)
of the token {\em Fertigen}.
479:
480: %\begin{figure}[hbt]
481: % \epsfxsize=8cm
482: % \epsffile{casus.eps}
483: % \caption{\label{xsl-result} Presentation of the Semantic Results}
484: %\end{figure}
485:
486:
487: \subsection*{Semantic interpretation of the syntactic
488: structure}
489:
Another step in analyzing the relations between tokens can be the
interpretation of the syntactic structure of a phrase or sentence.
We exploit the syntactic structure of the
sublanguage to extract the relation between several tokens. An
example is a typical phrase from an autopsy report: {\em Leber
dunkelrot.}\footnote{In English: Liver dark red.}
496:
497: From semantic tagging we obtain the following information:
498: \begin{example} results of semantic tagging
499: \scriptsize
500: \begin{verbatim}
501: <CONCEPT TYPE="organ">Leber</CONCEPT>
502: <PROPERTY TYPE="color">dunkelrot</PROPERTY>
503: <XXX>.</XXX>
504: \end{verbatim}
505: \end{example}
506: \normalsize
507:
In this example we can extract the relation `has-color' between
the tokens {\em Leber} and {\em dunkelrot}. This is an example of
a simple semantic relation. Other semantic relations are
expressed by more complex variations. In these cases we must
512: consider linguistic structures like modifiers (e.g. \emph{etwas}),
513: negations (e.g. \emph{nicht}), coordinations (e.g.
514: \emph{Beckengeruest unversehrt und fest gefuegt}) and noun groups
515: (e.g. \emph{Bauchteil der grossen
516: \hyphenation{Koer-per-schlag-ader} Koerperschlagader}).
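
For the simple case, the step from the semantic tags to a relation can
be sketched as below. The sketch assumes that the tagged tokens are
wrapped in a root element (called \texttt{S} here, an invented name)
and that a property directly following a concept describes that
concept; modifiers, negations and coordinations are not handled.

```python
import xml.etree.ElementTree as ET

# Output of semantic tagging, wrapped in an assumed root element <S>.
TAGGED = """<S>
<CONCEPT TYPE="organ">Leber</CONCEPT>
<PROPERTY TYPE="color">dunkelrot</PROPERTY>
<XXX>.</XXX>
</S>"""


def extract_relations(xml_string):
    """Pair each CONCEPT with a directly following PROPERTY."""
    elements = list(ET.fromstring(xml_string))
    relations = []
    for left, right in zip(elements, elements[1:]):
        if left.tag == "CONCEPT" and right.tag == "PROPERTY":
            # The relation name is derived from the TYPE of the property.
            relations.append((left.text, "has-" + right.get("TYPE"), right.text))
    return relations


print(extract_relations(TAGGED))  # [('Leber', 'has-color', 'dunkelrot')]
```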
517:
518:
519: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
520: \section*{Current state and future work}
521:
522: The XDOC document workbench is currently employed in a number of
523: applications. These include:
524:
525: \begin{itemize}
526:
527: \item knowledge acquisition from technical
528: documentation about casting technology
529:
530: \item extraction of company profiles from WWW pages
531:
532: \item analysis of autopsy protocols
533:
534: \end{itemize}
535:
536: The latter application is part of a joint project with the
537: institute for forensic medicine of our university. The medical
538: doctors there are interested in tools that help them to exploit
539: their huge collection of several thousand autopsy protocols for
540: their research interests. The confrontation with this corpus has
541: stimulated experiments with `bootstrapping techniques' for lexicon
542: and ontology creation.
543:
544: The core idea is the following:
545:
When you are confronted with a new corpus from a new domain, try
to find linguistic structures in the text that are easy to detect
automatically and that allow unknown terms to be classified in a
robust manner both syntactically and on the knowledge level. Take
550: the results from a run of these simple but robust heuristics as an
551: initial version of a domain dependent lexicon and ontology.
552: Exploit these initial resources to extend the processing to more
553: complicated linguistic structures in order to detect and classify
554: more terms of interest automatically.
555:
An example: In the sublanguage of autopsy protocols (in German) a
very telegraphic style is dominant. Condensed and compact
structures like the following are very frequent:
559:
560: \begin{quotation}
561: \noindent
562: \emph{Harnblase leer.}\newline \emph{Harnleiter frei.}
563: \newline \emph{Nierenoberflaeche glatt.}
564: \newline \emph{Vorsteherdruese altersentsprechend.} \newline \dots
565: \end{quotation}
566:
567: These structures can be abstracted syntactically as
568: $<$Noun$>$$<$Adjective$>$$<$Fullstop$>$ and semantically as
569: reporting a finding in the form $<$Anatomic-entity$>$ has
570: $<$Attribute-value$>$ and they are easily detectable
571: \cite{roesner02}.
572:
573: In our experiments we have exploited this characteristic of the
574: corpus extensively to automatically deduce an initial lexicon
575: (with nouns and adjectives) and ontology (with concepts for
576: anatomic regions or organs and their respective features and
577: values). The feature values were further exploited to cluster the
578: concept candidates into groups according to their feature values.
In this way container-like entities with feature values like
`leer' (empty) or `gefuellt' (full) can be distinguished from e.g.
entities of surface type with feature values like `glatt'
(smooth).
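
The bootstrapping step can be illustrated with a short Python sketch.
It is a simplification and not the procedure used in our experiments:
the pattern is approximated by a regular expression over capitalized
and lowercase words, the function name \texttt{bootstrap} is
hypothetical, and the last two corpus lines are invented here to show
a shared feature value and a sentence that does not match.

```python
import re
from collections import defaultdict

# <Noun><Adjective><Fullstop>: capitalized word, lowercase word, period.
FINDING = re.compile(r"^([A-Z][a-z]+) ([a-z]+)\.$")


def bootstrap(lines):
    """Collect noun and adjective candidates and group nouns by feature value."""
    nouns, adjectives = set(), set()
    by_value = defaultdict(set)
    for line in lines:
        match = FINDING.match(line.strip())
        if match:
            noun, adjective = match.groups()
            nouns.add(noun)
            adjectives.add(adjective)
            by_value[adjective].add(noun)
    return nouns, adjectives, dict(by_value)


corpus = [
    "Harnblase leer.",
    "Harnleiter frei.",
    "Nierenoberflaeche glatt.",
    "Magen leer.",               # invented line: second container-like entity
    "Die Leber ist dunkelrot.",  # invented line: does not fit the pattern
]
nouns, adjectives, clusters = bootstrap(corpus)
print(sorted(clusters["leer"]))  # ['Harnblase', 'Magen']
```

Entities that share a feature value end up in the same cluster, which
gives a first, coarse grouping of concept candidates.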
583:
584:
585: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
586: \section*{Related Work}
587: The work in XDOC has been inspired by a number of precursory
588: projects:
589:
In GATE \cite{GATESite,GATE} the idea of piping simple modules in
order to achieve complex functionality was applied to NLP
with such a rigid architecture for the first time. The LT XML
project pioneered XML as a data format for linguistic
processing.
595:
Both GATE and LT XML \cite{ltxml99} were employed for processing
English texts. SMES \cite{Neumann97} was an attempt to
develop a toolbox for message extraction from German texts. A
disadvantage of SMES that is avoided in XDOC is the lack of a
uniform encoding formalism; in other words, users are confronted
with different encodings and formats in each module.
602:
603: \section*{System availability}
604:
Major components of XDOC are publicly accessible for testing
and experiments at the URL:
607:
608: {\bf http://lima.cs.uni-magdeburg.de:8000/ }
609:
610:
611: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
612: \section*{Summary}
613:
We have reported on the current state of the XDOC document
suite. This collection of tools for the flexible and robust
processing of documents in German is based on the use of XML as a
unifying formalism for encoding input and output data as well as
process information. It is organized in modules with limited
responsibilities that can easily be combined into pipelines to
solve complex tasks. Strong emphasis is placed on a number of
techniques to deal with lexical and conceptual gaps and to
guarantee robust system behaviour without the need for a priori
investment in resource creation by users. When end users are first
confronted with the system they are typically interested in quick
progress in their application and should not be forced to
engage in, e.g., lexicon building and grammar debugging before
being able to start with experiments. This is not to say that
the creation of specialized lexicons is unnecessary. There is a
strong correlation between prior investment in resources and
improved performance and higher quality of results. Our experience
shows that initial results in experiments are a good motivation for
subsequent efforts by users and investment in extended and
improved linguistic resources, but that a priori costs may
deter users from getting really involved.
635:
636:
637:
638: %\nocite{ex1,ex2}
639: \bibliographystyle{acl}
640: \bibliography{coling}
641:
642: \end{document}
643: