cs0105030/olac.tex
1: \documentclass[11pt]{article}
2: \usepackage{acl2001,times,hyphen}
3: \setlength\titlebox{6.5cm}
4: 
5: \usepackage{times,epsfig,boxedminipage,url}
6: 
7: \title{The OLAC Metadata Set and Controlled Vocabularies}
8: \author{Steven Bird \\
9:   Linguistic Data Consortium \\
10:   University of Pennsylvania \\
11:   3615 Market Street, Suite 200 \\ 
12:   Philadelphia, PA 19104-2608, USA \\ 
13:   {\tt sb@ldc.upenn.edu} \And
14: Gary Simons \\
15:   SIL International \\
16:   7500 West Camp Wisdom Road \\
17:   Dallas, TX 75236, USA \\
18:   {\tt Gary\_Simons@sil.org}}
19: 
20: \date{}
21: 
22: \def\myurl#1{{[\small\url{#1}]}}
23: 
24: \def\elt#1{{\small\sf #1}}
25: \def\attr#1{{\small\sf #1}}
26: \def\code#1{{\small\sf #1}}
27: 
28: \begin{document}
29: \maketitle
30: 
31: \begin{abstract}
32: As language data and associated technologies proliferate and
33: as the language resources community rapidly expands,
34: it has become difficult to locate and reuse existing
35: resources.  Are there any lexical resources for such-and-such a language?
36: What tool can work with transcripts in this particular
37: format?  What is a good format to use for linguistic data of this type?
38: Questions like these dominate many mailing lists, since web search engines are
39: an unreliable way to find language resources.
40: This paper describes a new digital infrastructure for language resource
41: discovery, based on the Open Archives Initiative, and called
42: OLAC -- the Open Language Archives Community.
43: The OLAC Metadata Set and the associated controlled
44: vocabularies facilitate consistent description and focussed searching.
45: We report progress on the metadata set and controlled vocabularies, describing
46: current issues and soliciting input from the language
47: resources community.
48: \end{abstract}
49: 
50: \section{Introduction}
51: 
52: Language technology and the linguistic sciences are
53: confronted with a vast array of \emph{language resources},
54: richly structured, large and diverse.
55: Multiple \emph{communities} depend on language resources, including
56: linguists, engineers, teachers and actual speakers.
57: Many individuals and institutions provide key pieces of the infrastructure,
58: including archivists, software developers, and publishers.
59: Today we have unprecedented opportunities to \emph{connect}
60: these communities to the language resources they need.
61: First, inexpensive mass storage technology permits large resources to
62: be stored in digital form, while
63: the Extensible Markup Language (XML) and Unicode provide flexible
64: ways to represent structured data and ensure its long-term survival.
65: Second, digital publication -- both on and off the world wide web --
66: is the most practical and efficient means of sharing language resources.
67: Finally, a standard resource description model, the Dublin Core Metadata
68: Set, together with an interchange method provided by the Open Archives
69: Initiative (OAI), makes it possible to construct a union catalog over multiple
70: repositories and archives.
71: 
72: In December 2000, an NSF-funded workshop on Web-Based Language
73: Documentation and Description, held in Philadelphia, brought together a
74: group of nearly 100 language software developers, linguists, and archivists
75: who are responsible for creating language resources in North America, South
76: America, Europe, Africa, the Middle East, Asia and Australia
77: \myurl{http://www.ldc.upenn.edu/exploration/expl2000/}.
78: The outcome of the workshop was the founding of the
79: Open Language Archives Community (OLAC),
80: an application of the OAI to digital archives of
81: language resources, with the following purpose:
82: 
83: \begin{quote}
84: OLAC, the Open Language Archives Community, is an international partnership
85: of institutions and individuals who are creating a worldwide virtual
86: library of language resources by: (i)~developing consensus on best current
87: practice for the digital archiving of language resources, and
88: (ii)~developing a network of interoperating repositories and services for
89: housing and accessing such resources.
90: \end{quote}
91: 
92: This paper will describe the leading ideas that motivate OLAC, before
93: focussing on the metadata set and the controlled vocabularies which
94: implement part (ii) of OLAC's statement of purpose.
95: Metadata elements of special interest to the language resources community
96: include such things as language identification
97: and language resource type.  The corresponding controlled vocabularies
98: ensure consistent description.  For example, French language resources
99: are specified using an official RFC-3066 designation \cite{Alvestrand01},
100: instead of multiple
101: distinct text strings like ``French'', ``Francais'' and ``Fran\c{c}ais''.
102: A separate controlled vocabulary exists for resource type, and has
103: items such as \code{annotation/phonetic} and \code{description/grammar}.
104: Services for end-users can map controlled vocabularies onto
105: convenient terminology for any target language.
106: (A live demonstration accompanies this presentation.)
107: 
108: \section{Locating Data, Tools and Advice}
109: 
110: We can observe that the
111: individuals who use and create language resources
112: are looking for three things: data, tools, and advice.
113: By DATA we mean any information that documents or describes a language,
114: such as a published monograph, a computer data file, or
115: even a shoebox full of hand-written index cards. The information could range
116: in content from unanalyzed sound recordings to fully transcribed and annotated
117: texts to a complete descriptive grammar. 
118: By TOOLS we mean computational resources that facilitate creating, viewing,
119: querying, or otherwise using language data. Tools include not just software
120: programs, but also the digital resources that the programs depend on, such as
121: fonts, stylesheets, and document type definitions.
122: By ADVICE we mean any information about
123: what data sources are reliable, what tools are appropriate in a given
124: situation, what practices to follow when creating new data, and so forth.
125: In the context of OLAC, the term \emph{language resource} is broadly
126: construed to include all three of these: data, tools and advice.
127: 
128: \begin{figure}
129: \centerline{\includegraphics[width=\linewidth]{vision2.ps}}
130: \caption{In reality the user can't always get there from here}
131: \label{fig:vision2}
132: \end{figure}
133: %
134: Unfortunately, today's user does not have ready access to the resources
135: that are needed. Figure~\ref{fig:vision2}
136: offers a diagrammatic view of the reality.
137: Some archives (e.g. Archive 1) do have a site on the internet which the user is
138: able to find, so the resources of that archive are accessible. Other archives
139: (e.g. Archive 2) are on the internet, so the user could access them in theory,
140: but the user has no idea they exist, so they are not accessible in practice.
141: Still other archives (e.g. Archive 3) are not even on the internet. And there
142: are potentially hundreds of archives (e.g. Archive $n$) that the user
143: needs to know about. Tools and advice are out there as well, but are scattered
144: across many different sites.
145: 
146: There are many other problems inherent
147: in the current situation. For instance, the user may not be able to find all
148: the existing data about the language of interest because different sites have
149: called it by different names (low \emph{recall}).
150: The user may be swamped with irrelevant resources because search terms
151: have important meanings in other domains (low \emph{precision}).
152: The user may not be able to use an accessible
153: data file for lack of being able to match it with the right tools. The user may
154: locate advice that seems relevant but have no basis for judging its merits.
155: 
156: \subsection{Bridging the gap}
157: 
158: \subsubsection{Why improved web-indexing is not enough}
159: 
160: As the internet grows and web-indexing technologies improve, one might hope
161: that a general-purpose search engine would be sufficient to bridge the gap
162: between people and the resources they need, but this is a vain hope.
163: The first reason is that many language resources, such as audio files
164: and software, are not text-based.  The second
165: reason concerns language identification, the single most important
166: property for describing language resources.  If a language has a canonical name
167: which is distinctive as a character string, then the user has a chance of
168: finding any online resources with a search engine.
169: However, the language may have
170: multiple names, possibly due to the vagaries of Romanization, such as a
171: language known variously as Fadicca, Fadicha, Fedija, Fadija, Fiadidja,
172: Fiyadikkya, and Fedicca (giving low recall).
173: The language name may collide with a word which has
174: other interpretations that are vastly more frequent, e.g.\ the language
175: names Mango and Santa Cruz (giving low precision).
176: 
177: The third reason why general-purpose search engines are inadequate is
178: the simple fact that much of the material is not,
179: and will not be, documented in free prose on the web.
180: Either people will build systematic catalogues of their resources,
181: or they won't do it at all.
182: Of course, one can always export a back-end database
183: as HTML and let the search engines index the materials.
184: Indeed, encouraging people to document resources and make them
185: accessible to search engines is part of our vision.
186: However, despite the power of web search engines, there remain many
187: instances where people still prefer to use more formal databases to
188: house their data.
189: 
190: This last point bears further consideration.  The challenge is to
191: build a system for ``bringing like things together and differentiating among
192: them'' \cite{Svenonius00}.
193: There are two dominant storage
194: and indexing paradigms, one exemplified by traditional databases and one
195: exemplified by the web.  In the case of language resources, the metadata is
196: coherent enough to be stored in a formal database, but sufficiently
197: distributed and dynamic that it is impractical to maintain it centrally.
198: Language resources occupy the middle ground between the two paradigms, neither of which
199: will serve adequately.  A new framework is required that permits the best of
200: both worlds, namely bottom-up, distributed initiatives, along with consistent,
201: centralized finding aids.  The Dublin Core (DC) and the
202: Open Archives Initiative provide the framework we need to ``bridge the gap.''
203: 
204: \subsubsection{The Dublin Core Metadata Initiative}
205: 
206: The Dublin Core Metadata Initiative began in 1995 to develop
207: conventions for resource discovery on the web
208: \myurl{dublincore.org}.
209: The Dublin Core metadata elements represent a broad, interdisciplinary
210: consensus about
211: the core set of elements that are likely to be widely useful to support
212: resource discovery.  The Dublin Core consists of 15 metadata elements,
213: where each element is optional and repeatable: \elt{Title, Creator, Subject,
214: Description, Publisher, Contributor, Date, Type, Format, Identifier, Source,
215: Language, Relation, Coverage, Rights}.
216: This set can be used to describe resources that
217: exist in digital or traditional formats.
218: 
219: In ``Dublin Core Qualifiers'' \cite{DCQ00}
220: two kinds of qualifications are allowed: encoding schemes and refinements. An
221: {\it encoding scheme} specifies a particular controlled vocabulary or notation
222: for expressing the value of an element. The encoding scheme serves to aid a
223: client system in interpreting the exact meaning of the element content. A
224: {\it refinement} makes the meaning of the element more specific.
225: For example,
226: a \elt{Language} element can be {\it encoded}
227: using the conventions of RFC 3066 to unambiguously identify the language
228: in which the resource is written (or spoken).
229: A \elt{Subject} element can be given a language {\it refinement}
230: to restrict its interpretation to concern the language the resource is about.
231: 
232: \subsubsection{The Open Archives Initiative}
233: 
234: The Open Archives Initiative (OAI)
235: was launched in October 1999 to provide a common framework across
236: electronic preprint archives, and it has since been broadened
237: to include digital repositories of scholarly materials regardless
238: of their type
239: \myurl{www.openarchives.org} \cite{LagozeVandeSompel01}.
240: 
241: \begin{figure}
242: \centerline{\includegraphics[width=\linewidth]{white-paper1}}
243: \caption{Bridging the gap through community infrastructure}
244: \label{fig:white-paper1}
245: \end{figure}
246: %
247: In the OAI infrastructure, each participating archive implements a
248: repository -- a network-accessible server offering public access
249: to archive holdings. The primary object in an OAI-conformant
250: repository is called an {\it item}, having a unique identifier
251: and being associated with one or more metadata records.
252: Each metadata record describes an archive holding, which is any
253: kind of primary resource such as a document, raw data, software, a
254: recording, a physical artifact, a digital surrogate, and so forth.
255: Each metadata record will usually contain a reference to an entry
256: point for the holding, such as a URL or a physical location,
257: as shown in Figure~\ref{fig:white-paper1}.
258: 
259: To implement the OAI infrastructure, a participating archive must comply
260: with two standards: the {\it OAI shared metadata set} (Dublin Core), which
261: facilitates interoperability across all repositories participating in the
262: OAI, and the {\it OAI metadata harvesting protocol}, which allows
263: software services to query a repository using HTTP requests.
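
As a rough illustration of such an HTTP request, the following Python
sketch assembles a harvesting URL.  The repository address is
hypothetical, and the parameter names \verb|verb| and
\verb|metadataPrefix|, the verb \verb|ListRecords| and the prefix
\verb|oai_dc| are assumptions based on the published OAI protocol
documents; they should be checked against the protocol version
actually deployed.

{\small
\begin{verbatim}
from urllib.parse import urlencode

def harvest_url(base_url, verb, **params):
    """Return a GET URL for one harvesting request."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"

# Hypothetical repository address, for illustration only.
BASE = "http://archive.example.org/oai"
url = harvest_url(BASE, "ListRecords",
                  metadataPrefix="oai_dc")
print(url)
\end{verbatim}
}

A service provider would issue requests of this kind to each data
provider in turn and store the metadata records returned.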
264: 
265: OAI archives are called ``data providers,'' though they are strictly just
266: {\it metadata} providers. Typically, data providers will also have a
267: submission procedure, together with a long-term storage system, and a
268: mechanism permitting users to obtain materials from the archive. An OAI
269: ``service provider'' is a third party that provides end-user services (such
270: as search functions over union catalogs) based on metadata harvested from
271: one or
272: more OAI data providers.  Figure~\ref{fig:white-paper2}
273: illustrates a
274: single service provider accessing three data providers
275: (using the OAI metadata harvesting protocol).
276: End-users only interact with service providers.
277: 
278: Over the past decade, the Linguist List has become the primary
279: source of online information for
280: the linguistics community, reaching out to over 13,000
281: subscribers worldwide, and having four complete mirror sites.
282: The Linguist List will be augmenting its service by hosting the
283: primary service provider for OLAC, and permitting end-users to browse
284: distributed language resources at a single place.
285: 
286: \begin{figure}
287: \centerline{\includegraphics[width=\linewidth]{white-paper2.ps}}
288: \caption{A Service Provider Accessing Multiple Data Providers}
289: \label{fig:white-paper2}
290: \end{figure}
291: 
292: \subsection{Applying the OAI to language resources}
293: 
294: The OAI infrastructure is a new invention;
295: it has the bottom-up, distributed character of the web,
296: while simultaneously having the efficient, structured
297: nature of a centralized database.  This combination is well-suited to
298: the language resource community, where the available data is growing
299: rapidly and where a large user-base is fairly consistent in how it describes
300: its resource needs.
301: 
302: The primary outcome of the Philadelphia
303: workshop was the founding of the Open Language
304: Archives Community, and with it the identification of an advisory board, alpha
305: testers and member archives.  Details of these groups are available from
306: the OLAC site \myurl{www.language-archives.org}.
307: 
308: Recall that the OAI community is defined by the archives that
309: comply with the OAI metadata harvesting protocol
310: and register with the OAI.
311: Any compliant repository can register as an Open Archive, and
312: the metadata provided by an Open Archive is open to the public.
313: OAI data providers may support metadata standards in addition to the
314: Dublin Core.  Thus, a specialist community can define a metadata format which is
315: specific to its domain.  Service providers, data providers and users that
316: employ this specialized metadata format constitute an OAI \emph{subcommunity}.
317: The workshop participants agreed unanimously that the
318: OAI provides a significant piece of the infrastructure
319: needed for the language resources community.
320: 
321: In the same way that OLAC represents a specialized
322: subcommunity with respect to the entire Open Archives community, there are
323: specialized subcommunities within the scope of OLAC.  For
324: instance, the ISLE Meta Data Initiative is developing a detailed metadata
325: scheme for corpora of recorded speech events and their associated descriptions
326: \cite{IMDI00}.
327: Similarly, the language data centers -- the Linguistic Data Consortium (LDC)
328: and the European Language Resources Association (ELRA) -- are using OLAC
329: metadata as the basis of a joint catalog, and will add elements and
330: vocabularies for their specialized needs (price, rights, and categories
331: of membership and use).
332: For archived language resources that are of this kind, such a metadata scheme would
333: support a richer description.  This specialized subcommunity can implement its own
334: service provider that offers focussed searching based on its own rich metadata
335: set.  At the same time, the data providers will be exposing OLAC and
336: Dublin Core versions of the metadata, permitting the resources to be
337: discovered by users of OLAC and OAI service providers.
338: 
339: \subsection{Federation and integration of language resource archives}
340: 
341: \begin{figure*}[tbhp]
342: \begin{center}
343: {\normalsize
344: \framebox{
345: \begin{tabular}{lp{0.68\textwidth}}
346: \multicolumn{2}{l}{\normalsize\bf oai:ldc:LDC94T5} \\
347: Date: 
348: & 1994\\
349: Title: 
350: & ECI Multilingual Text\\
351: Type:
352: & text\\
353: Identifier: 
354: & 1-58563-033-3\\
355: Subject.language:
356: & Albanian, {\bf Bulgarian}, Chinese, Czech, Dutch,
357:   English, Estonian, French, Gaelic, German, Greek,
358:   Italian, Japanese, Latin, Lithuanian, Malay,
359:   Spanish, Danish, Uzbek, Norwegian, Portuguese,
360:   Russian, Serbian, Swedish, Turkish, Tibetan \\
361: Identifier:
362: & http://www.ldc.upenn.edu/Catalog/LDC94T5.html \\
363: Description: 
364: & Recommended Applications:
365: information retrieval, machine translation, language modeling\\[1ex]
366: 
367: \multicolumn{2}{l}{\normalsize\bf oai:elra:L0030} \\
368: Title:
369: & Bulgarian Morphological Dictionary \\
370: Date: 
371: & 1998 \\
372: Subject.language:
373: & {\bf Bulgarian} \\
374: Description: 
375: & 67,500 entries divided into 242 inflectional types
376: (including proper nouns), morphosyntactic information for each
377: entry, and a morphological engine (MS DOS and WINDOWS 95/NT) for
378: morphological analysis and generation \\
379: Identifier:
380: & http://www.icp.inpg.fr/ELRA/cata/text\_det.html\#bulmodic \\[1ex]
381: 
382: \multicolumn{2}{l}{\normalsize\bf oai:dfki:KPML}\\
383: Title:
384: & KPML \\
385: Creator: 
386: & Bateman and many others \\
387: Subject.language: 
388: & Spanish, Russian, Japanese, Greek, German, French, English, Czech, {\bf Bulgarian}\\
389: Format.os:
390: & Windows NT, Windows 98, Windows 95/98, Solaris \\
391: Type.functionality: 
392: & Software: Annotation Tools, Grammars, Lexica, Development Tools,
393:   Formalisms, Theories, Deep Generation, Morphological Generation,
394:   Shallow Generation \\
395: Description: 
396: & Natural Language Generation Linguistic Resource Development and
397: Maintenance workbench for large scale generation grammar development,
398: teaching, and experimental generation. Based on systemic-functional
399: linguistics. Descendent of the Penman NLG system. \\
400: Identifier:
401: & http://www.purl.org/net/kpml \\
402: Description: 
403: & Contact: bateman@uni-bremen.de \\
404: Relation.requires:
405: & Windows: none; Solaris: CommonLisp + CLIM
406: \end{tabular}}}
407: \end{center}
408: \caption{Querying the Prototype Service Provider for Bulgarian Resources}
409: \label{fig:sp}
410: \end{figure*}
411: 
412: The OAI framework permits archives to interoperate.  OAI archives support
413: the Dublin Core metadata format and metadata harvesting protocol.  OLAC
414: archives additionally support the OLAC metadata format.  Widespread
415: adoption of these standards will permit language resource archives to
416: be federated and integrated.
417: 
418: First, a collection of archives which support the same metadata format can be
419: federated, in the sense that a virtual meta-archive can collect all the
420: information into a single place, and end-users can query multiple archives
421: simultaneously.  To demonstrate this,
422: the Linguistic Data Consortium has harvested the catalogs of
423: three language resource
424: archives (LDC, ELRA, DFKI) and created a prototype service provider.
425: A search for \attr{language=Bulgarian} returns records from all three archives,
426: as shown in Figure~\ref{fig:sp} \cite{BanikBird01}.
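
The idea of federation can be sketched in a few lines of Python.
This is only a toy model, not the prototype service provider itself:
the records are abbreviated from Figure~\ref{fig:sp}, and the
dictionary keys and function names are invented for the illustration.

{\small
\begin{verbatim}
# Toy stand-ins for three harvested catalogs; the
# keys are illustrative, not OLAC element names.
HARVESTED = {
    "ldc": [{"id": "oai:ldc:LDC94T5",
             "languages": ["Albanian", "Bulgarian",
                           "Chinese"]}],
    "elra": [{"id": "oai:elra:L0030",
              "languages": ["Bulgarian"]}],
    "dfki": [{"id": "oai:dfki:KPML",
              "languages": ["Spanish", "Russian",
                            "Bulgarian"]}],
}

def federate(harvested):
    """Pool all archives into one union catalog."""
    return [rec for recs in harvested.values()
            for rec in recs]

def search(catalog, language):
    """Identifiers of records about a language."""
    return [rec["id"] for rec in catalog
            if language in rec["languages"]]

catalog = federate(HARVESTED)
hits = search(catalog, "Bulgarian")
print(hits)
\end{verbatim}
}

A single query over the pooled catalog returns one record from each
of the three archives.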
427: 
428: Second, a collection of archives which support the same metadata format can be
429: integrated, in the sense that relational joins can be performed
430: across different archives.  This permits queries such as:
431: ``find all lexicon tools that understand a format for which Hungarian
432: data is available.''
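
A minimal sketch of such a join, again in Python, is given below.
The two record lists stand for metadata harvested from two different
archives; all identifiers, format names and dictionary keys are
hypothetical and chosen only to make the example self-contained.

{\small
\begin{verbatim}
# Hypothetical data records from one archive ...
DATA = [
    {"id": "oai:a:hun-lex",
     "language": "Hungarian", "format": "format-x"},
    {"id": "oai:a:fra-lex",
     "language": "French", "format": "format-y"},
]
# ... and hypothetical tool records from another.
TOOLS = [
    {"id": "oai:b:lexedit", "kind": "lexicon tool",
     "reads": ["format-x", "format-z"]},
    {"id": "oai:b:lexview", "kind": "lexicon tool",
     "reads": ["format-y"]},
]

def tools_for(data, tools, language, kind):
    """Join tools to data on the shared format."""
    formats = {d["format"] for d in data
               if d["language"] == language}
    return [t["id"] for t in tools
            if t["kind"] == kind
            and formats & set(t["reads"])]

matches = tools_for(DATA, TOOLS,
                    "Hungarian", "lexicon tool")
print(matches)
\end{verbatim}
}

The join succeeds only because both archives describe formats with
the same vocabulary, which is the motivation for the controlled
vocabularies discussed later.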
433: 
434: \section{A Core Metadata Set for Language Resources}
435: \label{sec:metadata}
436: 
437: The OLAC Metadata Set extends the Dublin Core set only to
438: the minimum degree required to express basic properties
439: of language resources which are useful as finding aids.
440: 
441: All fifteen Dublin Core elements are used in the OLAC Metadata Set. In
442: order to suit the specific needs of the language resources community, the
443: elements have been qualified following principles articulated in
444: ``Dublin Core Qualifiers'' \cite{DCQ00}
445: and exemplified in \cite{DCQHTML00}.
446: 
447: This section describes some of
448: the attributes, elements and controlled vocabularies of
449: the OLAC Metadata Set.  Before launching into this discussion, we first
450: review some XML terminology and explain some aspects of the OLAC
451: representation which follow directly from our choice of XML.
452: 
453: \subsection{Aside: XML representation}
454: 
455: The Extensible Markup Language (XML) is the universal format for structured
456: documents and data on the Web \myurl{www.w3.org/XML}.
457: The key building block of an XML document is the \emph{element}.
458: An element has a \emph{name}, \emph{attributes} and \emph{content}.
459: Here is an example of an element \elt{Language} with attributes
460: \attr{refine} and \attr{code}, and free-text content:
461: 
462: {\small
463: \begin{verbatim}
464: <Language refine="OLAC" code="x-sil-BAN">
465:   Foreke Dschang</Language>
466: \end{verbatim}
467: }
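
The three parts of the element shown above can be recovered with any
XML parser.  As an illustration only, the following sketch uses the
ElementTree module from the Python standard library; the variable
names are our own.

{\small
\begin{verbatim}
import xml.etree.ElementTree as ET

xml_text = (
    '<Language refine="OLAC" code="x-sil-BAN">\n'
    '  Foreke Dschang</Language>'
)
elem = ET.fromstring(xml_text)

name = elem.tag               # element name
attributes = dict(elem.attrib)  # attribute values
content = elem.text.strip()   # free-text content

print(name, attributes, content)
\end{verbatim}
}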
468: 
469: In general, XML elements may contain other elements, or they may be empty.
470: XML Document Type Definitions (DTDs) and XML schemas are grammars that
471: define the structure of a valid XML document,
472: and they limit the arrangement of XML elements in a
473: document.  We believe it is important to use a formal mechanism for validating
474: a metadata record.  Following the OAI, we use XML schemas to specify the OLAC
475: metadata format.
476: 
477: XML schemas make it possible for element content and attribute values
478: to be constrained according to the element name.  However, XML schemas do not
479: permit element content to be constrained on the basis of the attribute value.
480: Accordingly, in implementing qualified Dublin Core using XML,
481: we are limited to using
482: one encoding scheme (or controlled vocabulary) per element.
483: 
484: There are two cases we need to consider here.  In the case where all
485: refinements of an element employ the same encoding scheme, we use the element
486: name as is and add a \attr{refine} attribute with a fixed value.  This
487: documents that the particular encoding scheme has been used, and ensures that
488: the element cannot be confused with a corresponding unqualified
489: Dublin Core element (see the above example).
490: In the case where different refinements of an element employ different encoding
491: schemes, then a unique element must be defined.  Following
492: \cite{DCQHTML00}, we define such elements by concatenating the
493: Dublin Core element name
494: and the refinement name with an intervening dot.  An example is shown below:
495: 
496: {\small
497: \begin{verbatim}
498: <Format.encoding code="iso-8859-1"/>
499: \end{verbatim}
500: }
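
The naming convention for this second case is simple enough to state
as code.  The sketch below, using Python's standard ElementTree
module, rebuilds the example element from its parts; the helper
\verb|qualified_name| is hypothetical and not part of any OLAC
software.

{\small
\begin{verbatim}
import xml.etree.ElementTree as ET

def qualified_name(dc_element, refinement):
    """Join element and refinement with a dot."""
    return f"{dc_element}.{refinement}"

# Rebuild the example element from its parts.
tag = qualified_name("Format", "encoding")
elem = ET.Element(tag, {"code": "iso-8859-1"})
print(ET.tostring(elem, encoding="unicode"))
\end{verbatim}
}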
501: 
502: \subsection{Attributes used in implementing the OLAC Metadata Set}
503: 
504: Three attributes -- \attr{refine}, \attr{code}, and \attr{lang} -- are used
505: throughout the metadata set to handle most qualifications to Dublin Core. Some
506: elements in the OLAC Metadata Set use the \attr{refine} attribute to identify
507: element refinements. These qualifiers make the meaning of an element narrower
508: or more specific. A refined element shares the meaning of the unqualified
509: element, but with a more restricted scope \cite{DCQ00}.
510: 
511: Some elements in the OLAC Metadata Set use the \attr{code} attribute to
512: hold metadata values that are taken from a specific encoding scheme. When an
513: element may take this attribute, the attribute value specifies a precise value
514: for the element taken from a controlled vocabulary or formal notation
515: (\S\ref{sec:cv}).
516: In such cases, the element content may also be used
517: to specify a freeform elaboration of the coded value.
518: 
519: Every element in the OLAC Metadata Set may use the \attr{lang} attribute.
520: It specifies the language in which the text in the content of the element is
521: written. The value for the attribute comes from the controlled vocabulary
522: OLAC-Language.
523: By default, the \attr{lang} attribute has
524: the value ``en'', for English. Whenever the language of the element content is
525: other than English, the \attr{lang} attribute should be used to identify the
526: language. By using multiple instances of the metadata elements tagged for
527: different languages, data providers may offer their metadata records in
528: multiple languages.
529: 
530: In addition, there is a \attr{lang} attribute on the \verb|<olac>|
531: element that contains the metadata elements for a given metadata record. It
532: lists the languages in which the metadata record is designed to be read. This
533: attribute holds a space-delimited list of language codes.
534: By default, this attribute has
535: the value ``en'', for English, indicating that the record is aimed only at
536: English readers. If an explicit value is given for the attribute, then the
537: record is aimed at readers of all the languages listed.
538: 
539: Service providers should use this information in order to offer
540: multilingual views of the metadata. When a metadata record lists only one
541: alternative language, then all elements are displayed (regardless of their
542: individual languages), unless the user has requested to suppress all records in
543: that language. When a metadata record has multiple alternative languages, the
544: user should be able to select one and have display of elements in the other
545: languages suppressed. An element in a language not included in the list of
546: alternatives should always be displayed (for instance, the vernacular title of
547: a work).
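
One possible reading of these display rules is sketched below in
Python.  The function name, its parameters and the representation of
elements as (name, language, text) triples are assumptions made for
the illustration, and the sample titles are invented; a real service
provider may organize this logic differently.

{\small
\begin{verbatim}
def visible_elements(record_langs, elements,
                     chosen, suppressed=()):
    """Select the elements of one record to show.

    record_langs: languages listed on the record
    elements:     (name, lang, text) triples
    chosen:       language selected by the user
    suppressed:   languages the user wants hidden
    """
    if len(record_langs) == 1:
        # One language: show everything, unless the
        # user hides records in that language.
        if record_langs[0] in suppressed:
            return []
        return list(elements)
    # Several languages: keep the chosen one, plus
    # elements in languages outside the list.
    return [e for e in elements
            if e[1] == chosen
            or e[1] not in record_langs]

record = [
    ("Title", "en", "A hypothetical title"),
    ("Title", "fr", "Un titre fictif"),
    ("Title", "x-sil-BAN", "(vernacular title)"),
]
shown = visible_elements(["en", "fr"], record,
                         chosen="fr")
print(shown)
\end{verbatim}
}

With French selected, the English title is suppressed, while the
vernacular title is kept because its language is not among the
listed alternatives.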
548: 
549: \subsection{The elements of the OLAC Metadata Set}
550: 
551: In this section we present a synopsis of the elements of the OLAC metadata
552: set.  For each element, we provide a one-sentence definition followed by a
553: brief discussion, systematically borrowing and adapting the definitions
554: provided by the Dublin Core Metadata Initiative \cite{DCMES99}.  Each element
555: is optional and repeatable.
556: 
557: \begin{description}\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}
558: \item[\elt{Contributor}:]
559: {\bf An entity responsible for making contributions to the content
560: of the resource.}
561: Examples of a Contributor include a person, an organization, or a
562: service.
563: The \attr{refine} attribute is optionally used to specify the role
564: played by the named entity in
565: the creation of the resource, using the controlled vocabulary OLAC-Role.
566:       
567: \item[\elt{Coverage}:]
568: {\bf The extent or scope of the content of the resource.}
569: Coverage will typically include spatial location or temporal period.
570: Where the geographical information is predictable from the language identification,
571: it is not necessary to specify geographic coverage.
572: 
573: \item[\elt{Creator}:]
574: {\bf An entity primarily responsible for making the content of the resource.}
575: The \attr{refine} attribute is optionally used to specify the role
576: played by the named entity in
577: the creation of the resource, using the controlled vocabulary OLAC-Role.
578: 
579: \item[\elt{Date}:]
580: {\bf A date associated with an event in the life cycle of the resource.}
581: The \attr{refine} attribute is optionally used to refine the meaning
582: of the date using values from a controlled vocabulary (for instance, date of
583: creation versus date of issue versus date of modification, and so on). The
584: vocabulary for refinements to Date is defined in \cite{DCQ00}.
585: 
586: \item[\elt{Description}:]
587: {\bf An account of the content of the resource.}
588: Description may include but is not limited to: an abstract, table of
589: contents, reference to a graphical representation of content, or a free-text
590: account of the content.
591: 
592: \item[\elt{Format}:]
593: {\bf The physical or digital manifestation of the resource.}
594: Typically, \elt{Format} may include the media-type or dimensions of the
595: resource. \elt{Format} may be used to determine the software, hardware or other
596: equipment needed to use the resource.
597: The \attr{code} attribute identifies
598: the format using the controlled vocabulary OLAC-Format.
599: 
600: \item[\elt{Format.cpu}:]
601: {\bf The CPU required to use a software resource.}
602: The \attr{code} attribute identifies the CPU using the
603: controlled vocabulary OLAC-CPU.
604: 
605: \item[\elt{Format.encoding}:]
606: {\bf An encoded character set used by a digital resource.}
607: For a digitally encoded text, \elt{Format.encoding} names
608: the encoded character set it uses. For a font,
609: \elt{Format.encoding} names an encoded character set that it is able to render. For a
610: software application, \elt{Format.encoding} names an encoded
611: character set that it can read or write.
612: The \attr{code} attribute is used to identify the character set
613: using the controlled vocabulary OLAC-Encoding.
614: 
615: \item[\elt{Format.markup}:]
616: {\bf The OAI identifier for the definition of the markup format.}
617: \elt{Format.markup} provides
618: an OAI identifier for an XML DTD, schema or some other definition
619: of the markup format.  (This has the side-effect of ensuring that
620: the format definition is archived somewhere.)
621: For a software resource,
622: \elt{Format.markup} names a markup scheme that it can read or write.
623: The \attr{code} attribute identifies the markup scheme
624: using the controlled vocabulary OLAC-Markup.
625: 
626: \item[\elt{Format.os}:]
627: {\bf The operating system required to use a software resource.}
628: The \attr{code} attribute is used to identify the operating system using the
629: controlled vocabulary OLAC-OS.  Additional restrictions on the
630: operating system version may be specified using the element content.
631: 
632: \item[\elt{Format.sourcecode}:]
633: {\bf The programming language(s) of software distributed in source form.}
634: The \attr{code} attribute identifies the language using the controlled
635: vocabulary OLAC-Sourcecode.
636: 
637: \item[\elt{Identifier}:]
638: {\bf An unambiguous reference to the resource within a given context.}
639: Recommended best practice is to identify the resource by means of a
640: string or number conforming to a globally-known formal identification system
641: (e.g. URIs, ISBNs).
642: For non-digital archives, Identifier may use
643: the existing scheme for locating a resource within the collection.
644: 
645: \item[\elt{Language}:]
646: {\bf A language of the intellectual content of the resource.}
647: \elt{Language} is used for a language the resource is in, as opposed to the
648: language it describes (see \elt{Subject.language}).
649: It identifies a language that the creator of the
650: resource assumes that its eventual user will understand.
651: The \attr{code} attribute is used to make a precise
652: identification of the language using the controlled vocabulary OLAC-Language.
653: 
654: \item[\elt{Publisher}:]
655: {\bf An entity responsible for making the resource available.}
656: Examples of a publisher include a person, an organization, or a
657: service.
658: 
659: \item[\elt{Relation}:]
660: {\bf A reference to a related resource.}
661: This element is used to document relationships between resources.
662: The \attr{refine} attribute is used to refine the nature of the
663: relationship using values from a controlled vocabulary (for instance, is
664: replaced by, requires, is part of, and so on). The vocabulary for refinements
665: to Relation is defined in \cite{DCQ00}.
666: 
667: \item[\elt{Rights}:]
668: {\bf Information about rights held in and over the resource.}
669: Typically, a \elt{Rights} element will contain a rights management
670: statement for the resource, or reference a service providing such information.
671: Rights information often encompasses intellectual property rights (IPR),
672: copyright, and various property rights.
673: The \attr{code} attribute is used to make a summary statement
674: about rights using the controlled vocabulary OLAC-Rights.
675: 
676: \item[\elt{Rights.software}:]
677: {\bf Information about rights held in and over a software resource.}
678: A rights statement pertaining to software, using the controlled
679: vocabulary OLAC-Software-Rights.
680: 
681: \item[\elt{Source}:]
682: {\bf A reference to a resource from which the present resource is derived.}
683: For instance, it
684: may be the bibliographic information about a printed book of which this is the
685: electronic encoding or from which the information was extracted.
686: 
687: \item[\elt{Subject}:]
688: {\bf The topic of the content of the resource.}
689: Typically, a Subject will be expressed as keywords, key phrases or
690: classification codes that describe a topic of the resource. Recommended best
691: practice is to select a value from a controlled vocabulary or formal
692: classification scheme.
693: 
694: \item[\elt{Subject.language}:]
695: {\bf A language which the content of the resource describes or discusses.}
696: As with the Language element, a \attr{code} attribute is
697: used to identify the language precisely.
698: 
699: \item[\elt{Title}:]
700: {\bf A name given to the resource.}
701: Typically, a title will be a name by which the resource is formally known.
702: A translation of the title can be supplied in a second \elt{Title} element.
703: The \attr{lang} attribute is used to identify the language of these elements.
704: 
705: \item[\elt{Type}:]
706: {\bf The nature or genre of the content of the resource.}
707: The \attr{code} attribute is used to identify the type using the
708: Dublin Core controlled vocabulary DC-Type.
709: 
710: \item[\elt{Type.data}:]
711: {\bf The nature or genre of the content of the resource, from a linguistic
712: standpoint.}
713: Type includes terms describing general categories, functions, genres,
714: or aggregation levels for content.
715: The \attr{code} attribute is used to identify the type using the
716: controlled vocabulary OLAC-Data.
717: 
718: \item[\elt{Type.functionality}:]
719: {\bf The functionality of a software resource.}
720: The \attr{code} attribute is used to identify the type using the
721: controlled vocabulary OLAC-Functionality.
722: \end{description}
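
As a minimal sketch of how the three attributes might appear in practice,
the following fragment uses \attr{lang}, \attr{refine} and \attr{code} on a
few of the elements above.  The titles and the creator are invented for
illustration, and the fragment assumes that \attr{lang} takes the same style
of language code as \attr{code}.

{\small\begin{verbatim}
<Title>A Sample Wordlist</Title>
<Title lang="fr">Liste de mots</Title>
<Creator refine="Author">Doe, Jane</Creator>
<Language code="en"/>
\end{verbatim}}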
723: 
724: Observe that some elements, such as \elt{Format}, \elt{Format.encoding}
725: and \elt{Format.markup}
726: are applicable to software as well as to data.  Service providers can exploit
727: this feature to match data with appropriate software tools.
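
To illustrate how such matching might work, consider two hypothetical
records, one for a text collection and one for a software tool.  Both titles
are invented, the namespace declarations are omitted for brevity, and the
code \code{iso-8859-1} is the OLAC-Encoding example from
Section~\ref{sec:cv}.  Since both records carry the same
\elt{Format.encoding} code, a service provider could infer that the tool is
able to read the text.

{\small\begin{verbatim}
<!-- hypothetical data resource -->
<olac>
  <Title>Sample text collection</Title>
  <Format.encoding code="iso-8859-1"/>
</olac>

<!-- hypothetical software resource -->
<olac>
  <Title>Sample concordancer</Title>
  <Format.encoding code="iso-8859-1"/>
  <Format.os code="Unix/Linux"/>
</olac>
\end{verbatim}}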
728: 
729: \subsection{The controlled vocabularies}
730: \label{sec:cv}
731: 
732: Controlled vocabularies are enumerations of legal values for the
733: \attr{code} attribute.  In some cases, more than one value applies,
734: in which case the corresponding element must be repeated, once for each
735: applicable value.  In other cases, no value is applicable and
736: the corresponding element is simply omitted.  In yet other cases, the
737: controlled vocabulary may fail to provide a suitable item, in which case
738: a similar item can be optionally specified and a prose comment included in the
739: element content.
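
The first and third cases can be sketched as follows.  The fragment is
illustrative only: it assumes a software resource distributed for two CPUs,
and it assumes, purely for the sake of the example, that OLAC-Sourcecode has
no entry for the Expect language, so that the similar item \code{Tcl} is
chosen and a prose comment is placed in the element content.

{\small\begin{verbatim}
<!-- two values apply: repeat the element -->
<Format.cpu code="x86"/>
<Format.cpu code="ppc"/>

<!-- no suitable item: similar code + comment -->
<Format.sourcecode code="Tcl">Expect (a Tcl
  extension)</Format.sourcecode>
\end{verbatim}}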
740: 
741: \subsubsection{OLAC-Language}
742: 
743: Language identification is an important dimension of language resource
744: classification. However, the character-string representation of language names
745: is problematic for several reasons:
746: different languages (in different parts of the world) may have the
747: same name;
748: the same language may have a different name in each country where
749: it is spoken;
750: within the same country, the preferred name for a language may
751: change over time;
752: in the early history of discovering new languages (before names
753: were standardized), different people referred to the same language by different
754: names; and
755: for languages having non-Roman orthographies, the language name
756: may have several possible romanizations.
757: Together, these facts suggest that a standard based
758: on names will not work.
759: Instead, we need a standard based on unique identifiers
760: that do not change, combined with accessible documentation that
761: clarifies the particular speech variety denoted by each identifier.
762: 
763: The information technology community has a standard for language
764: identification, namely, ISO 639 \cite{ISO639}. Part 1 of this standard
765: lists two-letter codes for identifying 160 of the world's major
766: languages; part 2 of the standard lists three-letter codes for identifying
767: about 400 languages. ISO 639 in turn forms the core of another standard, RFC
768: 3066 (formerly RFC 1766), which is the
769: standard used for language identification in the xml:lang attribute of XML and
770: in the language element of the Dublin Core metadata set.  RFC 3066
771: provides a mechanism for users to register new language identification codes
772: for languages not covered by ISO 639, but very few additional languages have
773: been registered.
774: 
775: Unfortunately, the existing standard falls far short of meeting the
776: needs of the language resources community since it fails to account for more
777: than 90\% of the world's languages, and it fails to adequately document what
778: languages the codes refer to \cite{Simons00}. However, SIL's Ethnologue
779: \cite{Grimes00} provides a complete system of language identifiers which
780: is openly available on the Web. OLAC will employ the RFC 3066 extension
781: mechanism to build additional language identifiers based on the Ethnologue
782: codes.  For the 130-plus ISO-639-1 codes having a one-to-one mapping onto
783: Ethnologue codes, OLAC will support both.  Where an ISO code is ambiguous
784: -- such as \code{mkh} for ``other Mon Khmer languages'' --
785: OLAC will require the Ethnologue code.
786: New identifiers for ancient languages, currently being developed by
787: LINGUIST List, will be incorporated.
788: These language identifiers are expressed using the \attr{code} attribute of the
789: \elt{Language} and \elt{Subject.language} elements.
790: The free-text content of these elements may be used to specify an
791: alternative human-readable name for the language (where the name
792: specified by the standard is unacceptable for some reason)
793: or to specify a dialect (where the resource is dialect-specific).
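
A hypothetical fragment for a description of Australian English that is
itself written in English might look as follows.  The two-letter codes follow
the style used in Figure~\ref{fig:olac-record}; the form of the
Ethnologue-based identifiers is not illustrated here, and the choice of
resource is an assumption made for the example.

{\small\begin{verbatim}
<!-- language the resource is written in -->
<Language code="en"/>

<!-- language described; dialect as content -->
<Subject.language code="en">Australian
  English</Subject.language>
\end{verbatim}}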
794: 
795: \subsubsection{OLAC-Data}
796: 
797: After language identification, another dimension of central importance is
798: the linguistic type of a resource.  Notions such as ``lexicon'' and
799: ``grammar'' are fundamental to OLAC, and the discourse of the
800: language resources community depends on shared assumptions about what
801: these types mean.
802: 
803: We believe that it is helpful to distinguish at least four top-level types:
804: \code{transcription}, \code{annotation}, \code{description} and
805: \code{lexicon}, each defined broadly as proposed below.
806: A \code{transcription} is
807: any time-ordered symbolic representation of a linguistic event.
808: An \code{annotation} is any kind of structured linguistic information that is
809: explicitly aligned to some spatial and/or temporal
810: extent of a linguistic record (such as a recorded signal or an image).
811: A \code{description} is any description or analysis of a language; unlike a
812: transcription or an annotation, the structure of a
813: description is independent of the structure of the
814: linguistic events that it describes.
815: A \code{lexicon} is any record-structured inventory of linguistic forms.
816: 
817: For each of these top-level types we envision a more specific vocabulary
818: to facilitate greater precision.  For example, an orthographic
819: transcription would have the code \code{transcription/orthographic}.
820: Other subtypes could include: \code{phonetic}, \code{prosodic},
821: \code{morphological}, \code{gestural}, \code{part-of-speech},
822: \code{syntactic}, \code{discourse}, \code{musical}.  The \code{annotation}
823: type would include these subtypes, and add others
824: to cover spatial annotation of images (e.g. for OCR annotation of textual
825: images or for isogloss maps).
826: 
827: The \code{description} type could have subtypes for
828: \code{grammatical}, \code{phonological}, \code{orthographic}, 
829: \code{paradigms}, \code{pedagogical}, \code{dialectal} and
830: \code{comparative}.  The \code{lexicon} type could also carry
831: subtypes to distinguish wordlists, wordnets, thesauri and
832: so forth.
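
Under this proposal, \elt{Type.data} elements might be coded as in the sketch
below.  Only \code{transcription/orthographic} is given explicitly above; the
other codes simply assume that the same type/subtype pattern would be applied
to the subtypes listed, and that a bare top-level type is permitted when no
subtype is specified.

{\small\begin{verbatim}
<Type.data
  code="transcription/orthographic"/>
<Type.data
  code="annotation/part-of-speech"/>
<Type.data code="description/grammatical"/>
<Type.data code="lexicon"/>
\end{verbatim}}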
833: 
834: \subsubsection{Other controlled vocabularies}
835: 
836: \begin{description}\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}
837: \item[OLAC-CPU:]
838: A vocabulary for identifying the CPU(s) for which the software is
839: available, in the case of binary distributions:
840: \code{x86}, \code{mips}, \code{alpha}, \code{ppc}, \code{sparc}, \code{680x0}.
841: 
842: \item[OLAC-Encoding:]
843: A vocabulary for identifying the character encoding used by a digital
844: resource, e.g. \code{iso-8859-1}, ...
845: 
846: \begin{figure*}[tb]
847: {\small\begin{verbatim}
848: <?xml version="1.0" encoding="UTF-8"?>
849: <olac
850:   xmlns="http://www.language-archives.org/OLAC/0.3/"
851:   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
852:   xsi:schemaLocation="http://www.language-archives.org/OLAC/0.3/
853:                 http://www.language-archives.org/OLAC/olac-0.3b1.xsd">
854:   <Title>KPML</Title>
855:   <Identifier>http://www.purl.org/net/kpml/</Identifier>
856:   <Creator refine="Author">Bateman, John</Creator>
857:   <Subject.language code="es"/> <Subject.language code="ru"/>
858:   <Subject.language code="ja"/> <Subject.language code="el"/>
859:   <Subject.language code="de"/> <Subject.language code="fr"/>
860:   <Subject.language code="en"/> <Subject.language code="cs"/>
861:   <Subject.language code="bg"/>
862:   <Format.os code="MSWindows/winNT"/> <Format.os code="MSWindows/win95"/>
863:   <Format.os code="MSWindows/win98"/> <Format.os code="Unix/Solaris"/>
864:   <Type.functionality>Annotation Tools, Grammars, Lexica, Development Tools,
865:     Formalisms, Theories, Deep Generation, Morphological Generation,
866:     Shallow Generation</Type.functionality>
867:   <Relation refine="Requires">Windows: none; Solaris: CommonLisp + CLIM</Relation>
868:   <Description>Natural Language Generation Linguistic Resource Development and
869:     Maintenance workbench for large scale generation grammar development,
870:     teaching, and experimental generation. Based on systemic-functional
871:     linguistics. Descendent of the Penman NLG system.</Description>
872: </olac>
873: \end{verbatim}}
874: \caption{OLAC Metadata Record for KPML}
875: \label{fig:olac-record}
876: \end{figure*}
877: 
878: \item[OLAC-Format:]
879: A vocabulary for identifying the manifestation of the resource.
880: The representation is inspired by MIME types, e.g. \code{text/sf} for
881: SIL standard format.  (\elt{Format.markup} is used to identify the particular
882: tagset.)  It may be necessary to add new types and subtypes to cover
883: non-digital holdings, such as manuscripts, microforms, and so forth,
884: and we expect to be able to incorporate an existing vocabulary.
885: 
886: \item[OLAC-Functionality:]
887: A vocabulary for classifying the functionality of software,
888: again using the MIME style of representation, and using the
889: HLT Survey as a source of categories \cite{Cole97} as advocated
890: by the ACL/DFKI Natural Language Software Registry.  For example,
891: \code{written/OCR} would cover ``written language input, print or
892: handwriting optical character recognition.''
893: 
894: \item[OLAC-OS:]
895: A vocabulary for identifying the operating system(s) for which the software
896: is available:
897: \code{Unix}, \code{MacOS}, \code{OS2}, \code{MSDOS}, \code{MSWindows}.
898: Each of these has optional subtypes, e.g.
899: \code{Unix/Linux}, \code{MSWindows/winNT}.
900: 
901: \item[OLAC-Rights:]
902: A vocabulary for classifying the rights held over a resource, e.g.:
903: \code{open}, \code{restricted}, ...
904: 
905: \item[OLAC-Role:]
906: A vocabulary for identifying the role of a contributor or creator of the
907: resource, e.g.: \code{author}, \code{editor}, \code{translator},
908: \code{transcriber}, \code{sponsor}, ...
909: 
910: \item[OLAC-Software-Rights:]
911: A vocabulary for classifying the rights held over a software resource, e.g.:
912: \code{open-source}, \code{royalty-free-library},
913: \code{royalty-free-binary}, \code{commercial}, ...
914: 
915: \item[OLAC-Sourcecode:]
916: A vocabulary for identifying the programming language(s) used by
917: software which is distributed in source form, e.g.:
918: \code{C++}, \code{Java}, \code{Python}, \code{Tcl}, \code{VB}, ...
919: 
920: \end{description}
921: 
922: \section{XML Representation}
923: 
924: The OLAC metadata format consists of an XML schema for the element
925: set, and a set of schemas for the controlled vocabularies.  The
926: latest versions are available from the OLAC website.
927: 
928: Figure~\ref{fig:olac-record} shows the OLAC metadata record
929: corresponding to the KPML display from Figure~\ref{fig:sp}.
930: The top element is \elt{olac}; this references the XML namespace
931: for version 0.3b1 of the schema.  The contents of the \elt{olac}
932: element are the OLAC metadata elements, which are optional and repeatable,
933: and can occur in any order, as in Dublin Core.
934: 
935: Some elements employ the optional \attr{code} or \attr{refine} attributes,
936: and/or free-text content.  The third attribute, \attr{lang}, is not used
937: here since the free-text content is in English (specified in the XML
938: schema as the default).  For the \elt{Creator} element, the \attr{refine}
939: attribute narrows the meaning of creator to \code{Author}.  For the
940: \elt{Subject.language} elements, the \attr{code} attribute specifies
941: nine languages using two-letter ISO 639-1 codes.  A service provider would map these
942: codes to human-readable names.
943: 
944: The \elt{Format.os} element illustrates a two-level coding scheme,
945: consisting of an OS ``family'', followed by a specific operating
946: system.  Further details can be included in the free-text content if
947: necessary.  If a piece of software runs on all members of an OS family,
948: then the more detailed designation can be omitted, e.g. \attr{code="Unix"}.
949: The \elt{Type.functionality} element is specified using free-text content,
950: since the details of the controlled vocabulary OLAC-Functionality are still
951: being worked out.
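
The two levels of coding, and the use of free-text content for further
detail, can be sketched as follows.  The version restriction shown is a
hypothetical example and is not taken from any actual record.

{\small\begin{verbatim}
<!-- any member of the Unix family -->
<Format.os code="Unix"/>

<!-- a specific system, with a free-text note -->
<Format.os code="Unix/Linux">kernel 2.2 or
  later</Format.os>
\end{verbatim}}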
952: 
953: \section{Conclusions}
954: 
955: The OLAC Metadata Set and controlled vocabularies are works in progress,
956: and continue to be
957: revised with input from participating archives and members
958: of the wider language resources community.  We hope to have provided
959: sufficient motivation and exemplification for our choices so that readers
960: will easily be able to contribute to ongoing developments.
961: 
962: Even once OLAC is completely in place, there will still be documentation tasks
963: which the creators of language resources will have to undertake, and new habits to
964: acquire.  It will always be necessary to identify and manually correct
965: inconsistent or erroneous metadata.  The OLAC controlled vocabularies will
966: need to be refined indefinitely in response to changes in the world around us.
967: The creators of language resources will need to
968: generate metadata with each new resource and place the resource in a
969: suitable archive.  The communities will need to adopt best practices for
970: archival storage formats.
971: 
972: Despite these intrinsic limitations,
973: the OLAC Metadata Set and controlled vocabularies offer a \emph{template}
974: for resource description, providing two clear benefits over traditional
975: full-text description and retrieval.  First, the template guides the
976: resource creator in giving a \emph{complete description} of the resource,
977: in contrast to prose descriptions which may omit important details.
978: And second, the template associates a resource with \emph{standard labels},
979: such as \elt{Creator} and \elt{Title}, permitting users to do focussed
980: searching.
981: Resources and repositories can proliferate, yet common metadata and
982: vocabularies will support centralized services giving users easy access to
983: language resources.
984: 
985: \raggedright\small
986: \bibliographystyle{acl}
987: 
988: \begin{thebibliography}{}
989: 
990: \bibitem[\protect\citename{Alvestrand}2001]{Alvestrand01}
991: Harald Alvestrand.
992: \newblock 2001.
993: \newblock {RFC} 3066: Tags for the identification of languages (replaces 1766).
994: \newblock \url{ftp://ftp.isi.edu/in-notes/rfc3066.txt}.
995: 
996: \bibitem[\protect\citename{B\'anik and Bird}2001]{BanikBird01}
997: \'Eva B\'anik and Steven Bird.
998: \newblock 2001.
999: \newblock {LDC} experimental {OLAC} service provider.
1000: \newblock \url{http://wave.ldc.upenn.edu/OLAC/sp-0.2/sp.php4}.
1001: 
1002: \bibitem[\protect\citename{Cole}1997]{Cole97}
1003: Ronald Cole, editor.
1004: \newblock 1997.
1005: \newblock {\em Survey of the State of the Art in Human Language Technology}.
1006: \newblock Studies in Natural Language Processing. Cambridge University Press.
1007: \newblock \url{http://cslu.cse.ogi.edu/HLTsurvey/}.
1008: 
1009: \bibitem[\protect\citename{{DCMI}}1999]{DCMES99}
1010: {DCMI}.
1011: \newblock 1999.
1012: \newblock {Dublin Core Metadata Element Set}, version 1.1: Reference
1013:   description.
1014: \newblock \url{http://dublincore.org/documents/1999/07/02/dces/}.
1015: 
1016: \bibitem[\protect\citename{{DCMI}}2000a]{DCQ00}
1017: {DCMI}.
1018: \newblock 2000a.
1019: \newblock {Dublin Core} qualifiers.
1020: \newblock \url{http://dublincore.org/documents/2000/07/11/dcmes-qualifiers/}.
1021: 
1022: \bibitem[\protect\citename{{DCMI}}2000b]{DCQHTML00}
1023: {DCMI}.
1024: \newblock 2000b.
1025: \newblock Recording qualified {Dublin Core} metadata in {HTML}.
1026: \newblock \url{http://dublincore.org/documents/2000/08/15/dcq-html/}.
1027: 
1028: \bibitem[\protect\citename{Grimes}2000]{Grimes00}
1029: Barbara~F. Grimes, editor.
1030: \newblock 2000.
1031: \newblock {\em Ethnologue: Languages of the World}.
1032: \newblock Dallas: Summer Institute of Linguistics, 14th edition.
1033: \newblock \url{http://www.sil.org/ethnologue/}.
1034: 
1035: \bibitem[\protect\citename{{ISO}}1998]{ISO639}
1036: {ISO}.
1037: \newblock 1998.
1038: \newblock {ISO} 639: Codes for the representation of names of languages---{P}art 2:
1039:   Alpha-3 code.
1040: \newblock \url{http://lcweb.loc.gov/standards/iso639-2/langhome.html}.
1041: 
1042: \bibitem[\protect\citename{Lagoze and Van de Sompel}2001]{LagozeVandeSompel01}
1043: Carl Lagoze and Herbert~Van de~Sompel.
1044: \newblock 2001.
1045: \newblock The {Open Archives Initiative}: Building a low-barrier
1046:   interoperability framework.
1047: \newblock \url{http://www.cs.cornell.edu/lagoze/papers/oai-jcdl.pdf}.
1048: 
1049: \bibitem[\protect\citename{{MPI ISLE Team}}2000]{IMDI00}
1050: {MPI ISLE Team}.
1051: \newblock 2000.
1052: \newblock {ISLE} meta data elements for session descriptions proposal.
1053: \newblock
1054:   \url{http://www.mpi.nl/world/ISLE/documents/draft/ISLE_Metadata_2.0.pdf}.
1055: 
1056: \bibitem[\protect\citename{Simons}2000]{Simons00}
1057: Gary Simons.
1058: \newblock 2000.
1059: \newblock Language identification in metadata descriptions of language archive
1060:   holdings.
1061: \newblock In Steven Bird and Gary Simons, editors, {\em Proceedings of the
1062:   Workshop on Web-Based Language Documentation and Description}.
1063: \newblock \url{http://www.ldc.upenn.edu/exploration/expl2000/papers/simons/}.
1064: 
1065: \bibitem[\protect\citename{Svenonius}2000]{Svenonius00}
1066: Elaine Svenonius.
1067: \newblock 2000.
1068: \newblock {\em The Intellectual Foundation of Information Organization}.
1069: \newblock The MIT Press.
1070: 
1071: \end{thebibliography}
1072: 
1073: \end{document}
1074: