0105:cs0105030/olac.tex

1: \documentclass[11pt]{article}

2: \usepackage{acl2001,times,hyphen}

3: \setlength\titlebox{6.5cm}

4:

5: \usepackage{times,epsfig,boxedminipage,url}

6:

7: \title{The OLAC Metadata Set and Controlled Vocabularies}

8: \author{Steven Bird \\

9:   Linguistic Data Consortium \\

10:   University of Pennsylvania \\

11:   3615 Market Street, Suite 200 \\

12:   Philadelphia, PA 19104-2608, USA \\

13:   {\tt sb@ldc.upenn.edu} \And

14: Gary Simons \\

15:   SIL International \\

16:   7500 West Camp Wisdom Road \\

17:   Dallas, TX 75236, USA \\

18:   {\tt Gary\_Simons@sil.org}}

19:

20: \date{}

21:

22: \def\myurl#1{{[\small\url{#1}]}}

23:

24: \def\elt#1{{\small\sf #1}}

25: \def\attr#1{{\small\sf #1}}

26: \def\code#1{{\small\sf #1}}

27:

28: \begin{document}

29: \maketitle

30:

31: \begin{abstract}

32: As language data and associated technologies proliferate and

33: as the language resources community rapidly expands,

34: it has become difficult to locate and reuse existing

35: resources.  Are there any lexical resources for such-and-such a language?

36: What tool can work with transcripts in this particular

37: format?  What is a good format to use for linguistic data of this type?

38: Questions like these dominate many mailing lists, since web search engines are

39: an unreliable way to find language resources.

40: This paper describes a new digital infrastructure for language resource

41: discovery, based on the Open Archives Initiative, and called

42: OLAC -- the Open Language Archives Community.

43: The OLAC Metadata Set and the associated controlled

44: vocabularies facilitate consistent description and focussed searching.

45: We report progress on the metadata set and controlled vocabularies, describing

46: current issues and soliciting input from the language

47: resources community.

48: \end{abstract}

49:

50: \section{Introduction}

51:

52: Language technology and the linguistic sciences are

53: confronted with a vast array of \emph{language resources},

54: richly structured, large and diverse.

55: Multiple \emph{communities} depend on language resources, including

56: linguists, engineers, teachers and actual speakers.

57: Many individuals and institutions provide key pieces of the infrastructure,

58: including archivists, software developers, and publishers.

59: Today we have unprecedented opportunities to \emph{connect}

60: these communities to the language resources they need.

61: First, inexpensive mass storage technology permits large resources to

62: be stored in digital form, while

63: the Extensible Markup Language (XML) and Unicode provide flexible

64: ways to represent structured data and ensure its long-term survival.

65: Second, digital publication -- both on and off the World Wide Web --

66: is the most practical and efficient means of sharing language resources.

67: Finally, a standard resource description model, the Dublin Core Metadata

68: Set, together with an interchange method provided by the Open Archives

69: Initiative (OAI), makes it possible to construct a union catalog over multiple

70: repositories and archives.

71:

72: In December 2000, an NSF-funded workshop on Web-Based Language

73: Documentation and Description, held in Philadelphia, brought together a

74: group of nearly 100 language software developers, linguists, and archivists

75: who are responsible for creating language resources in North America, South

76: America, Europe, Africa, the Middle East, Asia and Australia

77: \myurl{http://www.ldc.upenn.edu/exploration/expl2000/}.

78: The outcome of the workshop was the founding of the

79: Open Language Archives Community (OLAC),

80: an application of the OAI to digital archives of

81: language resources, with the following purpose:

82:

83: \begin{quote}

84: OLAC, the Open Language Archives Community, is an international partnership

85: of institutions and individuals who are creating a worldwide virtual

86: library of language resources by: (i)~developing consensus on best current

87: practice for the digital archiving of language resources, and

88: (ii)~developing a network of interoperating repositories and services for

89: housing and accessing such resources.

90: \end{quote}

91:

92: This paper will describe the leading ideas that motivate OLAC, before

93: focussing on the metadata set and the controlled vocabularies which

94: implement part (ii) of OLAC's statement of purpose.

95: Metadata elements of special interest to the language resources community

96: include such things as language identification

97: and language resource type.  The corresponding controlled vocabularies

98: ensure consistent description.  For example, French language resources

99: are specified using an official RFC-3066 designation \cite{Alvestrand01},

100: instead of multiple

101: distinct text strings like ``French'', ``Francais'' and ``Fran\c{c}ais''.

102: A separate controlled vocabulary exists for resource type, and has

103: items such as \code{annotation/phonetic} and \code{description/grammar}.

104: Services for end-users can map controlled vocabularies onto

105: convenient terminology for any target language.

106: (A live demonstration accompanies this presentation.)

107:

108: \section{Locating Data, Tools and Advice}

109:

110: We can observe that the

111: individuals who use and create language resources

112: are looking for three things: data, tools, and advice.

113: By DATA we mean any information that documents or describes a language,

114: such as a published monograph, a computer data file, or

115: even a shoebox full of hand-written index cards. The information could range

116: in content from unanalyzed sound recordings to fully transcribed and annotated

117: texts to a complete descriptive grammar.

118: By TOOLS we mean computational resources that facilitate creating, viewing,

119: querying, or otherwise using language data. Tools include not just software

120: programs, but also the digital resources that the programs depend on, such as

121: fonts, stylesheets, and document type definitions.

122: By ADVICE we mean any information about

123: what data sources are reliable, what tools are appropriate in a given

124: situation, what practices to follow when creating new data, and so forth.

125: In the context of OLAC, the term \emph{language resource} is broadly

126: construed to include all three of these: data, tools and advice.

127:

128: \begin{figure}

129: \centerline{\includegraphics[width=\linewidth]{vision2.ps}}

130: \caption{In reality the user can't always get there from here}

131: \label{fig:vision2}

132: \end{figure}

133: %

134: Unfortunately, today's user does not have ready access to the resources

135: that are needed. Figure~\ref{fig:vision2}

136: offers a diagrammatic view of the reality.

137: Some archives (e.g.\ Archive 1) do have a site on the internet that the user is

138: able to find, so the resources of that archive are accessible. Other archives

139: (e.g.\ Archive 2) are on the internet, so the user could access them in theory,

140: but the user has no idea they exist, so they are not accessible in practice.

141: Still other archives (e.g.\ Archive 3) are not even on the internet. And there

142: are potentially hundreds of archives (e.g.\ Archive $n$) that the user

143: needs to know about. Tools and advice are out there as well, but are at many

144: different sites.

145:

146: There are many other problems inherent

147: in the current situation. For instance, the user may not be able to find all

148: the existing data about the language of interest because different sites have

149: called it by different names (low \emph{recall}).

150: The user may be swamped with irrelevant resources because search terms

151: have important meanings in other domains (low \emph{precision}).

152: The user may not be able to use an accessible

153: data file because it cannot be matched with the right tools. The user may

154: locate advice that seems relevant but have no basis for judging its merits.

155:

156: \subsection{Bridging the gap}

157:

158: \subsubsection{Why improved web-indexing is not enough}

159:

160: As the internet grows and web-indexing technologies improve, one might hope

161: that a general-purpose search engine would be sufficient to bridge the gap

162: between people and the resources they need, but this is a vain hope.

163: The first reason is that many language resources, such as audio files

164: and software, are not text-based.  The second

165: reason concerns language identification, the single most important

166: property for describing language resources.  If a language has a canonical name

167: which is distinctive as a character string, then the user has a chance of

168: finding any online resources with a search engine.

169: However, the language may have

170: multiple names, possibly due to the vagaries of Romanization, such as a

171: language known variously as Fadicca, Fadicha, Fedija, Fadija, Fiadidja,

172: Fiyadikkya, and Fedicca (giving low recall).

173: The language name may collide with a word which has

174: other interpretations that are vastly more frequent, e.g.\ the language

175: names Mango and Santa Cruz (giving low precision).

176:

177: The third reason why general-purpose search engines are inadequate is

178: the simple fact that much of the material is not,

179: and will not be, documented in free prose on the web.

180: Either people will build systematic catalogues of their resources,

181: or they won't do it at all.

182: Of course, one can always export a back-end database

183: as HTML and let the search engines index the materials.

184: Indeed, encouraging people to document resources and make them

185: accessible to search engines is part of our vision.

186: However, despite the power of web search engines, there remain many

187: instances where people still prefer to use more formal databases to

188: house their data.

189:

190: This last point bears further consideration.  The challenge is to

191: build a system for ``bringing like things together and differentiating among

192: them'' \cite{Svenonius00}.

193: There are two dominant storage

194: and indexing paradigms, one exemplified by traditional databases and one

195: exemplified by the web.  In the case of language resources, the metadata is

196: coherent enough to be stored in a formal database, but sufficiently

197: distributed and dynamic that it is impractical to maintain it centrally.

198: Language resources occupy the middle ground between the two paradigms, neither of which

199: will serve adequately.  A new framework is required that permits the best of

200: both worlds, namely bottom-up, distributed initiatives, along with consistent,

201: centralized finding aids.  The Dublin Core (DC) and the

202: Open Archives Initiative provide the framework we need to ``bridge the gap.''

203:

204: \subsubsection{The Dublin Core Metadata Initiative}

205:

206: The Dublin Core Metadata Initiative began in 1995 to develop

207: conventions for resource discovery on the web

208: \myurl{dublincore.org}.

209: The Dublin Core metadata elements represent a broad, interdisciplinary

210: consensus about

211: the core set of elements that are likely to be widely useful to support

212: resource discovery.  The Dublin Core consists of 15 metadata elements,

213: where each element is optional and repeatable: \elt{Title, Creator, Subject,

214: Description, Publisher, Contributor, Date, Type, Format, Identifier, Source,

215: Language, Relation, Coverage, Rights}.

216: This set can be used to describe resources that

217: exist in digital or traditional formats.

218:

219: In ``Dublin Core Qualifiers'' \cite{DCQ00}

220: two kinds of qualifications are allowed: encoding schemes and refinements. An

221: {\it encoding scheme} specifies a particular controlled vocabulary or notation

222: for expressing the value of an element. The encoding scheme serves to aid a

223: client system in interpreting the exact meaning of the element content. A

224: {\it refinement} makes the meaning of the element more specific.

225: For example,

226: a \elt{Language} element can be {\it encoded}

227: using the conventions of RFC 3066 to unambiguously identify the language

228: in which the resource is written (or spoken).

229: A \elt{Subject} element can be given a language {\it refinement}

230: to restrict its interpretation to concern the language the resource is about.

231:

232: \subsubsection{The Open Archives Initiative}

233:

234: The Open Archives Initiative (OAI)

235: was launched in October 1999 to provide a common framework across

236: electronic preprint archives, and it has since been broadened

237: to include digital repositories of scholarly materials regardless

238: of their type

239: \myurl{www.openarchives.org} \cite{LagozeVandeSompel01}.

240:

241: \begin{figure}

242: \centerline{\includegraphics[width=\linewidth]{white-paper1}}

243: \caption{Bridging the gap through community infrastructure}

244: \label{fig:white-paper1}

245: \end{figure}

246: %

247: In the OAI infrastructure, each participating archive implements a

248: repository -- a network accessible server offering public access

249: to archive holdings. The primary object in an OAI-conformant

250: repository is called an {\it item}, having a unique identifier

251: and being associated with one or more metadata records.

252: Each metadata record describes an archive holding, which is any

253: kind of primary resource such as a document, raw data, software, a

254: recording, a physical artifact, a digital surrogate, and so forth.

255: Each metadata record will usually contain a reference to an entry

256: point for the holding, such as a URL or a physical location,

257: as shown in Figure~\ref{fig:white-paper1}.

258:

259: To implement the OAI infrastructure, a participating archive must comply

260: with two standards: the {\it OAI shared metadata set} (Dublin Core), which

261: facilitates interoperability across all repositories participating in the

262: OAI, and the {\it OAI metadata harvesting protocol}, which allows

263: software services to query a repository using HTTP requests.

264:

265: OAI archives are called ``data providers,'' though they are strictly just

266: {\it metadata} providers. Typically, data providers will also have a

267: submission procedure, together with a long-term storage system, and a

268: mechanism permitting users to obtain materials from the archive. An OAI

269: ``service provider'' is a third party that provides end-user services (such

270: as search functions over union catalogs) based on metadata harvested from

271: one or

272: more OAI data providers.  Figure~\ref{fig:white-paper2}

273: illustrates a

274: single service provider accessing three data providers

275: (using the OAI metadata harvesting protocol).

276: End-users only interact with service providers.

277:

278: Over the past decade, the Linguist List has become the primary

279: source of online information for

280: the linguistics community, reaching out to over 13,000

281: subscribers worldwide, and having four complete mirror sites.

282: The Linguist List will be augmenting its service by hosting the

283: primary service provider for OLAC, and permitting end-users to browse

284: distributed language resources at a single place.

285:

286: \begin{figure}

287: \centerline{\includegraphics[width=\linewidth]{white-paper2.ps}}

288: \caption{A Service Provider Accessing Multiple Data Providers}

289: \label{fig:white-paper2}

290: \end{figure}

291:

292: \subsection{Applying the OAI to language resources}

293:

294: The OAI infrastructure is a new invention;

295: it has the bottom-up, distributed character of the web,

296: while simultaneously having the efficient, structured

297: nature of a centralized database.  This combination is well-suited to

298: the language resource community, where the available data is growing

299: rapidly and where a large user-base is fairly consistent in how it describes

300: its resource needs.

301:

302: The primary outcome of the Philadelphia

303: workshop was the founding of the Open Language

304: Archives Community, and with it the identification of an advisory board, alpha

305: testers and member archives.  Details of these groups are available from

306: the OLAC site \myurl{www.language-archives.org}.

307:

308: Recall that the OAI community is defined by the archives that

309: comply with the OAI metadata harvesting protocol

310: and register with the OAI.

311: Any compliant repository can register as an Open Archive, and

312: the metadata provided by an Open Archive is open to the public.

313: OAI data providers may support metadata standards in addition to the

314: Dublin Core.  Thus, a specialist community can define a metadata format which is

315: specific to its domain.  Service providers, data providers and users that

316: employ this specialized metadata format constitute an OAI \emph{subcommunity}.

317: The workshop participants agreed unanimously that the

318: OAI provides a significant piece of the infrastructure

319: needed for the language resources community.

320:

321: In the same way that OLAC represents a specialized

322: subcommunity with respect to the entire Open Archives community, there are

323: specialized subcommunities within the scope of OLAC.  For

324: instance, the ISLE Meta Data Initiative is developing a detailed metadata

325: scheme for corpora of recorded speech events and their associated descriptions

326: \cite{IMDI00}.

327: Similarly, the language data centers -- the Linguistic Data Consortium (LDC)

328: and the European Language Resources Association (ELRA) -- are using OLAC

329: metadata as the basis of a joint catalog, and will add elements and

330: vocabularies for their specialized needs (price, rights, and categories

331: of membership and use).

332: For archived language resources that are of this kind, such a metadata scheme would

333: support a richer description.  This specialized subcommunity can implement its own

334: service provider that offers focused searching based on its own rich metadata

335: set.  At the same time, the data providers will expose OLAC and

336: Dublin Core versions of the metadata, permitting the resources to be

337: discovered by users of OLAC and OAI service providers.

338:

339: \subsection{Federation and integration of language resource archives}

340:

341: \begin{figure*}[tbhp]

342: \begin{center}

343: {\normalsize

344: \framebox{

345: \begin{tabular}{lp{0.68\textwidth}}

346: \multicolumn{2}{l}{\normalsize\bf oai:ldc:LDC94T5} \\

347: Date:

348: & 1994\\

349: Title:

350: & ECI Multilingual Text\\

351: Type:

352: & text\\

353: Identifier:

354: & 1-58563-033-3\\

355: Subject.language:

356: & Albanian, {\bf Bulgarian}, Chinese, Czech, Dutch,

357:   English, Estonian, French, Gaelic, German, Greek,

358:   Italian, Japanese, Latin, Lithuanian, Malay,

359:   Spanish, Danish, Uzbek, Norwegian, Portuguese,

360:   Russian, Serbian, Swedish, Turkish, Tibetan \\

361: Identifier:

362: & http://www.ldc.upenn.edu/Catalog/LDC94T5.html \\

363: Description:

364: & Recommended Applications:

365: information retrieval, machine translation, language modeling\\[1ex]

366:

367: \multicolumn{2}{l}{\normalsize\bf oai:elra:L0030} \\

368: Title:

369: & Bulgarian Morphological Dictionary \\

370: Date:

371: & 1998 \\

372: Subject.language:

373: & {\bf Bulgarian} \\

374: Description:

375: & 67,500 entries divided into 242 inflectional types

376: (including proper nouns), morphosyntactic information for each

377: entry, and a morphological engine (MS DOS and WINDOWS 95/NT) for

378: morphological analysis and generation \\

379: Identifier:

380: & http://www.icp.inpg.fr/ELRA/cata/text\_det.html\#bulmodic \\[1ex]

381:

382: \multicolumn{2}{l}{\normalsize\bf oai:dfki:KPML}\\

383: Title:

384: & KPML \\

385: Creator:

386: & Bateman and many others \\

387: Subject.language:

388: & Spanish, Russian, Japanese, Greek, German, French, English, Czech, {\bf Bulgarian}\\

389: Format.os:

390: & Windows NT, Windows 98, Windows 95/98, Solaris \\

391: Type.functionality:

392: & Software: Annotation Tools, Grammars, Lexica, Development Tools,

393:   Formalisms, Theories, Deep Generation, Morphological Generation,

394:   Shallow Generation \\

395: Description:

396: & Natural Language Generation Linguistic Resource Development and

397: Maintenance workbench for large scale generation grammar development,

398: teaching, and experimental generation. Based on systemic-functional

399: linguistics. Descendent of the Penman NLG system. \\

400: Identifier:

401: & http://www.purl.org/net/kpml \\

402: Description:

403: & Contact: bateman@uni-bremen.de \\

404: Relation.requires:

405: & Windows: none; Solaris: CommonLisp + CLIM

406: \end{tabular}}}

407: \end{center}

408: \caption{Querying the Prototype Service Provider for Bulgarian Resources}

409: \label{fig:sp}

410: \end{figure*}

411:

412: The OAI framework permits archives to interoperate.  OAI archives support

413: the Dublin Core metadata format and metadata harvesting protocol.  OLAC

414: archives additionally support the OLAC metadata format.  Widespread

415: adoption of these standards will permit language resource archives to

416: be federated and integrated.

417:

418: First, a collection of archives which support the same metadata format can be

419: federated, in the sense that a virtual meta-archive can collect all the

420: information into a single place, and end-users can query multiple archives

421: simultaneously.  To demonstrate this,

422: the Linguistic Data Consortium has harvested the catalogs of

423: three language resource

424: archives (LDC, ELRA, DFKI) and created a prototype service provider.

425: A search for \attr{language=Bulgarian} returns records from all three archives,

426: as shown in Figure~\ref{fig:sp} \cite{BanikBird01}.

427:

428: Second, a collection of archives which support the same metadata format can be

429: integrated, in the sense that relational joins can be performed

430: across different archives.  This permits queries such as:

431: ``find all lexicon tools that understand a format for which Hungarian

432: data is available.''

433:

434: \section{A Core Metadata Set for Language Resources}

435: \label{sec:metadata}

436:

437: The OLAC Metadata Set extends the Dublin Core set only to

438: the minimum degree required to express basic properties

439: of language resources which are useful as finding aids.

440:

441: All fifteen Dublin Core elements are used in the OLAC Metadata Set. In

442: order to suit the specific needs of the language resources community, the

443: elements have been qualified following principles articulated in

444: ``Dublin Core Qualifiers'' \cite{DCQ00}

445: and exemplified in \cite{DCQHTML00}.

446:

447: This section describes some of

448: the attributes, elements and controlled vocabularies of

449: the OLAC Metadata Set.  Before launching into this discussion, we first

450: review some XML terminology and explain some aspects of the OLAC

451: representation which follow directly from our choice of XML.

452:

453: \subsection{Aside: XML representation}

454:

455: The Extensible Markup Language (XML) is the universal format for structured

456: documents and data on the Web \myurl{www.w3.org/XML}.

457: The key building block of an XML document is the \emph{element}.

458: An element has a \emph{name}, \emph{attributes} and \emph{content}.

459: Here is an example of an element \elt{Language} with attributes

460: \attr{refine} and \attr{code}, and free-text content:

461:

462: {\small

463: \begin{verbatim}

464: <Language refine="OLAC" code="x-sil-BAN">

465:   Foreke Dschang</Language>

466: \end{verbatim}

467: }

468:

469: In general, XML elements may contain other elements, or they may be empty.

470: XML Document Type Definitions (DTDs) and XML schemas are grammars that

471: define the structure of a valid XML document,

472: and they limit the arrangement of XML elements in a

473: document.  We believe it is important to use a formal mechanism for validating

474: a metadata record.  Following the OAI, we use XML schemas to specify the OLAC

475: metadata format.

476:

477: XML schemas make it possible for element content and attribute values

478: to be constrained according to the element name.  However, XML schemas do not

479: permit element content to be constrained on the basis of the attribute value.

480: Accordingly, in implementing qualified Dublin Core using XML,

481: we are limited to using

482: one encoding scheme (or controlled vocabulary) per element.

483:
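As a minimal sketch of this limitation, the fragment below shows how a schema
can tie the \attr{code} attribute of one element to a single enumerated
vocabulary.  It is an illustration rather than the actual OLAC schema: the
type name \code{EncodingCode} and the listed values are hypothetical, and the
enclosing schema element and its namespace declarations are omitted.

{\small
\begin{verbatim}
<!-- hypothetical vocabulary type -->
<xs:simpleType name="EncodingCode">
 <xs:restriction base="xs:string">
  <xs:enumeration value="iso-8859-1"/>
  <xs:enumeration value="utf-8"/>
 </xs:restriction>
</xs:simpleType>

<!-- code is typed by element name -->
<xs:element name="Format.encoding">
 <xs:complexType>
  <xs:simpleContent>
   <xs:extension base="xs:string">
    <xs:attribute name="code"
                  type="EncodingCode"/>
   </xs:extension>
  </xs:simpleContent>
 </xs:complexType>
</xs:element>
\end{verbatim}
}

Because the type of \attr{code} is bound to the element name, a schema of this
kind has no way to select a different vocabulary depending on the value of
another attribute such as \attr{refine}.
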

484: There are two cases we need to consider here.  In the case where all

485: refinements of an element employ the same encoding scheme, we use the element

486: name as is and add a \attr{refine} attribute with a fixed value.  This

487: documents that the particular encoding scheme has been used, and ensures that

488: the element cannot be confused with a corresponding unqualified

489: Dublin Core element (see the above example).

490: In the case where different refinements of an element employ different encoding

491: schemes, then a unique element must be defined.  Following

492: \cite{DCQHTML00}, we define such elements by concatenating the

493: Dublin Core element name

494: and the refinement name with an intervening dot.  An example is shown below:

495:

496: {\small

497: \begin{verbatim}

498: <Format.encoding code="iso-8859-1"/>

499: \end{verbatim}

500: }

501:

502: \subsection{Attributes used in implementing the OLAC Metadata Set}

503:

504: Three attributes -- \attr{refine}, \attr{code}, and \attr{lang} -- are used

505: throughout the metadata set to handle most qualifications to Dublin Core. Some

506: elements in the OLAC Metadata Set use the \attr{refine} attribute to identify

507: element refinements. These qualifiers make the meaning of an element narrower

508: or more specific. A refined element shares the meaning of the unqualified

509: element, but with a more restricted scope \cite{DCQ00}.

510:

511: Some elements in the OLAC Metadata Set use the \attr{code} attribute to

512: hold metadata values that are taken from a specific encoding scheme. When an

513: element may take this attribute, the attribute value specifies a precise value

514: for the element taken from a controlled vocabulary or formal notation

515: (\S\ref{sec:cv}).

516: In such cases, the element content may also be used

517: to specify a freeform elaboration of the coded value.

518:
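The following sketch illustrates how these attributes might appear in a
record.  The first element repeats the \elt{Format.encoding} example given
earlier; the free-text elaboration in the second element and the
\attr{refine} value and date in the third are assumptions made for
illustration, not values taken from the OLAC vocabularies.

{\small
\begin{verbatim}
<!-- coded value only -->
<Format.encoding code="iso-8859-1"/>

<!-- coded value plus elaboration
     (content is hypothetical) -->
<Format.encoding code="iso-8859-1">
  Latin-1 with local extensions
</Format.encoding>

<!-- refinement (value assumed) -->
<Date refine="created">1994</Date>
\end{verbatim}
}

In such a record a service provider could match on the attribute values,
while the element content remains available for display to the user.
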

519: Every element in the OLAC Metadata Set may use the \attr{lang} attribute.

520: It specifies the language in which the text in the content of the element is

521: written. The value for the attribute comes from the controlled vocabulary

522: OLAC-Language.

523: By default, the \attr{lang} attribute has

524: the value ``en'', for English. Whenever the language of the element content is

525: other than English, the \attr{lang} attribute should be used to identify the

526: language. By using multiple instances of the metadata elements tagged for

527: different languages, data providers may offer their metadata records in

528: multiple languages.

529:

530: In addition, there is a \attr{lang} attribute on the \verb|<olac>|

531: element that contains the metadata elements for a given metadata record. It

532: lists the languages in which the metadata record is designed to be read. This

533: attribute holds a space-delimited list of language codes.

534: By default, this attribute has

535: the value ``en'', for English, indicating that the record is aimed only at

536: English readers. If an explicit value is given for the attribute, then the

537: record is aimed at readers of all the languages listed.

538:

539: Service providers should use this information in order to offer

540: multilingual views of the metadata. When a metadata record lists only one

541: alternative language, then all elements are displayed (regardless of their

542: individual languages), unless the user has requested to suppress all records in

543: that language. When a metadata record has multiple alternative languages, the

544: user should be able to select one and have display of elements in the other

545: languages suppressed. An element in a language not included in the list of

546: alternatives should always be displayed (for instance, the vernacular title of

547: a work).

548:
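The display rules above can be illustrated with a simplified record.  This is
a sketch only: the titles are invented, namespace declarations are omitted,
and the language codes shown are assumed to be acceptable values of
OLAC-Language.

{\small
\begin{verbatim}
<!-- hypothetical record -->
<olac lang="en de">
  <Title lang="en">Word list</Title>
  <Title lang="de">Wortliste</Title>
  <Title lang="x-sil-BAN">
    (vernacular title)</Title>
</olac>
\end{verbatim}
}

Under these rules, a reader who selects German would see the German title and
the vernacular title, since \code{x-sil-BAN} is not among the alternatives
listed on \verb|<olac>|; the English title would be suppressed.
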

549: \subsection{The elements of the OLAC Metadata Set}

550:

551: In this section we present a synopsis of the elements of the OLAC metadata

552: set.  For each element, we provide a one-sentence definition followed by a

553: brief discussion, systematically borrowing and adapting the definitions

554: provided by the Dublin Core Metadata Initiative \cite{DCMES99}.  Each element

555: is optional and repeatable.

556:

557: \begin{description}\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}

558: \item[\elt{Contributor}:]

559: {\bf An entity responsible for making contributions to the content

560: of the resource.}

561: Examples of a Contributor include a person, an organization, or a

562: service.

563: The \attr{refine} attribute is optionally used to specify the role

564: played by the named entity in

565: the creation of the resource, using the controlled vocabulary OLAC-Role.

566:

567: \item[\elt{Coverage}:]

568: {\bf The extent or scope of the content of the resource.}

569: Coverage will typically include spatial location or temporal period.

570: Where the geographical information is predictable from the language identification,

571: it is not necessary to specify geographic coverage.

572:

573: \item[\elt{Creator}:]

574: {\bf An entity primarily responsible for making the content of the resource.}

575: The \attr{refine} attribute is optionally used to specify the role

576: played by the named entity in

577: the creation of the resource, using the controlled vocabulary OLAC-Role.

578:

579: \item[\elt{Date}:]

580: {\bf A date associated with an event in the life cycle of the resource.}

581: The \attr{refine} attribute is optionally used to refine the meaning

582: of the date using values from a controlled vocabulary (for instance, date of

583: creation versus date of issue versus date of modification, and so on). The

584: vocabulary for refinements to Date is defined in \cite{DCQ00}.

585:

586: \item[\elt{Description}:]

587: {\bf An account of the content of the resource.}

588: Description may include but is not limited to: an abstract, table of

589: contents, reference to a graphical representation of content, or a free-text

590: account of the content.

591:

592: \item[\elt{Format}:]

593: {\bf The physical or digital manifestation of the resource.}

594: Typically, \elt{Format} may include the media-type or dimensions of the

595: resource. \elt{Format} may be used to determine the software, hardware or other

596: equipment needed to use the resource.

597: The \attr{code} attribute identifies

598: the format using the controlled vocabulary OLAC-Format.

599:

600: \item[\elt{Format.cpu}:]

601: {\bf The CPU required to use a software resource.}

602: The \attr{code} attribute identifies the CPU using the

603: controlled vocabulary OLAC-CPU.

604:

605: \item[\elt{Format.encoding}:]

606: {\bf An encoded character set used by a digital resource.}

607: For a digitally encoded text, \elt{Format.encoding} names

608: the encoded character set it uses. For a font,

609: \elt{Format.encoding} names an encoded character set that it is able to render. For a

610: software application, \elt{Format.encoding} names an encoded

611: character set that it can read or write.

612: The \attr{code} attribute is used to identify the character set

613: using the controlled vocabulary OLAC-Encoding.

614:

615: \item[\elt{Format.markup}:]

616: {\bf The OAI identifier for the definition of the markup format.}

617: \elt{Format.markup} provides

618: an OAI identifier for an XML DTD, schema or some other definition

619: of the markup format.  (This has the side-effect of ensuring that

620: the format definition is archived somewhere.)

621: For a software resource,

622: \elt{Format.markup} names a markup scheme that it can read or write.

623: The \attr{code} attribute identifies the markup scheme

624: using the controlled vocabulary OLAC-Markup.

625:

626: \item[\elt{Format.os}:]

627: {\bf The operating system required to use a software resource.}

628: The \attr{code} attribute is used to identify the operating system using the

629: controlled vocabulary OLAC-OS.  Additional restrictions for

630: operating system version may be specified using the element content.

631:

632: \item[\elt{Format.sourcecode}:]

633: {\bf The programming language(s) of software distributed in source form.}

634: The \attr{code} attribute identifies the language using the controlled

635: vocabulary OLAC-Sourcecode.

636:

637: \item[\elt{Identifier}:]

638: {\bf An unambiguous reference to the resource within a given context.}

639: Recommended best practice is to identify the resource by means of a

640: string or number conforming to a globally-known formal identification system

641: (e.g. URIs, ISBNs).

642: For non-digital archives, Identifier may use

643: the existing scheme for locating a resource within the collection.

644:

645: \item[\elt{Language}:]

646: {\bf A language of the intellectual content of the resource.}

647: \elt{Language} is used for a language the resource is in, as opposed to the

648: language it describes (see \elt{Subject.language}).

649: It identifies a language that the creator of the

650: resource assumes that its eventual user will understand.

651: The \attr{code} attribute is used to make a precise

652: identification of the language using the controlled vocabulary OLAC-Language.

653:

654: \item[\elt{Publisher}:]

655: {\bf An entity responsible for making the resource available.}

656: Examples of a publisher include a person, an organization, or a

657: service.

658:

659: \item[\elt{Relation}:]

660: {\bf A reference to a related resource.}

661: This element is used to document relationships between resources.

662: The \attr{refine} attribute is used to refine the nature of the

663: relationship using values from a controlled vocabulary (for instance, is

664: replaced by, requires, is part of, and so on). The vocabulary for refinements

665: to Relation is defined in \cite{DCQ00}.

666:

667: \item[\elt{Rights}:]

668: {\bf Information about rights held in and over the resource.}

669: Typically, a \elt{Rights} element will contain a rights management

670: statement for the resource, or reference a service providing such information.

671: Rights information often encompasses intellectual property rights (IPR),

672: copyright, and various property rights.

673: The \attr{code} attribute is used to make a summary statement

674: about rights using the controlled vocabulary OLAC-Rights.

675:

676: \item[\elt{Rights.software}:]

677: {\bf Information about rights held in and over a software resource.}

678: A rights statement pertaining to software, using the controlled

679: vocabulary OLAC-Software-Rights.

680:

681: \item[\elt{Source}:]

682: {\bf A reference to a resource from which the present resource is derived.}

683: For instance, it

684: may be the bibliographic information about a printed book of which this is the

685: electronic encoding or from which the information was extracted.

686:

687: \item[\elt{Subject}:]

688: {\bf The topic of the content of the resource.}

689: Typically, a Subject will be expressed as keywords, key phrases or

690: classification codes that describe a topic of the resource. Recommended best

691: practice is to select a value from a controlled vocabulary or formal

692: classification scheme.

693:

694: \item[\elt{Subject.language}:]

695: {\bf A language which the content of the resource describes or discusses.}

696: As with the \elt{Language} element, the \attr{code} attribute is

697: used to identify the language precisely using the controlled vocabulary OLAC-Language.

698:

699: \item[\elt{Title}:]

700: {\bf A name given to the resource.}

701: Typically, a title will be a name by which the resource is formally known.

702: A translation of the title can be supplied in a second \elt{Title} element.

703: The \attr{lang} attribute is used to identify the language of these elements.

704:

705: \item[\elt{Type}:]

706: {\bf The nature or genre of the content of the resource.}

707: The \attr{code} attribute is used to identify the type using the

708: Dublin Core controlled vocabulary DC-Type.

709:

710: \item[\elt{Type.data}:]

711: {\bf The nature or genre of the content of the resource, from a linguistic

712: standpoint.}

713: Type includes terms describing general categories, functions, genres,

714: or aggregation levels for content.

715: The \attr{code} attribute is used to identify the type using the

716: controlled vocabulary OLAC-Data.

717:

718: \item[\elt{Type.functionality}:]

719: {\bf The functionality of a software resource.}

720: The \attr{code} attribute is used to identify the type using the

721: controlled vocabulary OLAC-Functionality.

722: \end{description}

723:

724: Observe that some elements, such as \elt{Format}, \elt{Format.encoding}

725: and \elt{Format.markup}

726: are applicable to software as well as to data.  Service providers can exploit

727: this feature to match data with appropriate software tools.
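
As a minimal sketch of such matching, consider fragments of two hypothetical
records, one describing a text collection and one describing a software tool.
The titles are invented for illustration, and \code{iso-8859-1} is the sample
OLAC-Encoding value given in Section~\ref{sec:cv}.

{\small\begin{verbatim}
<!-- data resource (hypothetical) -->
<Title>Sample text collection</Title>
<Format.encoding code="iso-8859-1"/>

<!-- software resource (hypothetical) -->
<Title>Sample concordancer</Title>
<Format.encoding code="iso-8859-1"/>
\end{verbatim}}

Because both fragments carry the same \elt{Format.encoding} code, a service
provider could offer the tool as a candidate for working with the text.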

728:

729: \subsection{The controlled vocabularies}

730: \label{sec:cv}

731:

732: Controlled vocabularies are enumerations of legal values for the

733: \attr{code} attribute.  In some cases, more than one value applies,

734: in which case the corresponding element must be repeated, once for each

735: applicable value.  In other cases, no value is applicable and

736: the corresponding element is simply omitted.  In yet other cases, the

737: controlled vocabulary may fail to provide a suitable item, in which case

738: a similar item can be optionally specified and a prose comment included in the

739: element content.
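
These cases might look as follows for \elt{Format.sourcecode}.  This is only a
sketch: the codes \code{C++} and \code{Tcl} are taken from the sample values
listed for OLAC-Sourcecode, and the second fragment assumes, purely for
illustration, that the vocabulary has no entry for C.

{\small\begin{verbatim}
<!-- two values apply: repeat the element -->
<Format.sourcecode code="C++"/>
<Format.sourcecode code="Tcl"/>

<!-- no suitable item (assumed): use a
     similar item plus a prose comment -->
<Format.sourcecode code="C++">Written
  in C</Format.sourcecode>
\end{verbatim}}

When no value applies at all, the element simply does not appear in the record.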

740:

741: \subsubsection{OLAC-Language}

742:

743: Language identification is an important dimension of language resource

744: classification. However, the character-string representation of language names

745: is problematic for several reasons:

746: different languages (in different parts of the world) may have the

747: same name;

748: the same language may have a different name in each country where

749: it is spoken;

750: within the same country, the preferred name for a language may

751: change over time;

752: in the early history of discovering new languages (before names

753: were standardized), different people referred to the same language by different

754: names; and

755: for languages having non-Roman orthographies, the language name

756: may have several possible romanizations.

757: Together, these facts suggest that a standard based

758: on names will not work.

759: Instead, we need a standard based on unique identifiers

760: that do not change, combined with accessible documentation that

761: clarifies the particular speech variety denoted by each identifier.

762:

763: The information technology community has a standard for language

764: identification, namely, ISO 639 \cite{ISO639}. Part 1 of this standard

765: lists two-letter codes for identifying 160 of the world's major

766: languages; part 2 of the standard lists three-letter codes for identifying

767: about 400 languages. ISO 639 in turn forms the core of another standard, RFC

768: 3066 (formerly RFC 1766), which is the

769: standard used for language identification in the xml:lang attribute of XML and

770: in the language element of the Dublin Core metadata set.  RFC 3066

771: provides a mechanism for users to register new language identification codes

772: for languages not covered by ISO 639, but very few additional languages have

773: been registered.

774:

775: Unfortunately, the existing standard falls far short of meeting the

776: needs of the language resources community since it fails to account for more

777: than 90\% of the world's languages, and it fails to adequately document what

778: languages the codes refer to \cite{Simons00}. However, SIL's Ethnologue

779: \cite{Grimes00} provides a complete system of language identifiers which

780: is openly available on the Web. OLAC will employ the RFC 3066 extension

781: mechanism to build additional language identifiers based on the Ethnologue

782: codes.  For the 130-plus ISO-639-1 codes having a one-to-one mapping onto

783: Ethnologue codes, OLAC will support both.  Where an ISO code is ambiguous

784: -- such as \code{mkh} for ``other Mon Khmer languages'' --

785: OLAC will require the Ethnologue code.

786: New identifiers for ancient languages, currently being developed by

787: LINGUIST List, will be incorporated.

788: These language identifiers are expressed using the \attr{code} attribute of the

789: \elt{Language} and \elt{Subject.language} elements.

790: The free-text content of these elements may be used to specify an

791: alternative human-readable name for the language (where the name

792: specified by the standard is unacceptable for some reason)

793: or to specify a dialect (where the resource is dialect-specific).
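
For instance, a description written in English of a particular variety of
Spanish might be marked up as sketched below.  The two-letter codes follow
those used in Figure~\ref{fig:olac-record}; the dialect name in the element
content is a hypothetical illustration, and the final form of the OLAC-Language
codes may differ.

{\small\begin{verbatim}
<Language code="en"/>
<Subject.language
  code="es">Andalusian</Subject.language>
\end{verbatim}}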

794:

795: \subsubsection{OLAC-Data}

796:

797: After language identification, another dimension of central importance is

798: the linguistic type of a resource.  Notions such as ``lexicon'' and

799: ``grammar'' are fundamental to OLAC, and the discourse of the

800: language resources community depends on shared assumptions about what

801: these types mean.

802:

803: We believe that it is helpful to distinguish at least four top-level types:

804: \code{transcription}, \code{annotation}, \code{description} and

805: \code{lexicon}, each defined broadly as proposed below.

806: A \code{transcription} is

807: any time-ordered symbolic representation of a linguistic event.

808: An \code{annotation} is any kind of structured linguistic information that is

809: explicitly aligned to some spatial and/or temporal

810: extent of a linguistic record (such as a recorded signal or an image).

811: A \code{description} is any description or analysis of a language; unlike a

812: transcription or an annotation, the structure of a

813: description is independent of the structure of the

814: linguistic events that it describes.

815: A \code{lexicon} is any record-structured inventory of linguistic forms.

816:

817: For each of these top-level types we envision a more specific vocabulary

818: to facilitate greater precision.  For example, an orthographic

819: transcription would have the code \code{transcription/orthographic}.

820: Other subtypes could include: \code{phonetic}, \code{prosodic},

821: \code{morphological}, \code{gestural}, \code{part-of-speech},

822: \code{syntactic}, \code{discourse}, \code{musical}.  The \code{annotation}

823: type would include these subtypes, and add others

824: to cover spatial annotation of images (e.g. for OCR annotation of textual

825: images or for isogloss maps).

826:

827: The \code{description} type could have subtypes for

828: \code{grammatical}, \code{phonological}, \code{orthographic},

829: \code{paradigms}, \code{pedagogical}, \code{dialectal} and

830: \code{comparative}.  The \code{lexicon} type could also carry

831: subtypes to distinguish wordlists, wordnets, thesauri and

832: so forth.
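
Under this proposal, the linguistic type of a resource would be recorded in the
\attr{code} attribute of \elt{Type.data}.  The fragment below is a sketch only:
it uses the one fully specified code mentioned above together with a bare
top-level type, since the subtype vocabularies are not yet fixed.

{\small\begin{verbatim}
<!-- an orthographic transcription -->
<Type.data
  code="transcription/orthographic"/>

<!-- a lexicon, with no subtype given -->
<Type.data code="lexicon"/>
\end{verbatim}}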

833:

834: \subsubsection{Other controlled vocabularies}

835:

836: \begin{description}\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}

837: \item[OLAC-CPU:]

838: A vocabulary for identifying the CPU(s) for which the software is

839: available, in the case of binary distributions:

840: \code{x86}, \code{mips}, \code{alpha}, \code{ppc}, \code{sparc}, \code{680x0}.

841:

842: \item[OLAC-Encoding:]

843: A vocabulary for identifying the character encoding used by a digital

844: resource, e.g. \code{iso-8859-1}, ...

845:

846: \begin{figure*}[tb]

847: {\small\begin{verbatim}

848: <?xml version="1.0" encoding="UTF-8"?>

849: <olac

850:   xmlns="http://www.language-archives.org/OLAC/0.3/"

851:   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

852:   xsi:schemaLocation="http://www.language-archives.org/OLAC/0.3/

853:                 http://www.language-archives.org/OLAC/olac-0.3b1.xsd">

854:   <Title>KPML</Title>

855:   <Identifier>http://www.purl.org/net/kpml/</Identifier>

856:   <Creator refine="Author">Bateman, John</Creator>

857:   <Subject.language code="es"/> <Subject.language code="ru"/>

858:   <Subject.language code="ja"/> <Subject.language code="el"/>

859:   <Subject.language code="de"/> <Subject.language code="fr"/>

860:   <Subject.language code="en"/> <Subject.language code="cs"/>

861:   <Subject.language code="bg"/>

862:   <Format.os code="MSWindows/winNT"/> <Format.os code="MSWindows/win95"/>

863:   <Format.os code="MSWindows/win98"/> <Format.os code="Unix/Solaris"/>

864:   <Type.functionality>Annotation Tools, Grammars, Lexica, Development Tools,

865:     Formalisms, Theories, Deep Generation, Morphological Generation,

866:     Shallow Generation</Type.functionality>

867:   <Relation refine="Requires">Windows: none; Solaris: CommonLisp + CLIM</Relation>

868:   <Description>Natural Language Generation Linguistic Resource Development and

869:     Maintenance workbench for large scale generation grammar development,

870:     teaching, and experimental generation. Based on systemic-functional

871:     linguistics. Descendent of the Penman NLG system.</Description>

872: </olac>

873: \end{verbatim}}

874: \caption{OLAC Metadata Record for KPML}

875: \label{fig:olac-record}

876: \end{figure*}

877:

878: \item[OLAC-Format:]

879: A vocabulary for identifying the manifestation of the resource.

880: The representation is inspired by MIME types, e.g. \code{text/sf} for

881: SIL standard format.  (\elt{Format.markup} is used to identify the particular

882: tagset.)  It may be necessary to add new types and subtypes to cover

883: non-digital holdings, such as manuscripts and microforms;

884: we expect to be able to incorporate an existing vocabulary for this purpose.

885:

886: \item[OLAC-Functionality:]

887: A vocabulary for classifying the functionality of software,

888: again using the MIME style of representation, and using the

889: HLT Survey as a source of categories \cite{Cole97} as advocated

890: by the ACL/DFKI Natural Language Software Registry.  For example,

891: \code{written/OCR} would cover ``written language input, print or

892: handwriting optical character recognition.''

893:

894: \item[OLAC-OS:]

895: A vocabulary for identifying the operating system(s) for which the software

896: is available:

897: \code{Unix}, \code{MacOS}, \code{OS2}, \code{MSDOS}, \code{MSWindows}.

898: Each of these has optional subtypes, e.g.

899: \code{Unix/Linux}, \code{MSWindows/winNT}.

900:

901: \item[OLAC-Rights:]

902: A vocabulary for classifying the rights held over a resource, e.g.:

903: \code{open}, \code{restricted}, ...

904:

905: \item[OLAC-Role:]

906: A vocabulary for identifying the role of a contributor or creator of the

907: resource, e.g.: \code{author}, \code{editor}, \code{translator},

908: \code{transcriber}, \code{sponsor}, ...

909:

910: \item[OLAC-Software-Rights:]

911: A vocabulary for classifying the rights held over a software resource, e.g.:

912: \code{open-source}, \code{royalty-free-library},

913: \code{royalty-free-binary}, \code{commercial}, ...

914:

915: \item[OLAC-Sourcecode:]

916: A vocabulary for identifying the programming language(s) used by

917: software which is distributed in source form, e.g.:

918: \code{C++}, \code{Java}, \code{Python}, \code{Tcl}, \code{VB}, ...

919:

920: \end{description}
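
To illustrate how several of these vocabularies might combine, the following
fragment sketches part of a record for a hypothetical software tool that is
distributed both as a binary and in source form.  The title is invented; the
code values are drawn from the sample values listed above.

{\small\begin{verbatim}
<Title>Sample tagger</Title>
<Format.cpu code="x86"/>
<Format.os code="Unix/Linux"/>
<Format.sourcecode code="C++"/>
\end{verbatim}}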

921:

922: \section{XML Representation}

923:

924: The OLAC metadata format consists of an XML schema for the element

925: set, and a set of schemas for the controlled vocabularies.  The

926: latest versions are available from the OLAC website.

927:

928: Figure~\ref{fig:olac-record} shows the OLAC metadata record

929: corresponding to the KPML display from Figure~\ref{fig:sp}.

930: The top element is \elt{olac}; this references the XML namespace

931: for version 0.3b1 of the schema.  The contents of the \elt{olac}

932: element are the OLAC metadata elements, which are optional and repeatable,

933: and can occur in any order, as in Dublin Core.

934:

935: Some elements employ the optional \attr{code} or \attr{refine} attributes,

936: and/or free-text content.  The third attribute, \attr{lang}, is not used

937: here since the free-text content is in English (specified in the XML

938: schema as the default).  For the \elt{Creator} element, the \attr{refine}

939: attribute narrows the meaning of creator to \code{Author}.  For the

940: \elt{Subject.language} elements, the \attr{code} attribute specifies

941: nine languages using two-letter ISO 639 codes.  A service provider would map these

942: codes to human-readable names.

943:

944: The \elt{Format.os} element illustrates a two-level coding scheme,

945: consisting of an OS ``family'', followed by a specific operating

946: system.  Further details can be included in the free-text content if

947: necessary.  If a piece of software runs on all members of an OS family,

948: then the more detailed designation can be omitted, e.g. \attr{code="Unix"}.

949: The \elt{Type.functionality} element is specified using free-text content,

950: since the details of the controlled vocabulary OLAC-Functionality are still

951: being worked out.
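
The two-level scheme for \elt{Format.os} can be sketched as follows.  The codes
come from the OLAC-OS examples in Section~\ref{sec:cv}; the version restriction
in the element content is a hypothetical illustration.

{\small\begin{verbatim}
<!-- runs on the whole Unix family -->
<Format.os code="Unix"/>

<!-- one specific system, with a
     free-text restriction -->
<Format.os code="MSWindows/winNT">version
  4.0 or later</Format.os>
\end{verbatim}}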

952:

953: \section{Conclusions}

954:

955: The OLAC Metadata Set and controlled vocabularies are works in progress,

956: and continue to be

957: revised with input from participating archives and members

958: of the wider language resources community.  We hope to have provided

959: sufficient motivation and exemplification for our choices so that readers

960: will easily be able to contribute to ongoing developments.

961:

962: Even once OLAC is completely in place, there will still be documentation tasks

963: which the creators of language resources will have to undertake, and new habits to

964: acquire.  It will always be necessary to identify and manually correct

965: inconsistent or erroneous metadata.  The OLAC controlled vocabularies will

966: need to be refined indefinitely in response to changes in the world around us.

967: The creators of language resources will need to

968: generate metadata with each new resource and place the resource in a

969: suitable archive.  The communities will need to adopt best practices for

970: archival storage formats.

971:

972: Despite these intrinsic limitations,

973: the OLAC Metadata Set and controlled vocabularies offer a \emph{template}

974: for resource description, providing two clear benefits over traditional

975: full-text description and retrieval.  First, the template guides the

976: resource creator in giving a \emph{complete description} of the resource,

977: in contrast to prose descriptions which may omit important details.

978: And second, the template associates a resource with \emph{standard labels},

979: such as \elt{Creator} and \elt{Title}, permitting users to do focussed

980: searching.

981: Resources and repositories can proliferate, yet common metadata and

982: vocabularies will support centralized services giving users easy access to

983: language resources.

984:

985: \raggedright\small

986: \bibliographystyle{acl}

987:

988: \begin{thebibliography}{}

989:

990: \bibitem[\protect\citename{Alvestrand}2001]{Alvestrand01}

991: Harald Alvestrand.

992: \newblock 2001.

993: \newblock {RFC} 3066: Tags for the identification of languages (replaces 1766).

994: \newblock \url{ftp://ftp.isi.edu/in-notes/rfc3066.txt}.

995:

996: \bibitem[\protect\citename{B\'anik and Bird}2001]{BanikBird01}

997: \'Eva B\'anik and Steven Bird.

998: \newblock 2001.

999: \newblock {LDC} experimental {OLAC} service provider.

1000: \newblock \url{http://wave.ldc.upenn.edu/OLAC/sp-0.2/sp.php4}.

1001:

1002: \bibitem[\protect\citename{Cole}1997]{Cole97}

1003: Ronald Cole, editor.

1004: \newblock 1997.

1005: \newblock {\em Survey of the State of the Art in Human Language Technology}.

1006: \newblock Studies in Natural Language Processing. Cambridge University Press.

1007: \newblock \url{http://cslu.cse.ogi.edu/HLTsurvey/}.

1008:

1009: \bibitem[\protect\citename{{DCMI}}1999]{DCMES99}

1010: {DCMI}.

1011: \newblock 1999.

1012: \newblock {Dublin Core Metadata Element Set}, version 1.1: Reference

1013:   description.

1014: \newblock \url{http://dublincore.org/documents/1999/07/02/dces/}.

1015:

1016: \bibitem[\protect\citename{{DCMI}}2000a]{DCQ00}

1017: {DCMI}.

1018: \newblock 2000a.

1019: \newblock {Dublin Core} qualifiers.

1020: \newblock \url{http://dublincore.org/documents/2000/07/11/dcmes-qualifiers/}.

1021:

1022: \bibitem[\protect\citename{{DCMI}}2000b]{DCQHTML00}

1023: {DCMI}.

1024: \newblock 2000b.

1025: \newblock Recording qualified {Dublin Core} metadata in {HTML}.

1026: \newblock \url{http://dublincore.org/documents/2000/08/15/dcq-html/}.

1027:

1028: \bibitem[\protect\citename{Grimes}2000]{Grimes00}

1029: Barbara~F. Grimes, editor.

1030: \newblock 2000.

1031: \newblock {\em Ethnologue: Languages of the World}.

1032: \newblock Dallas: Summer Institute of Linguistics, 14th edition.

1033: \newblock \url{http://www.sil.org/ethnologue/}.

1034:

1035: \bibitem[\protect\citename{{ISO}}1998]{ISO639}

1036: {ISO}.

1037: \newblock 1998.

1038: \newblock {ISO} 639: Codes for the representation of names of languages-part 2:

1039:   Alpha-3 code.

1040: \newblock \url{http://lcweb.loc.gov/standards/iso639-2/langhome.html}.

1041:

1042: \bibitem[\protect\citename{Lagoze and Van de Sompel}2001]{LagozeVandeSompel01}

1043: Carl Lagoze and Herbert~Van de~Sompel.

1044: \newblock 2001.

1045: \newblock The {Open Archives Initiative}: Building a low-barrier

1046:   interoperability framework.

1047: \newblock \url{http://www.cs.cornell.edu/lagoze/papers/oai-jcdl.pdf}.

1048:

1049: \bibitem[\protect\citename{{MPI ISLE Team}}2000]{IMDI00}

1050: {MPI ISLE Team}.

1051: \newblock 2000.

1052: \newblock {ISLE} meta data elements for session descriptions proposal.

1053: \newblock

1054:   \url{http://www.mpi.nl/world/ISLE/documents/draft/ISLE_Metadata_2.0.pdf}.

1055:

1056: \bibitem[\protect\citename{Simons}2000]{Simons00}

1057: Gary Simons.

1058: \newblock 2000.

1059: \newblock Language identification in metadata descriptions of language archive

1060:   holdings.

1061: \newblock In Steven Bird and Gary Simons, editors, {\em Proceedings of the

1062:   Workshop on Web-Based Language Documentation and Description}.

1063: \newblock \url{http://www.ldc.upenn.edu/exploration/expl2000/papers/simons/}.

1064:

1065: \bibitem[\protect\citename{Svenonius}2000]{Svenonius00}

1066: Elaine Svenonius.

1067: \newblock 2000.

1068: \newblock {\em The Intellectual Foundation of Information Organization}.

1069: \newblock The MIT Press.

1070:

1071: \end{thebibliography}

1072:

1073: \end{document}

1074: