Putting XML to Work in the Library
Tools for Improving Access and Management

DICK R. MILLER and KEVIN S. CLARKE

AMERICAN LIBRARY ASSOCIATION
Chicago 2004
While extensive effort has gone into ensuring the reliability of information appearing in this book, the publisher makes no warranty, express or implied, on the accuracy or reliability of the information, and does not assume and hereby disclaims any liability to any person for any loss or damage caused by errors or omissions in this publication.

Trademarked names appear in the text of this book. Rather than identify or insert a trademark symbol at the appearance of each name, the authors and the American Library Association state that the names are used for editorial purposes exclusively, to the ultimate benefit of the owners of the trademarks. There is absolutely no intention of infringement on the rights of the trademark owners.

Index by Janet Russell
Composition and design by ALA Editions in Times New Roman and Univers 55, using QuarkXPress 5.0 on a PC platform

The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences—Permanence of Paper for Printed Library Materials, ANSI Z39.48-1992. ∞

Library of Congress Cataloging-in-Publication Data
Miller, Dick R.
  Putting XML to work in the library : tools for improving access and management / by Dick R. Miller and Kevin S. Clarke.
       p. cm.
  Includes index.
  ISBN 0-8389-0863-2 (alk. paper)
  1. XML (Document markup language)  2. Libraries—Data processing.  3. Cataloging—Data processing.  4. Digital libraries.  I. Clarke, Kevin S.  II. Title.
Z678.93.X54M55 2003
005.7'2—dc21                    2003013916

Copyright © 2004 by the American Library Association. All rights reserved except those which may be granted by Sections 107 and 108 of the Copyright Revision Act of 1976.

Printed in the United States of America
08 07 06 05 04    5 4 3 2 1
For Love
For Joy
For Spirit
For love, companionship, and patience
For holding my hand, laughter, and big hugs
Contents

Figures
Introduction

1  It’s Elemental, My Dear Watson
     What Is XML?
     Markup Minihistory: Prelude to XML
     Picking Up the Pieces
        Extensible Markup Syntax
        Markup and XML
        Types of Markup
        Elements and Structure
        Elements
        Attributes
        Empty Elements
        Mixed Content
        Entities
        Comments
        Character Data Sections
        White Space
        Character Encoding
        The XML Declaration
        Processing Instructions
        Well-Formed Documents
        Putting It All Together
     All in the Family: Valid Documents
        Document Type Definitions
        XML-Based Schemas
        Namespaces
        Stylesheets
        XHTML
        Browsers and Viewing
        X Whatever
     Libraries’ Strategic Opportunity

2  Generic Aspects of XML
     “The Nice Thing about Standards . . .”
     Getting Started
     What’s in a Name(space)?
     I Need Some (XML) Validation!
        DTDs (Document Type Definitions)
        XML Schemas
        RELAX NG Schemas
     Where Do I Go from Here?
        XPath
        XLink
        XPointer
     Doing It with Style(sheets)
        Getting Started
        XSLT Stylesheets
        XSL FO Stylesheets
        CSS Stylesheets

3  In the Scheme of Things
     Schema Development
        The Development Process
        Step-by-Step
     Library Information in Context
        Open Content
        Open Libraries?
        Think Globally, Act Locally
     Bibliographic and Authority Records
        MARC: The Ultimate Crazy Quilt?
        AACR: Fixity versus Fluidity?
        XOBIS: Simplicity without Sacrifice?

4  XML Tools: What Do You Want to Do Today?
     Open-Source Solutions
     XML Editors: Mark It Up!
        Simple Text Editors
        JEdit
        XMLOperator
        BitFlux Editor
        XML eXchaNGeR (XNGR)
     XML Transformers: The Changing Face of XML
        Saxon
        Cocoon and Xalan
        Kawa and Qexo
     XML Browsers: Not Just Another Pretty Face
        Mozilla
        Amaya
     Conclusion

5  The Future Is Now: Trends and Possibilities
     Trends and Future Standards
        XInclude
        XForms
        Scalable Vector Graphics (SVG)
        DocBook
        VoiceXML
        OpenOffice, AbiWord, and Microsoft Word
     XML Possibilities
        Transitional E-Journals List
        Updating MARC with MARCUTL
        Maintaining PubMed LinkOuts
        MARC to XOBIS in 2003 and Beyond
     Conclusion

References
Index
Figures

1-1   XML’s Inverted Tree Structure
1-2   Annotated Text-Centric XML Fragment
1-3   Annotated Data-Centric XML Document
3-1   Possible Library Schemas in a Transitional Environment
3-2   Working Definitions of XOBIS Principal Elements and Parallel Relationship Attributes
3-3   Basic Organizational Detail of XOBIS
3-4   XOBIS Source-Target Relationships
4-1   Screenshot of JEdit’s XML Tree Functionality
4-2   Screenshot of XMLOperator Editing Data-Centric XML
4-3   An HTML Representation of XOBIS, an XML Schema Developed at Lane Library
4-4   WYSIWYG Editing with the BitFlux Editor
4-5   The BitFlux Editor’s Character Selection Tool
4-6   XML eXchaNGeR’s Explorer
4-7   Active SVG Batik Module Running in XML eXchaNGeR
4-8   XML Transformed Using Saxon and an XSLT Stylesheet
4-9   MathML Displayed in the Amaya Editor/Browser
4-10  Annotations Displayed in the Amaya Editor/Browser
5-1   Black-and-White Version of SVG Image
Introduction

Libraries, places of order, reason, and reflection, are often where people go to seek help in finding information beyond their familiar comfort zones. Today the World Wide Web tempts both users and librarians to click their way to unevaluated, disorganized, and often arbitrary results. Quality, coverage, balance, and retrieval limitations concern librarians, but users increasingly follow the path of least resistance to alluring instant information. Librarians, faced with changing user expectations, are feeling a little uncomfortable. Some are shaken by these seemingly ominous developments and wonder about the role of the library and their careers in the emerging digital environment. Others are concerned about the disorder, lack of control, unbridled competition, impermanence, imprecision, glitzy superficiality, and commercialization of the digital environment.

New technologies raise difficult issues and challenge the status quo. Despite the resulting discomfort, this can be constructive. The emergence of the Web is an upheaval of the type described in the 1962 classic The Structure of Scientific Revolutions (Kuhn 1996). However, this particular paradigm shift hits libraries and librarians closer to home because it has fundamentally changed the way that information is perceived and how it can be shared. The Web’s ascendancy has made information a hot commodity. Librarians, more than others, have reason to be excited because these developments meld so well with our traditional interests. Web-associated technologies provide a wealth of opportunities for investigating what ails information management and for finding better ways to address these problems amid the digital turbulence. Librarians, like researchers who would find the cure for AIDS or cancer, would not be put out of business by their success in eliminating problems, but would gain enhanced professional stature and encounter new opportunities and support for tackling future challenges.
Among web technologies, XML (Extensible Markup Language) stands out as strategically significant to libraries. First and foremost, XML’s syntax, which encourages semantic markup, serves as a foundation for the development of web-based information systems. This applies at multiple levels, from raw documents to the machinery of sophisticated, interactive web interfaces. The full-text content of digital documents can use the same syntax as is used in separate metadata records. Furthermore, the many associated technologies designed specifically for XML facilitate flexible document and data management, processing, merger, presentation, etc. Each of these technologies is optimized for dealing with specific aspects of efficiently managing information on the Web. Harmony results from XML’s shared syntax and the strength of its arsenal of tools.
As a markup language, XML goes far beyond the display markup that has been vital to the Web’s success. HTML’s emphasis on the display properties of markup elements has limited the effectiveness of string and keyword searching. Librarians appreciate the power that full-text indexing and ubiquitous access bring to otherwise dispersed and inaccessible information, but they also immediately recognize limitations that are less obvious to the untrained eye. Savvy librarians can help users find the proverbial needle in the haystack. However, as the haystack continues to grow (and haystacks multiply), librarians also need to focus on strategies for improving access to that portion of the digital resources which really matters—the educational, scientific, cultural, and historically significant materials that have traditionally been the concern of archives, libraries, and museums. XML can serve as a vehicle to address many of the shortcomings of web access. Consider what the search terms in these columns might have in common:

    Ballet          Clark
    Baltimore       Dick
    Brain           George
    Hand            Hannah
    Human           Jay
    Lasagna         John
    Medicine        Joyce
    Nightingale     Rose
    Nurse           Thomas
The terms in the first column could have many associations, or none, depending on how and where they are used in documents. Which of these usages will interest a particular user further depends on the meaning sought. The “success” of a given retrieval depends on many factors, but certainly the elimination of so-called “false drops” would be a considerable improvement. Specifying that a query represents a personal surname would alter the appropriateness of many keyword hits. The terms in the first column above actually represent personal surnames:

    Ballet, Gilbert, 1853–1916
    Baltimore, David, 1938–
    Brain, W. Russell Brain (Walter Russell Brain), Baron, 1895–1966
    Hand, Alfred, 1868–1949
    Human, J. U. (John Urban)
    Lasagna, Louis, 1923–
    Medicine, Anne, 1934–2002
    Nightingale, Florence, 1820–1910
    Nurse, Paul, 1949–

There are currently ways to improve keyword-based retrieval, but the results vary significantly and can be unpredictable, depending on variations occurring in text. Seeking the terms in the second column above on the Web is challenging, even when you know that they too are personal surnames:

    Clark, Dick, 1929–
    Dick, Gladys Henry, 1881–1963
    George, Phyllis, 1949–
    Hannah, Daryl, 1960–
    Jay, John, 1745–1829
    John, Elton, 1947–
    Joyce, James, 1882–1941
    Rose, Charlie, 1942–
    Thomas, Alma, 1891–1978

Knowing whether a forename or surname is present would further narrow the potential result, if such distinctions were made in the markup of documents posted on the Web. Knowing that 1949 is a birth date, publication date, or date of coverage would also make a big difference in controlling search results. Many web documents, even those produced by libraries, go undated. The freedom of the Web is one of its attractions, but for serious information retrieval, a little control is needed in order to maximize its benefits. XML permits, indeed encourages, document markup to make these types of distinctions, in order to support more refined searching and to introduce flexibility into the processing and display of such data.

XML is valuable for more than technically underpinning web enhancements. It can be applied in all areas of information management in libraries, unlike MARC’s narrow focus on cataloging. Defining XML markup, which is essentially an exercise in understanding data structures and their relationships, can result in deeper insights into how various types of information and systems interact in broader contexts. XML thus provides a sort of catalyst for building future information systems, ones with greater flexibility, generality, and sophistication. However, lest the lessons of the past be ignored, new standards for structuring digital information will also require the thoughtful reassessment of traditional practices.
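The retrieval payoff of such distinctions is easy to sketch in code. This is a toy illustration, not any particular library schema; the element names (Names, Person, Surname, Forename, Dates) are invented for the example:

```python
import xml.etree.ElementTree as ET

# Toy records: the string "Dick" occurs once as a surname, once as a forename.
records = """<Names>
  <Person><Surname>Dick</Surname><Forename>Gladys Henry</Forename><Dates>1881-1963</Dates></Person>
  <Person><Surname>Clark</Surname><Forename>Dick</Forename><Dates>1929-</Dates></Person>
</Names>"""

root = ET.fromstring(records)

# A bare keyword search cannot tell the two occurrences apart ...
keyword_hits = [p for p in root
                if "Dick" in ET.tostring(p, encoding="unicode")]

# ... but with semantic markup a query can target the surname alone.
surname_hits = [p for p in root if p.findtext("Surname") == "Dick"]

print(len(keyword_hits), len(surname_hits))   # 2 1
```

A query qualified by element name eliminates exactly the “false drops” that the column lists above illustrate.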
The web environment offers new opportunities to leverage librarians’ combined efforts toward a more significant and effective role in a turbulent digital climate. Libraries have the advantage of being cooperative. A loose network of libraries, archives, and museums circles the globe and covers all subject areas. With greater effort to eliminate artificial barriers to access and minimize redundant work, our collective
efforts could significantly improve the outlook for libraries on the Web. We have a cadre of librarians who provide quality filtering of web resources. Instead of the current fragmentation, our records for these resources, and the content that libraries prepare for the Web, could be better designed for integrated, distributed, and open access. We have the talent, raw materials, and XML. The possibilities are tantalizing.

Libraries have long represented a gift culture. Instead of developing incompatible islands of organization with arbitrary boundaries, segregated user communities, and confusing search variations, we could accelerate the movement away from library resource ownership. An open-source software movement is well under way. The library and academic communities are beginning to make headway regarding unrestricted access to open scholarly content. Carefully crafted XML solutions could go a long way toward breaking down technical barriers in the structure of information, and thus grease the machinery of the Web for information that matters.

Despite lacking a crystal ball, we will endeavor in this book to stir interest in XML and the role it can play in bringing products of the imagination closer to reality in libraries. The topics covered range from those of interest to beginners to some that are cutting edge. As libraries use XML more extensively, opportunities for collaboration and synthesis are bound to emerge. XML holds the promise that with broader use, libraries will one day be poised to offer much more than the sum of their individual efforts. Such success may well lie in the degree to which we can establish XML-based information standards. Achieving these can only come with expanded XML experience in libraries. The authors hope that this book contributes to that prerequisite.
It’s Elemental, My Dear Watson

The best thing about XML is that it is so flexible.
The worst thing about XML is that it is so flexible.

XML is fast becoming a foundational technology for the World Wide Web. The accelerating pace of its acceptance as the universal format for data and document exchange continues unabated. Due in large part to its simplicity, flexibility, and stability, XML provides an unrivaled mechanism to break down impediments to effective communication in the Web’s wildly diverse environment. The novel uses of XML have surprised even its developers. Librarians should find XML particularly attractive on account of its potential to introduce more order into this chaotic environment, permitting improved system interfaces, enhanced information retrieval, and greater integration of the disparate digital resources that are needed in order to serve the growing expectations of diverse user populations. As part of this book’s effort to elucidate XML’s value to the library community, this chapter explains the essentials of XML upon which the following chapters will rely.
WHAT IS XML?

XML (Extensible Markup Language) is a system for electronically tagging or marking up documents in order to label, organize, and categorize their content. The content in a document can consist of words, data, images, and so on. XML provides a standard method by which to structure content and the differing functions that content performs in a document. XML was created for the effective exchange of richly structured documents and data over the Web.

XML is related to HTML (Hypertext Markup Language), a simpler, less flexible language that was originally created to deliver text documents over the Web. HTML uses a limited, fixed set of tags to indicate basic text elements such as paragraph breaks, headings, fonts, etc., for electronic transmission and display. The predefined tag set of HTML was useful because of its simplicity, but it also eventually proved rather
limiting, since the tag set could only be expanded to include new markup elements by general agreement. HTML has thus gone through successive expansions (i.e., versions) of its tag set when it repeatedly proved inadequate for structuring the increasingly complex documents being exchanged over the Web. XML differs fundamentally from HTML in that it specifies neither a tag set (vocabulary) nor a system governing the meaning of particular tags (semantics). Instead, XML provides a set of specifications within which different publishers, authors, and document producers can create their own tags as a means to describe and organize their own particular content. As long as certain rules are obeyed, these tags can be read and processed correctly by any web browser no matter what computer system or software was used to create them. Both XML and HTML are derived from SGML (Standard Generalized Markup Language). SGML is a complex and powerful meta-markup language, i.e., one created to define other markup languages. SGML’s enormous breadth and flexibility have made it the standard way to structure documents in a manner independent of proprietary software codes and computer systems, but these same virtues made SGML difficult to implement and slow and unwieldy to use for the exchange of documents over the Web. XML is a subset of SGML that was developed for the faster delivery of structured content over the Web. Before delving further into the essentials of XML, a peek at its origins will help set the stage for the sections that follow on how it works.
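Because only the syntax is fixed, tags invented by any author can be read by any conforming parser. A minimal sketch (the element names here are made up for illustration):

```python
import xml.etree.ElementTree as ET

# A document using tags no parser has ever been taught.
doc = """<CatalogRecord>
  <Title>Putting XML to Work in the Library</Title>
  <Format>book</Format>
</CatalogRecord>"""

# The parser needs no knowledge of the vocabulary, only of XML syntax.
record = ET.fromstring(doc)
print(record.tag)                  # CatalogRecord
print(record.findtext("Format"))   # book
```

Had the document broken a syntax rule, say an unclosed tag, the same call would raise a parse error instead; these are the “certain rules” referred to above.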
MARKUP MINIHISTORY: PRELUDE TO XML

With its antecedents in the Graphic Communications Association’s GenCode developed in the late 1960s, the Generalized Markup Language (GML) resulted from the design efforts of Charles Goldfarb, Edward Mosher, and Raymond Lorie at IBM in the 1970s to demonstrate the usefulness of generic coding in document management. In the early 1980s, standardization efforts led to SGML, which further facilitated the production and sharing of electronic documents. In 1986 the International Organization for Standardization (ISO) ratified a standard for SGML. SGML solidified the convention of using strings of textual characters enclosed in angle brackets, i.e., tags, to demarcate other segments of text, which is useful in publishing and sophisticated text processing.

The need arose for a method of displaying documents as pages for the nascent World Wide Web (WWW), and in 1990 Tim Berners-Lee and Anders Berglund developed an application of SGML—in effect, a subset using a prescribed set of tags—called Hypertext Markup Language (HTML). This vividly demonstrated the power of markup. HTML was standardized as HTML 2.0 in 1995, and continuing improvements culminated in HTML 4.0 in 1997. HTML uses a set of predefined tags and other instructions to specify organizational and display elements such as paragraph breaks, fonts, margins, headings, columns, and so on. This presentational orientation fit the bill for creating web pages, but not for much more.
Increased experience led to increasing expectations. While HTML markup spread like wildfire, its limited set of tags caused developers to turn to SGML’s greater flexibility and comprehensiveness in attempting to deliver more elaborate content via the rapidly growing capabilities of improved web browsers. But the shoe did not fit. The World Wide Web Consortium (W3C), founded in 1994 to establish standards for the WWW, set up a working group in 1996 (including Jon Bosak, Tim Bray, Michael Sperberg-McQueen, James Clark, and others) to develop a slimmer version of SGML, one more attuned to web browsers’ capabilities, but retaining its power and flexibility. In 1998 their simplification effort resulted in the W3C’s recommendation: Extensible Markup Language (XML), version 1.0. A second edition, correcting minor errors, appeared in 2000, and the review period for the Candidate Recommendation of version 1.1 ended in February 2003 (W3C 2002). XML, a fully conforming derivative subset of SGML, retained most of SGML’s power and flexibility but removed many of its redundant, confusing, and overly complicated features, as well as those which had not proven useful in twenty years of use. Much of XML’s power and flexibility derive from its being a meta-markup language for text documents. Instead of establishing a predefined set of tags like HTML does, XML defines a framework for extensibly defining (i.e., creating) tags to describe and structure the content of documents, leaving their appearance or formatting to be handled separately.

XML’s rapid adoption rate mirrors that of HTML. The comparatively sloppy HTML 4.0 was almost immediately subject to unfavorable comparisons with the rigor and regularity of XML. In 1999 HTML was reformulated as XHTML, using XML’s syntax. Although XHTML was designed as a successor to HTML, it will necessarily take time for it to assume this role. XML has also proven a viable substitute for SGML in cases where the latter’s enormous vocabulary was overkill.
Despite XML’s simplicity, it retained most of SGML’s consistency, flexibility, and generality, the source of its power. The unexpectedly enthusiastic adoption of XML for data-intensive applications, particularly in the transmission of data as XML fragments, underscores its flexibility. At the rate XML is developing, markup may soon be taught in elementary school, most appropriately, along with the alphabet.
PICKING UP THE PIECES

Extensible Markup Syntax

XML’s success in data management stems partly from the parallel between markup and the use of fields in database records. Similarly, the markup of text provides added value to the user, much as indexing makes the content of a periodical known to researchers. Unindexed content effectively means unavailable content, and in the burgeoning digital world, not-marked-up content means virtually nonexistent content to many users. The format-oriented markup of digital materials à la HTML increasingly equates to unavailability, as documents are lost in a sea of ambiguity, with the good, the
bad, and the ugly treated equally without regard to content or quality. Advanced technology and valiant efforts to organize the Web, though impressive, have been thwarted from the outset by the Web’s emphasis on appearance and the lack of an effective method for conveying the meaning or content of a deluge of web pages. Trying to squeeze this additional functionality from anemic HTML documents, despite feature transfusions, has proven frustrating.

The developers of XML recognized the preemptive value of adding semantics to syntax at the source: in the documents, as created, rather than as an ex post facto activity. Not only would this help improve the retrieval of documents, but it also underpins more advanced functionality in web-based software applications. Furthermore, XML provided the potential for integrating ordinary web resources with heretofore-segregated database content, sometimes called “dark data.” As databases have proliferated, interfaces have also proliferated, along with the necessity of making repeated separate searches. Attempts to bridge the poorly matched components of disparate systems have met with only limited success. With some planning, XML-based databases, and XML output from traditional databases, permit the seamless merger of search results. XML’s role is a fundamental, integrative one at the infrastructure level.

Database fielding may be considered the first markup language for computers. It brought order and control to highly regular information, and the resulting predictability facilitated the development of effective retrieval and manipulation software. With little emphasis on presentation, ugly printouts were de rigueur. Somewhat differently, office systems were developed to improve the efficiency of the production and management of textual documents—with an emphasis on their format and presentation. In both cases, proprietary solutions have reigned until recently.
This is changing with greater recognition of the value of keeping the Web an open forum and a level playing field. Concomitantly, there is an increasing awareness of a world growing smaller and the greater need to share information more effectively. There is also recognition that one size does not necessarily fit all, and that marked-up information can be reformatted to serve a wider variety of needs more efficiently than generating separate, difficult-to-synchronize documents in each case. Such challenges are particularly acute for libraries, with their tradition of serving a broad array of users equally. It is telling that both the open-source software community and businesses large and small have embraced XML with such passion. How can something as simple as markup garner such wide interest and have such broad implications? Should libraries adopt XML with the same passion?
Markup and XML

A closer look at markup and XML will help answer these questions. Most ordinary documents are just a string of characters. Many humans recognize the meaning embedded in various character strings, but computers do not. In order to make selected strings of text addressable for machine processing, a markup convention is necessary to distinguish
one group of strings from another. XML is called “extensible” because instead of providing a fixed set of tags to accomplish this goal, it facilitates the creation of various markup conventions. It provides a standard markup syntax—i.e., a way of putting tags together—for doing this, allowing particular tags to be created and extended as needed. (The formal mechanisms for a particular set of such markup conventions to control document structure are covered in the “All in the Family” section later in this chapter.) Saying that a particular text is XML simply means that it uses XML syntax. Remarkably, XML syntax works for marking up all types of documents and data.

The result is twofold. In addition to the original strings of text (i.e., content), markup introduces tags to the document. Content and markup are quite different, but have a sort of symbiotic relationship—the textual content serving humans, and the markup serving computers, both of them living together in a mutually beneficial relationship. The most basic type of tag consists of a character or string of characters enclosed within a pair of angle brackets, e.g., <Date>. The most common type of markup, called an “element,” is used to define, describe, or identify a piece of content or a section of a document. This is done by enclosing the content within a pair of tags, called start and end (or opening and closing) tags, e.g., <Surname>Gonzalez</Surname>. (The start tag is identical to its corresponding end tag except for the slash in the latter.) An element thus consists of start and end tags and the content in between them. Markup can provide meaning to the life of an otherwise lonely, boring string of text. Consider seven technically complete XML mini-documents along these lines (the element names here are illustrative):

<Year>1597</Year>
<Price>1597</Price>
<Quantity>1597</Quantity>
<PartNumber>1597</PartNumber>
<StreetNumber>1597</StreetNumber>
<Pages>1597</Pages>
<RecordID>1597</RecordID>
The content “1597” in the middle is the “same” in each case, but the tags before and after this value add another dimension to it. Each of these documents is a single element. Even without markup, context often permits humans to ferret out the different meanings of a word or a piece of content. But anyone who has scribbled a note hastily, and found it later with no idea what the message meant, knows the value of markup. One has to wonder how the ubiquitous and consistent use of semantic markup (i.e., markup that identifies the meaning of content) would affect the prospects for improved retrieval on the Web. Consider the impact that the reliable indication of a document’s creation date or its author’s surname would have in winnowing search results. Metadata (i.e., data about data) initiatives are exploring ways of using XML markup to provide documents of permanent value with the improved access they deserve. By contrast, natural language processing has not proven a panacea, and although there have been
many impressive advances in free-text searching, librarians remain acutely aware of its limitations and how the problem grows with the increasing scale of the Web. Many are using XML as the vehicle to stem such problems. In addition to describing content itself, XML’s syntax provides built-in features for describing the structure and relationships between the various elements identified within the content. Complex data structures can be built from simple building blocks. XML itself is deceptively easy, but it is sometimes surprising how complex structures can get using such a simple palette. This is one reason why XML is so powerful.
Types of Markup

As already noted, all XML documents consist of content and markup. The content and markup in a document are handled in different ways by a parser, which is a program that reads XML documents, checks them for errors, and breaks them up for processing. The principal types of markup include elements, attributes, entity references (or entities), comments, character data (CDATA) sections, document type declarations, and processing instructions. Elements are the basic building blocks of XML and generally serve to identify the content they enclose. Attributes function much like adjectival modifiers, allowing the creation of subtypes of a particular element type. Entity references are used to disguise and allow the parsing of substitutes for some characters reserved by XML, such as < and >, and characters that cannot be keyed in a particular system. Comments are used for inserting into a document materials that are not intended for display or processing, such as editorial notes, explanatory material, and anything else that is not part of the content of the document but is useful in the editorial process. Character data (CDATA) sections are sequences of characters (including markup) that are not parsed. (CDATA contrasts with plain text enclosed between a pair of tags, which is known as PCDATA, or parsed character data.) A document type declaration is a set of rules specifying which elements, attributes, entities, and other markup types are allowed in a particular document. The declaration also specifies the order or sequences in which elements and other tags must appear in order for a parser to process them properly. Processing instructions provide a way for passing instructions to particular computer programs or applications. They are not actually part of the XML document, but the parser is required to send them on to the application. Each of the principal types of markup will be discussed in the subsections following this one.
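A single small document can exercise several of these markup types at once. The names below (Record, Title, Snippet, id) are invented for illustration:

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0"?>
<!-- a comment: visible to editors, ignored by applications -->
<Record id="r1">
  <Title>Barnaby &amp; Co.</Title>
  <Snippet><![CDATA[Raw text in which <tags> are not parsed]]></Snippet>
</Record>"""

record = ET.fromstring(doc)
print(record.get("id"))            # the attribute value: r1
print(record.findtext("Title"))    # entity reference resolved: Barnaby & Co.
print(record.findtext("Snippet"))  # CDATA kept verbatim, angle brackets and all
```

The parser resolves the entity reference, discards the comment, leaves the CDATA section unparsed, and exposes the attribute separately from element content, just as described above.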
Elements and Structure

To begin exploring the structure of XML documents with multiple elements, it is useful to divide them into two broad categories. (1) Those that consist of narrative text, such as books, articles, memoranda, letters, etc., are referred to as “text-centric.” (2) Those that consist principally of data, such as database records, tabulations, listings, etc., are called “data-centric.” There is no clear boundary between the two, and there are hybrids, such as directories, which are data published in book form. Nonetheless, the distinction will prove useful in describing various features of XML.
Although XML documents can look complicated, their structure is rather simple. They almost always consist of two parts: (1) a prolog, which serves to identify a document and may provide information about its structure and processing relationships; and (2) the root element, a pair of tags which enclose and contain all of the document’s actual content. XML documents consist of a hierarchy of named elements. Diagramming the elements of a simple XML document reveals an inverted tree structure. (See figure 1-1.) It begins with the root element “Name” at the top, branches into various other elements (“Person” twice in this case), and ends with leaf elements (“Last,” “First,” and “Dates”) at the bottom for the elements that will contain content values. The various branching points and the leaf elements are both known as “nodes.” The inverted tree shown in the figure can be visualized in various ways. One way is to visualize all of the components in any particular level of the tree as “nesting” within the relevant component on the next level above them. Linearly, a sequence of nested elements in parentheses can express this structure, with each opening parenthesis representing a start tag and each closing parenthesis an end tag. Each lower element starts and ends before the end tag of the element above it in the hierarchy, and is thus “nested” within it: (Name(Person(Last)(First)(Dates))(Person(Last)(First)(Dates)))
Alternatively, an outline structure can represent the identical elements:

Name
   Person
      Last
      First
      Dates
   Person
      Last
      First
      Dates
Figure 1-1 XML’s Inverted Tree Structure
To convert this into an XML document, one needs simply to change each element’s name into a start tag, fill in an appropriate content value for each leaf, and add a reciprocal end tag following each leaf value:

<Name>
   <Person>
      <Last>Clarke</Last>
      <First>Caitlin</First>
      <Dates>2001–</Dates>
   </Person>
   <Person>
      <Last>Clarke</Last>
      <First>Kevin</First>
      <Dates>1970–</Dates>
   </Person>
</Name>
XML documents usually display in the form above for clarity, although the following raw form conveys the same information to parsers:

<Name><Person><Last>Clarke</Last><First>Caitlin</First><Dates>2001–</Dates></Person><Person><Last>Clarke</Last><First>Kevin</First><Dates>1970–</Dates></Person></Name>
A simple outline structure thus forms the backbone of all XML documents. XML’s power derives from the simplicity of the resulting object-oriented hierarchy. Elements at each level of a hierarchy can be treated as “objects” for processing, carrying along their substructure(s) for the ride. This provides great flexibility and simplifies programming, thus reducing costs. Remember, XML itself does not do anything to the text or data. However, people and computer programs have an easier job doing things with them due to the structured markup.
Elements

Elements are the fundamental building blocks of XML and enclose all of the content in a document. They provide a descriptive name for content, identifying what a particular chunk of text or data is; for example, “title,” “color,” “price,” or whatever an author chooses to specify. They divide an XML document into its constituent parts. Unlike HTML, there is no fixed set of values; elements are created as needed for a given situation. Elements may be likened to the fields of a catalog record, although additional elements are used to organize such basic elements hierarchically. The major types of elements include root elements, container elements, and empty elements. A root element is the top-level element that encloses all the other elements comprising a document. Root elements, and any other elements in the hierarchy that have subordinate elements, are called container elements. The number of different levels of elements, and the resulting levels of nesting of subordinate elements within
container elements, determine the complexity and specificity of the tagging (and hence the structure) given to a document. An attribute may be added to an element to describe a specific characteristic or property, in the same way that an adjective modifies a noun in order to modify or particularize its character. Unlike most other elements, which enclose a piece of content, empty elements are used in markup without any associated content, and function as markers or placeholders for various purposes. Each of these types of elements will be treated in the subsections that follow, but first the topic of assigning names to elements will be addressed. A tenet of XML is that element names should plainly express the text or information represented. This is why XML has been called “self-describing.” As stated previously, elements are delineated by start and end tags using the markup syntax:

< >  </ >

and are distinguished by a contrived pair of identical element names within:

<Planet>  </Planet>

surrounding the actual text or other content:

<Planet>Mercury</Planet>
In this case, “Mercury” is the content or data value, and “<Planet>” and “</Planet>” are markup; the entire string comprises an ordinary XML element. The explicit naming of the element makes it clear that a planet, rather than an automobile, ballet, chemical element, ship, mythological character, etc., is represented here. Names should be clear, but not excessively long. Avoid names that are nonsensical, unpronounceable, difficult to read, or easily confused with other names. XML’s rules for naming elements are simple:

• Names may contain letters, numbers, and other characters.
• Names must not start with punctuation or a number.
• Names must not start with xml, XML, Xml, etc.
• Names cannot contain spaces.
Element names may contain most ordinary characters, including numbers, diacritics, and special characters. Punctuation is a little trickier; the underscore, hyphen, period, and colon are allowable (although the colon has a special significance, as discussed in the “Namespaces” subsection below). Blank spaces and other punctuation marks are prohibited. Do not use characters reserved for markup: angle brackets, ampersand, apostrophe, semicolon, slash, and quotation marks. One way to avoid worrying about the rules is to exclude punctuation and exotic characters from element names entirely. Although tastes vary, these styles of element names are typical:
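For instance (hypothetical names, not necessarily the original printed examples), styles such as these are all acceptable:

```xml
<title>          <Title>          <TITLE>
<call-number>    <call_number>    <callNumber>
```

Any one of these conventions works; the point is to pick one and apply it uniformly.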
Although technically acceptable, avoid cryptic or wordy styles:
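For instance (hypothetical names), a cryptic and a wordy rendering of the same element:

```xml
<tp>Public libraries in California</tp>
<TheTitleProperOfTheWorkDescribed>Public libraries in California</TheTitleProperOfTheWorkDescribed>
```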
These constructions include serious errors:
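For instance (hypothetical reconstructions), each of these fails to parse:

```xml
<1stEdition>1963</1stEdition>        <!-- name starts with a number -->
<Title>Public libraries</title>      <!-- start and end tag case differ -->
<Call Number>Z678</Call Number>      <!-- name contains a space -->
```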
Names cannot begin with numbers. The start and end tags’ case must match. Spaces are not permitted. It is best to choose a naming convention and stick with it to avoid constantly having to verify arbitrary format practices. XML’s encouragement of explicitly named elements contributes to its verbosity, making documents and files noticeably longer. Choose distinctive, succinct names and disregard their effect on a document’s length, as length has a negligible impact when processing XML. Someone who is unfamiliar with terse or idiosyncratic naming conventions for elements is more likely to encounter difficulties trying to decipher their meaning than to be concerned about their length. After all, naming serves to communicate, and sharing is what the World Wide Web is all about. The following elements partially describing one book seem straightforward:

<Author>Held, Ray E.</Author>
<Title>Public libraries in California, 1849–1878</Title>
<Publisher>University of California Press</Publisher>
<Year>1963</Year>
However, without considering the context, it is difficult to know if they are wise choices. More generality, such as using “PersonalName” instead of “Author,” and more granularity, such as using “Surname” and “Forename” as child elements of “PersonalName,” would support improved flexibility. Attributes, discussed later in this section, will also color the naming of elements. With the increasing scope of a project, naming elements becomes more of an art. The same name may be used for different purposes, but it is difficult to discuss two elements with the same name without clarifying the context each time. Changing names after they are in production requires considerable effort. Thinking ahead and planning modularly can help prevent element naming from becoming a quagmire, as more
complex data structures are built from simple ones. In the following discussions of the relationships of various types of elements and other aspects of XML, keep in mind the value of consistency and the dependencies inherent in XML’s hierarchical structure.

The Root Element
Each XML document must have a single root element that contains or encapsulates all the other elements comprising the document or record. Using the inverted tree analogy, the root element is the top element and all other elements are hierarchically subordinate to it. The root element establishes the boundaries of a given document. Its start tag begins the document, and its end tag is the last to occur in the document. All other elements are enclosed between the root element’s start and end tags. It is useful to distinguish narrative texts from database records in one’s choice of root element names. XML’s ability to handle both underscores its flexibility. In text-centric cases the root element’s name usually indicates the type of document involved, whereas in data-centric cases the name reflects a category or grouping of records, potentially constituting a database. The distinction is useful despite the lack of a clear demarcation between them. Instances of these two types of information have inherent structural differences and are also utilized differently, thus calling for differences in the application of various XML features. The following are typical root elements for text-centric documents:
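For instance (hypothetical names), text-centric root elements might look like:

```xml
<book> . . . </book>
<article> . . . </article>
<letter> . . . </letter>
```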
The following are typical root elements for data-centric documents:
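For instance (hypothetical names), data-centric root elements might look like:

```xml
<catalog> . . . </catalog>
<bibliography> . . . </bibliography>
<directory> . . . </directory>
```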
The single root element in an XML document is parallel to HTML’s use of <html> . . . </html> to contain an HTML document. XML’s root element is the archetype of a class of elements known as “container elements,” which are discussed next.

Container Elements
After the root element has been chosen, creating the remainder of a document’s structure consists of selecting and organizing additional elements to provide as much or as little “granularity” (a term denoting the degree of specificity used in identifying different chunks of content) as required for a particular purpose. XML’s hierarchical structure provides a powerful, generic method to do this.
Elements that have subordinate elements, like the root element, are called “container elements.” Container elements are also referred to as “parents.” Contained elements, which are completely enclosed by one parent, are called “children.” Children may be parents to other elements. Elements at the same level within a given container element are known as “siblings.” Siblings share the same parent element. All start and end tags for elements must nest correctly. When an element’s start tag is inside another element, its end tag must also be inside that element. If an XML document has been printed with indentations (i.e., in outline form), proper nesting can be verified visually by drawing a line from each start tag to each end tag. If the lines intersect, the elements are not nested properly. More levels of nesting may appear to add more complexity, but the increased granularity is desirable; include as much depth in a hierarchy as is needed to describe the information properly. The number of levels tends to increase with the scope of a project. Consider the following three ways of marking up one entry for a hypothetical register of cultural events. Electing to make this an “Event” element and lumping all the information is one option:

<Event>OK MOZART International Festival (18th : 2002 : Bartlesville, Oklahoma)</Event>
However, using a more granular approach has advantages. Changing the “Event” element to a container element and naming separate child elements will facilitate searching by “date,” sorting on “place,” etc. Omitting the punctuation offers further flexibility by permitting the data to be reused for various purposes without having to alter it directly. For example, excluding the number and redundant year when preparing a list of events held in 2002 might require different punctuation: OK MOZART International Festival (Bartlesville, Oklahoma).
The separation of content and display is a hallmark of XML. Stylesheets, which are discussed later, permit varying displays of the same data. Considering this, a second option treats “Event” as a container element and omits the punctuation:

<Event>
   <Name>OK MOZART International Festival</Name>
   <Number>18th</Number>
   <Date>2002</Date>
   <Place>Bartlesville, Oklahoma</Place>
</Event>
The required degree of granularity will vary depending on the intended use of the XML. For larger-scale projects, for the anticipated reuse of content, or simply when in doubt, it is best to provide as many levels of definition as seems reasonable. The additional levels of detail do not hinder manipulating the data at a particular container level. This illustrates XML’s object orientation, in which “Event” can be treated as a single object, or as discrete constituent objects as needed, especially for programming purposes. By contrast, if the detail is not represented separately in the markup, it cannot be
referenced. This limits subsequent processing, display, merging with other data, etc. Separate elements can be concatenated (combined) easily, but it may not be possible to split lumped values by algorithm. When considering how many element levels to create, it is important to analyze a wide variety of sample data or representative documents and to consider their potential or unanticipated uses. The old adage about an ounce of prevention being worth a pound of cure was never more relevant. This third option illustrates several levels of container elements:

<Event>
   <Name>OK MOZART International Festival</Name>
   <Number>18th</Number>
   <Date>
      <Year>2002</Year>
   </Date>
   <Place>
      <City>Bartlesville</City>
      <State>Oklahoma</State>
   </Place>
</Event>
The foregoing is not necessarily the “best” way to mark up this content, as its context is that of an example only. Out of context, markup decisions may appear arbitrary. In the foregoing example, the Year, City, and State elements may conform to what appears to be an invisible structure that is defined separately, with the omission of unused elements.
Entity References

XML predefines five entity references for its reserved markup characters:

&lt;      <      less than
&gt;      >      greater than
&amp;     &      ampersand
&apos;    '      apostrophe
&quot;    "      quotation mark
Entity references have various other uses, all of which involve using an entity to substitute for a literal string of text. For example, they are used to represent special characters that do not belong to the basic character set; they are used as a kind of shorthand to represent strings of characters which occur repeatedly in a document; and they are used to include the content of external files in a document. You can devise your own entity references to perform the aforementioned functions. But unlike the five XML character references already mentioned, devised entities must be predefined—i.e., their content must be associated with a particular string of literal characters—before they are referenced in a document. The content of devised entities is defined using a document structure known as a Document Type Definition (DTD), which will be discussed in a later section of this chapter. Devised entities that represent a literal string of text and are so defined in the document type declaration (DOCTYPE) are called “internal entities.” By contrast, entities that refer to external files or their contents are called “external entities.” External entities allow the contents of another file to be inserted into an XML document at the point of the entity’s occurrence. The definitions of internal entities are stored in the DOCTYPE, and those of external ones are stored separately. Once defined, entity references may be included at appropriate places in text to refer to the entity. Entity references are particularly relevant to libraries because they can be used to represent obscure or exotic characters that cannot be typed directly from one’s keyboard; the usefulness of this in bibliographic control should be obvious. A special class of entities called “named character entities” can represent characters that are not supported by a particular text editor, for example:
&nbsp;           nonbreaking space
&THORN;    Þ     uppercase Icelandic thorn
&thorn;    þ     lowercase Icelandic thorn
There are also numeric versions of character entities for referencing the Unicode character set. These come in two types: decimal references and hexadecimal references. The following examples show the decimal and hexadecimal references for the euro, with a “# ” indicating decimal values and a “#x ” indicating hexadecimal. Since XML
parsers recognize numeric character references, they do not need to be “declared” (i.e., defined by the author of a document).

&#8364;     €     decimal value for euro
&#x20AC;    €     hexadecimal value for euro
Despite all this, bear in mind that when using a schema and a text editor supporting Unicode, only XML’s five predefined character entities need be of concern.
Comments

Comments are a type of markup that enables authors to insert into a document materials that are not part of the document’s textual content. Comments are not intended for display, except during the editorial process, and they will be ignored by the parser when it processes the document. The inclusion of comments in markup is useful for documenting decisions, providing explanatory notes, flagging areas in need of review, communicating with others working on the same document, “turning off” or “commenting out” markup temporarily, and generally organizing large documents by providing conspicuous landmarks. XML provides a simple mechanism for this, the comment tag. Comments are enclosed using this start and end syntax:

<!--  . . .  -->
A few examples:
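These sketches (hypothetical content) show typical uses: a note, a review flag, and markup temporarily commented out:

```xml
<!-- Dates verified against the printed program. -->
<!-- TODO: this entry still needs review. -->
<!-- <Number>18th</Number> omitted until confirmed -->
```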
In order not to confuse parsers, comments must follow a few rules. Comments may contain any text or markup, except the double hyphen -- or a multiple-hyphen end, such as --->. Comments may appear anywhere in a document following the beginning XML declaration (discussed later in this chapter), except within other tags or other comments. The following construction, with a comment in the start tag of an element, is illegal:

<Event <!-- an illegal comment --> >
Two of these alone, elements and attributes, can be used to build substantial structures. They comprise the bulk of most XML documents and are the focus of most attention in deciding how to mark up a given corpus of text or data. Figure 1-2 illustrates the body of a text-centric XML document. It is an XML fragment, condensed to reveal the overall structure of a novel. This example was taken from the documentation of the Text Encoding Initiative (TEI), which is the de facto standard for scholarly work with digital text (TEI 2001). This fragment of a novel shows its division into two “books” (repeatable “div0” element), each of which is divided into “chapters” (repeatable “div1” element). These subdivisions are identified by XML attributes for “type,” number (“n”), and “id.” Note that the first book has a “head” and “trailer” and that each chapter also has a “head.” Omitted content is indicated by XML comments that are not actually a part of the original markup. The first chapter of the second book illustrates mixed content, indicating
“terms” set apart within paragraphs to enhance access to the content. This example illustrates how few elements and attributes may be needed to mark up a large text-centric document. It also formalizes the author’s structure, which is useful for printing and online display and allows flexibility in how these are executed. This and additional markup provide added value, while remaining unobtrusive. Not shown but implicit in this example is adherence to the TEI guidelines, an agreed set of XML markup to provide consistency across documents. The TEI’s promulgation of a flexible, common set of elements and attributes is key to data sharing and longevity, perhaps best stated in their introduction as: “In an online library, data needs to be nimble to survive” (TEI 2001). A second example, figure 1-3, illustrates a data-centric XML document. It is a bibliographic record for a book, with some subjects and description removed to save space. This example was taken from the online documentation of the Metadata Object
Figure 1-2 Annotated Text-Centric XML Fragment
Description Schema, or MODS (LC Network 2002c). It may not remain accurate, as the schema is still in development. MODS intends to carry selected data (a subset of fields) from existing MARC 21 bibliographic records and to enable the creation of original resource description records. Following a prolog consisting solely of the XML declaration, the root element “mods” contains the remainder of the record for the book. Some namespaces and the schema location are embedded in its start tag; these will be discussed in the next section of this chapter. The content of MARC’s numeric tags is regrouped using container elements: “titleInfo,” “name,” “publicationInfo,” “subject,” etc. The “publicationInfo” element further illustrates coded data being collocated with text values, e.g., “placeCode”
Figure 1-3 Annotated Data-Centric XML Document
and “place.” Attributes are utilized to represent “type,” “authority,” and “encoding” consistently on different elements. Those familiar with MARC will recognize that the statement of responsibility has been converted to a “note” element indicating this fact with a “type” attribute. Some MARC subfields have been consolidated as single XML elements, e.g., “title,” while others have been split into separate fields, e.g., “subject” and “classification.” The amount of markup required to describe this data-centric document is noticeably greater than that of the previous text-centric example, especially considering that this represents only a subset of MARC fields. An XML document rarely exists in isolation. The following section introduces how documents interrelate to each other in the context of XML’s versatile toolkit.
ALL IN THE FAMILY

XML is actually at the core of a family of specifications, each of which is optimized to deal with a discrete aspect of document management on the Web. The most important of these adjunct standards are well developed and stable; others exhibit various levels of maturity. That there are so many testifies to XML’s robustness and centrality. This separation of responsibilities indeed makes XML a foundation for efficiency. The following subsections serve as an introduction to selected key technologies that are closely associated with XML. Chapter 2 expands on these considerably, while chapter 4 covers related tools that are available for getting the most out of relatively simple XML documents.
Valid Documents

In addition to being well-formed, an XML document needs to be valid in order to be effective in a broader context. An XML document is valid when it specifies a particular structure and conforms to the requirements of that structure as expressed in a companion document, commonly a Document Type Definition or an XML-based schema. Validity ensures that an XML document adheres to the requirements of a specified document structure, thus assuring consistency across a group of documents conforming to the same structure. One of XML’s strengths is that it allows users to create their own markup languages within XML in the form of document structures. Document structures are variously referred to as schemas, document models, data models, content models, markup vocabularies/languages, XML grammars, and sometimes “applications” of XML. If a document using XML syntax follows a particular document structure, it is said to be an instance of that structure. When a particular model enjoys widespread use, it becomes a de facto standard, and with suitable endorsement, a formal standard. There are various competing mechanisms for creating formal document structures. Primarily, these are XML-based schema-creation languages and Document Type Definitions (DTDs), the latter being XML’s original document structure, derived from
SGML. DTDs and XML-based schemas vary in their ability to express constraints on document structure more elaborate and powerful than XML syntax alone permits, such as which elements are allowable, required, optional, etc.
Document Type Definitions

The Document Type Definition is essentially a grammar for any XML documents which invoke that DTD. A DTD “declares” (i.e., defines or specifies) each of the elements, attributes, and entities that will be permitted in a document, as well as the relationships between them; it basically forms a template for the logical structure of associated XML documents. The DTD expresses the hierarchy and granularity of data, allowable attribute values, and whether elements are optional, repeatable, etc. The DTD containing these specifications is associated with a particular XML document via the latter’s document type declaration, or DOCTYPE. Optionally, a DTD’s declarations may be embedded in square brackets within the DOCTYPE itself, although as a document collection grows this repetition is wasteful. Whenever a DTD is used as the document structure or model, the DOCTYPE follows the XML declaration as part of the prolog using this syntax:

<?xml version="1.0"?>
<!DOCTYPE rootElement SYSTEM "http://www.example.org/example.dtd">
The first set of tags in this example contains the XML declaration. The second set of tags contains the DOCTYPE; this declaration identifies the root element and designates the location or file name for the DTD, typically its URL (Uniform Resource Locator). The DTD itself, as in the following example, has its own syntax dating back to SGML:
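A sketch of such a DTD (hypothetical names, echoing the event markup shown earlier):

```xml
<!ELEMENT Event (Name, Place)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Place (City, State)>
<!ELEMENT City (#PCDATA)>
<!ELEMENT State (#PCDATA)>
```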
It is easy enough to recognize that the above example specifies five elements, and that the first is a container element having two child elements indicated parenthetically and specified separately.
XML-Based Schemas

While DTDs provide a great deal of control over what a document may contain, they do not provide the maximum degree of control over the structure of documents that have particular data formats; they do not provide datatyping (which specifies the kinds of data present, e.g., that a number is a dollar value), cannot allow for unconstrained (interleaved) element order, and so on. For more versatility in controlling the data that an XML document can contain, we prefer using XML-based schemas.
XML-based schemas accomplish much the same goal as DTDs in defining a particular document structure, but are more powerful. Their greater degree of control for data entry is particularly relevant here because of the complexity found in bibliographic data. Importantly, these schemas use XML syntax. This permits their creation and maintenance using the same XML editors that are used for ordinary documents. XML schema languages establish a document structure for expressing how to define other document structures, i.e., they define the rules for creating a schema using the elements, attributes, etc.—all in XML syntax. While XML-based schemas offer advantages over DTDs, the best way to define an XML grammar is still evolving. A Document Schema Definition Language (DSDL), integrating features of various approaches (e.g., grammar-based, rule-based, and object-oriented), is in development. Regardless of the approach taken, the end goal is a validated XML document—invisibly governed by an identified schema, but otherwise independent of it. It is mostly the degree and type of control over validation that distinguish the various schema languages.
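The W3C’s XML Schema language illustrates the point about datatyping; this minimal sketch (hypothetical element name) constrains an element’s content to a decimal number, something a DTD cannot express:

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <!-- A Price element must contain a decimal value such as 19.95 -->
   <xs:element name="Price" type="xs:decimal"/>
</xs:schema>
```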
Namespaces

One of XML’s strengths is the reusability of its data structures. It is often desirable to combine elements from two or more document structures in one document. However, because each document structure may be defined independently, there exists the possibility of ambiguity in element names. The XML namespace mechanism prevents such name conflicts by designating a unique prefix for each document structure referenced. The resulting so-called qualified element name identifies its namespace via a Uniform Resource Identifier (URI), which is declared in an attribute with this syntax:

xmlns:prefix=“URI”
Following the designation “xmlns,” a colon introduces an arbitrary prefix or local name for the namespace. The actual name of the namespace appears as the attribute value in quotation marks after the equal sign. This namespace attribute is often included in the root element, in order that all child elements will inherit it, but it may appear at any appropriate level. URIs serve as unique identifiers because they are not repeated across the Web. Parsers do not use them to look up information, although associated documentation may appear on the web page indicated. One namespace may be designated the default by omitting the colon and prefix. After each namespace is declared in this manner, elsewhere in the document the different prefixes followed by a colon introduce elements or attributes belonging to each namespace: prefix:element
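For instance, a record might mix a local vocabulary with Dublin Core elements (the local element names here are hypothetical; the Dublin Core namespace URI is the standard one):

```xml
<Record xmlns="http://example.org/catalog"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title>Public libraries in California</dc:title>
   <CallNumber>Z678 .H45</CallNumber>
</Record>
```

Here the default namespace covers Record and CallNumber, while the dc: prefix marks the element borrowed from Dublin Core.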
In addition to their role in disambiguation, namespaces serve as a useful grouping mechanism to help software applications work more efficiently. Complexity increases when more than one document structure is involved. (Chapter 2 contains details and examples of namespaces.)
Stylesheets

One of XML’s greatest strengths is the separation of content from presentation, or display. The content of well-formed and validated XML documents is consistently structured and thus will behave predictably. How the documents should look is another matter entirely—and is dependent on human and technical requirements, both of which are subject to change. Precisely for this reason, display markup is treated separately from semantic content markup in XML. Both work in tandem and must be coordinated, but treating each discretely provides great flexibility. This is especially useful in coping with unforeseen change, but also places fewer constraints on those working with XML documents for use in various contexts. This allows XML documents to be “repurposed” independently of their content markup. Librarians have a familiar example of the separation of content from display in many online catalogs. This capability of customizing once-fixed bibliographic displays in varying degrees by changing labels, reordering fields, etc., was added to various integrated library systems—each in a different way and with different limitations. XML offers more generalized solutions for presentation and display. These mechanisms allow not only catalog records but any documents and data in XML to be readily manipulated via separate stylesheets. The same document can be displayed in as many different ways as different stylesheets are written for various purposes, and as precisely as the XML document itself is defined. Librarians can reuse their knowledge of a stylesheet language from project to project, and from library to library. Other users of documents created by libraries will also have an easier time incorporating them into their environments. A stylesheet is a document that contains the layout and other display specifications—fonts, columns, headings, tabs, margins, etc.—for a particular class or group of documents.
There are several different stylesheet languages (rules for creating stylesheets) and related tools for doing practically anything pertaining to display, as well as performing transformations with XSLT (XSL Transformations) in order to slice, dice, and reassemble documents in whatever manner is required. Modularization plays a large role in this flexibility.
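A stylesheet is itself an XML document. As a sketch (assuming the Name/Person document from the beginning of this chapter), this XSLT stylesheet would render each person as an HTML list item:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <!-- Turn the Name document into an HTML list of "Last, First" entries -->
   <xsl:template match="/Name">
      <ul>
         <xsl:for-each select="Person">
            <li>
               <xsl:value-of select="Last"/>, <xsl:value-of select="First"/>
            </li>
         </xsl:for-each>
      </ul>
   </xsl:template>
</xsl:stylesheet>
```

A different stylesheet applied to the same document could just as easily produce a table or a sorted index, without touching the data itself.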
XHTML

As stated previously, the separation of presentation from content makes XML very flexible. An XML document may be accompanied by a custom stylesheet to handle its display. However, using a predefined set of tags, it can also rely on the stylesheet built into web browsers. This approach involves XHTML (Extensible Hypertext Markup Language), a recent standard combining HTML’s tags and XML’s syntax to accomplish the same display functions as in HTML. Because of this origin, HTML can be converted to XHTML by following a small set of guidelines. Whether created manually, converted from HTML, or transformed from an XML document using XSLT, the resulting XHTML document is ready to display on the Web without a separate stylesheet. HTML’s familiar set of predefined tags is an SGML grammar defined by a DTD. XHTML, defined by another DTD, is a reformulation of HTML 4.01 as an XML
grammar. This is a good example of XML’s being a meta-markup language, one used to define other markup languages. XHTML is more rigorous than HTML; it conforms to XML, is readily viewed, edited, and validated with standard XML tools, and inherits XML’s benefits. To illustrate the parallels of the two ways of marking up display, consider this HTML document: Lane Medical Library . . .
Pay Library Fines with Food DonationsFor the month of December we are accepting food for unpaid fines. . . . [posted 2002-12-02]
Remote Access to Online Content Now Available!Stanford users can now gain access . . . [posted 2002-11-27]
Here is the “same” document marked up using XHTML: Lane Medical Library . . .
Pay Library Fines with Food DonationsFor the month of December we are accepting food for unpaid fines. . . . [posted 2002-12-02]
Remote Access to Online Content Now Available!Stanford users can now gain access . . . [posted 2002-11-27]
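To give the flavor of the contrast, here is a hypothetical reconstruction of the two versions, abbreviated to a single news item (the original printed markup differed in its details):

```xml
<!-- HTML version: uppercase tags; unclosed <P> and <BR> are tolerated -->
<HTML>
<HEAD><TITLE>Lane Medical Library</TITLE></HEAD>
<BODY>
<P><B>Pay Library Fines with Food Donations</B><BR>
For the month of December we are accepting food for unpaid fines.
</BODY>
</HTML>

<?xml version="1.0" encoding="UTF-8"?>
<!-- XHTML version: prolog added, names lowercased, every tag closed -->
<html>
<head><title>Lane Medical Library</title></head>
<body>
<p><b>Pay Library Fines with Food Donations</b><br />
For the month of December we are accepting food for unpaid fines.</p>
</body>
</html>
```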
In this case an XML prolog was added, capitalization adjusted, and a few tags closed. While HTML and XHTML are similar, the differences between them are critical. Following these guidelines should ensure valid XHTML:

• The XHTML root element is <html>. Also mandatory: <head>, including <title>, and <body> elements
It’s Elemental, My Dear Watson
35
• • • • • •
Documents must be well-formed, with nested pairs of tags Elements must be properly nested, i.e., no overlapping tags Non-empty elements must have end tags; for example, Empty tags must be closed: not just Element and attribute names must be in lowercase Attribute values must be present and quoted; for example, nowrap=“nowrap” • Replace the name attribute with the ID attribute for the following elements: a, applet, frame, iframe, img, and map • Replace reserved characters with entity references; for example, & becomes & • A mandatory must precede the html markup The easy part is that XHTML uses the same syntax as XML because it is XML. The tags are the same as HTML so that web browsers can handle them. This has permitted XHTML to enjoy a fair degree of backward compatibility with HTML. For details and future directions of XHTML, consult the full specification (W3C 2002b). Compatibility, as the next subsection stresses, is necessary to keep all the parts of the Web working together in concert.
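Taken together, these guidelines yield a skeleton like the following minimal sketch (the XHTML 1.0 Transitional DOCTYPE is shown; the title and paragraph content are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>A Minimal XHTML Page</title>
  </head>
  <body>
    <p>All element names are lowercase, every tag is closed,
       and the mandatory head, title, and body are present.</p>
  </body>
</html>
```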
Browsers and Viewing

One of the motivations for developing XHTML was that HTML was becoming overburdened with additional markup tags and other “enhancements” that it was never intended to support. The companies making web browsers introduced product-specific “bells and whistles” in their efforts to win market share. When these features were used, HTML documents were prone to display problems when viewed in a competing browser. XHTML does not permit such incompatibilities, highlighting XML’s focus on interoperability. Web browsers have continued to add support for XML since Internet Explorer 5 first supported XML display. Version 5.5 of Internet Explorer supports XHTML display, as well as XML when there is a client-side stylesheet; it provides an internal default stylesheet for cases where no stylesheet accompanies an XML document. Netscape 7 and Mozilla 1.0 both support XHTML display and will display XML documents when there is an accompanying stylesheet. Netscape 6 has limited XML support, but can display XHTML when some concessions are made, such as adding a space to empty element names; for example, <br /> instead of <br/>. Using the latest browsers that support XHTML and server-side transformations of XML into XHTML output eliminates the need to worry about client-side stylesheets. Until the newer web browser versions are widely deployed, transformations of XML documents into HTML ones provide a transitional solution.
The availability and continuing improvement in XML support in web browsers allow librarians to use industry standard solutions. This implied need for broader XML adoption in libraries is dealt with in chapter 3.
X Whatever

The number of specifications and specialized tools in the XML family is a gauge of XML’s success. The benefit of having these tools is their generality. They focus on solutions to generic aspects of managing information on the Web, rather than problems specific to a particular field, such as library science. The same tools can be used with any XML document or application based on them. The common syntax and tools in the XML family also improve prospects for cross-community sharing. Beyond the core tools, there is an array of technologies competitively testing their way to superiority in addressing various problems. Many of these relate to processing XML documents. Some are niche solutions; others may coalesce to produce a new flavor, as has happened with some schema languages. There are tools for navigating and searching XML documents and document collections. These include XPath, which is used for referencing and navigating XML’s hierarchical structures; XLink (XML Linking Language), for hypertext linking between documents; and XPointer, which builds on XPath, for defining links within documents without having to add explicit anchors as in HTML. XLink also allows a single link to reference multiple related documents. There is also XQuery, for honing searches based on granularity of markup. Reassuringly, new tools continue to emerge, permitting the ever more sophisticated management of XML.
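As a small taste of these linking tools, an XLink can be attached to an ordinary element using attributes from the XLink namespace. The citation element and URL below are invented for illustration; only the xlink-prefixed names come from the XLink specification:

```xml
<citation xmlns:xlink="http://www.w3.org/1999/xlink"
          xlink:type="simple"
          xlink:href="http://www.example.org/articles/anemia-study">
  A related article on the same topic
</citation>
```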
LIBRARIES’ STRATEGIC OPPORTUNITY

Despite its high-tech mystique, the XML family of technologies offers relatively simple building blocks that can be combined to develop elegant solutions to complex problems. This toolkit approach bestows the advantages of modular solutions, which are particularly well suited to the ever-changing digital environment. Librarians are in a unique position to balance tradition and innovation. We have a comprehensive and long-term perspective, and we have impeccable values. For our profession to succeed, however, we must modernize our current technical infrastructure. While XML cannot solve all of our problems, it does offer foundational tools to help transform the way libraries do business. This can help prevent libraries’ obsolescence and nudge us into the digital mainstream, leaving us better positioned to serve our users. Librarians may well wonder where the long and winding information highway will eventually lead us. Many “innovations” will prove to be only transient. What is the role of libraries in the new digital environment? How do librarians’ high standards and cherished values (such as impartiality, trust, confidentiality, thoroughness, and lack of a commercial bias) fit into a digital world, where anyone can set up an information shop overnight and where big information interests apparently think that information can be
branded? Coping with proliferating digital resources can be frustrating. Despite some progress, librarians continue to face:
• the segregation of digital content by provider
• a multitude of incompatible interfaces to similar content
• limited flexibility of integrated library system interfaces
• problem-prone intersystem linkages
• users’ known reluctance to search multiple systems
• confusion in the bibliographic control of digital materials
• inadequate funding for digital resources
“Library information,” especially that in time-honored MARC formats and in proprietary integrated library system formats, has been segregated too long from mainstream web resources. Having an online library catalog isn’t good enough anymore, as users are likely to search other, more comprehensive resources first, and turn to separate catalogs only reluctantly. Nor does licensing and listing a few incompatible aggregations of resources on a web page make a digital library. How does XML address such limitations? XML solves a whole slew of problems in managing information on the Web. Whereas previously HTML, having one inflexible document structure, had to serve all purposes, XML takes a modular approach. Now, XML for display/presentation, i.e., XHTML, can homogeneously serve this role. SGML’s power without its complexity has been “ported” to the Web for content management as XML proper, allowing the flexible development of user-defined document structures. Both text and data content, and their appearance, can be defined separately using the same markup language. These documents can be stored and transmitted in robust, persistent, non-proprietary, and verifiable data formats for flexible use and reuse. With the coordination of document structures and stylesheets, XML can be communicated ready to serve. The manageability of information is especially important to libraries. While we may have a degree of control over much of our library data, currently we cannot really wield it effectively. Libraries, having pursued integrated library systems for decades, are coping with a digital divide. Traditional resources are segregated from digital ones, and digital resources are arbitrarily fragmented from one another. The venerable book offers a simple model for organizing a body of information. Content is carried in a linear sequence of text, tables, graphics, etc., punctuated with links to other information in the form of bibliographic references. 
This content is sandwiched between (1) an introductory table of contents and list of illustrations that provide a general user interface or portal and (2) an index that provides specific search access. This front-end versus back-end approach to accessing content finds some parallels in libraries’ public services (which are often responsible for organizing web pages leading to library resources) and their technical services (which are responsible for producing a database with indexing terms to improve searching for library resources). It is a challenge to coordinate such efforts, especially with the proliferation of digital
content. Both approaches, as well as random access to text in the middle of our hypothetical “book” via keyword searching, are equally valid, depending on a user’s needs at a given time. The trick is to provide all three avenues in a unified way to optimize access to information. Accomplishing this for the resources that libraries provide access to is far easier when all the components can be represented consistently—as books are able to do on a smaller scale. On a larger, digital scale, the foundational technology, the unifying tool, the glue for holding all the parts together, is XML. While XML will not solve all of libraries’ problems, especially the political and budgetary ones, it offers the strategic potential for libraries to selectively impose control on certain information resources in an otherwise haphazard digital environment. Having predictable flexibility and reusability is not inherent in XML; it requires agreed-upon schemas to permit this to be done readily and reliably. XML makes such coordination and sharing possible, but the onus remains on the library profession to adopt various document structures that, while separate, are intended for coordinated use. Rather than solutions that work in only one system or context, the same solutions can be incorporated in different systems or contexts. One example is user data. Circulation systems were designed to handle user data for limited purposes. Libraries have limited flexibility in trying to use or extend this data, or a subset of it, for new purposes, such as establishing book clubs, developing mailing lists, or controlling access to licensed digital materials. Differing patron record formats and lack of control over user data in proprietary systems present challenges to those trying to address local or emerging needs. Sometimes a local field or two is permitted in patron records, but more often, separate overlapping or redundant solutions follow the path of least resistance. 
A standard record format that could serve multiple purposes would simplify matters. Core elements need not be reinvented over and over with slight variations; decades of experience reveal patterns that, if agreed upon, could be reused. Because XML is extensible, it is possible to incorporate local or vendor-specific features into it without disrupting a core set of agreed-upon data elements that are held in common. It is somewhat like an erector set, allowing many different constructions from the same pieces. The difference is that with library information the pieces are more complex. XML’s simplicity belies its extraordinary potential to allow the controlled yet flexible deployment of new solutions to information problems as opportunities arise. The literature is full of oft-repeated good reasons for using XML. These generic reasons resonate when it comes to solving library problems, as attested by the four-page “XML: Libraries’ Strategic Opportunity” (Miller 2000), for example. The following recap of various aspects of XML’s generic suitability in a wide variety of situations focuses on ways that specifically relate to libraries.
Generic Aspects of XML

Non-Proprietary / Interoperability / Platform Neutrality
Documents marked up in XML are application-neutral and can be manipulated on Macintosh, PC, Unix, or Linux platforms. This eliminates the technical problems that
are often the reason why different “integrated” components do not quite work together as advertised. XML facilitates sharing because the sending and receiving systems need not be the same. XML functions like a neutral third party. XML is essentially a standard for creating standards—a universal format for data and document exchange. For this reason, it is often called the lingua franca of the Information Age. XML is a key component of the open systems movement, allowing different libraries and organizations (both nonprofit and for-profit) to share information more readily, since the format is neither proprietary nor limited to a particular domain; this is often cited as a basis for cost savings for all concerned. However, reaching agreement and achieving the widespread adoption of standards remain the chief concern. One thing is clear: proprietary solutions do not facilitate industry-wide communication. One example recognizing the value of interoperability comes from the book and subscriptions commercial sector. Evolving from Electronic Data Interchange (EDI), ONIX 2.0 was released in 2001. This is an international standard for representing and communicating book industry product information in electronic form and has been mapped to MARC field equivalents. A sibling standard, ONIX for Serials, covers serial title information, serial items, and packages. Both standards are developed and maintained by EDItEUR, a group coordinating standards infrastructure for electronic commerce in these areas (EDItEUR 2002). The range of emerging XML-based standards is remarkable. MathML enables mathematical notation and content to be served, received, and processed on the Web. VoiceXML has made possible speech-enabled web applications. The Geography Markup Language (GML), promulgated by the Open GIS Consortium, supports descriptions of both the geometry and properties of geographic features. 
In addition, there are AML (astronomy), CML (chemicals), MML (music), ProML (proteins), and BSML (biosequences). These are in varying stages of maturity, but are indicative of a broad trend toward XML-based standards. Library information is needed in all of these contexts as well as its own. For libraries to be most effective on the Web, XML’s widespread use makes it the obvious choice as the universal format for document and data exchange. (Consult the introduction to chapter 3 for a discussion of the need for a suite of schemas for library information.) Perhaps due to the diversity of standards within an increasingly unified Europe, greater emphasis on standards and the need to coordinate them across domains is becoming apparent. One metadata review covering the audiovisual, cultural heritage, educational, and publishing domains provides concrete examples useful for librarians in considering a broader perspective on interoperability issues (Forum 2002).
Data Longevity / Data Persistence / Future-Proofing
Terms such as “data longevity” and “data persistence” are based on XML’s neutrality. Inevitable changes in hardware, software, and networking need not make data or document structures obsolete. XML offers insulation against such changes by having a platform-independent syntax, by using a universal character set, and by separating content from functionality. Eye-readable documents not dependent on particular technologies are
more likely to persist over time. The wide use of XML ensures that future systems will need to support reading this standard markup even after newer technologies arise. Word processing illustrates how proprietary formats make it difficult to maintain and share documents over the long term. How many formats for the same information are really needed? What becomes of documents created with software, or software features, that are no longer supported, e.g., VisiWord? Converting between some formats is possible, but with increased expense for developers and with additional effort by users. It is telling that new XML-based word processing products are beginning to appear, e.g., OpenOffice. Significantly, the next version of Microsoft Word (as part of Office 11) will be XML-based (Microsoft 2002). Whether software products can compete directly on the basis of functionality, rather than relying on exclusive formats that tend to hold information hostage, remains to be seen. The parallels in audiovisual formats underscore the difficulties. For preserving documents, XML’s neutrality currently makes it the best bet.

Separation of Content from Display and Functionality / Repurposing
Separating content from display facilitates using XML documents for more than one purpose. Often the same or similar information needs to be combined and presented differently. XML’s own modularity serves as a model for solving problems in this way. There are many parallels in word processing; for example, using a merge feature to incorporate addresses from a list into a letter, treating a letterhead as a separate entity, changing styles from block to indented without rekeying, etc. XML generalizes this divide-and-conquer technique; it just requires planning and coordination from the outset to be effective. Quick and dirty solutions are just that and will likely be more costly in the long run. Similarly, solutions that work well in one proprietary system often do not transfer well to other systems. Moving documents between systems should be transparent. System design and maintenance should benefit from products (documents and data) being independent of the software used to create them. Competition should be decided on software merit, rather than by the format lock-in that hampers choice and flexibility. Keeping the independence of data in mind when making system decisions is important. Library systems will continue to change, but library databases are of more permanent value. How difficult or easy it is to move data between systems depends on their design. How much past effort had to be discarded when switching systems? Getting data out, for a special purpose or to change systems, should be as much a concern as getting data into a new system. Neutral, XML-based formats for all data belonging to the library would benefit vendors and libraries alike, eliminating the need for quirky application program interfaces. Librarians can develop and promote the use of standard formats for external systems, but they can also benefit more immediately from XML’s separation of content from display and functionality.
It is mostly a matter of looking at documents and data from the following perspective. Addresses are addresses, whether for patrons, vendors,
or interlibrary loans. When a new need arises (e.g., managing passwords), is another format in order? Consider how different short and long bibliographic displays, new book lists, overdue notices, and bibliographic citations are; yet they all reflect similar information about library resources. The reusability of information depends on consistency. Defining that information in a larger context increases the likelihood that it can serve more than one purpose. Using XML increases the likelihood that information created for one purpose can be leveraged for other purposes.

Extensibility
The X in XML stands for “extensible.” This means that we in the library profession can define standards flexibly. Libraries can add further elements and attributes to meet local needs; likewise, vendors can add their own elements to an industry standard. Namespaces provide a noninvasive mechanism to do this. Adding a local element to a patron or order record should be possible without disrupting the elements required for particular software to function using the same record. The additional information fits seamlessly into the syntax and can be carried along unobtrusively for the ride. Gradually extending standards to cover all types of library data need not be traumatic, since XML’s extensibility makes it easier for standards to evolve as needs change.

Advanced Linking Techniques
XML’s linking techniques go beyond simple hyperlinking. XPath, XLink, and XPointer provide mechanisms to link between and within documents without having to alter documents to contain anchors. This can help accommodate the increasing emphasis on relationships in bibliographic data and permit authors to reference specific parts of documents. (See chapter 2 for treatments of these technologies.)

Universal Character Set / Internationalization
Unicode, XML’s fixed character set, provides a single, comprehensive repertoire of characters. This and platform neutrality are key to XML’s support of internationalization, another type of repurposing. Stylesheets and XSLT transformations can be used to present language-specific versions of data based on user-selectable preferences. For libraries, this particular strength of XML holds promise in online catalogs, particularly their authority control mechanisms. The following example illustrates a title in various languages. Attributes indicate the language of the element and the Romanization scheme when applicable. In the second Chinese example, entity references have been used to illustrate that it is possible to enter data not otherwise keyable in a particular system. XML’s predefined “xml:lang” attribute is also available for indicating the language of any element. Display fonts and related data entry issues are likely to be rapidly resolved now that the characters have been standardized.
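The multilingual example itself did not survive reproduction. A sketch consistent with the description follows; the Title element and its lang and romanization attributes are invented for illustration, and the numeric character references in the last line encode the same Chinese characters as the line above it:

```xml
<Title lang="eng">Dream of the Red Chamber</Title>
<Title lang="chi" romanization="pinyin">Hong lou meng</Title>
<Title lang="chi">紅樓夢</Title>
<Title lang="chi">&#x7D05;&#x6A13;&#x5922;</Title>
```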
Joining the Mainstream
All of the reasons for embracing XML relate to its promoting generalized compatibility within the internationally shared web environment. This extends to bridging the gap with non-XML resources. In particular, XML has been used to develop interfaces to many types of databases, including legacy ones. It is fairly simple to define an XML output format for the records stored in a relational database. Because XML database output shares the same syntax as an XML text document, the two can blend readily. This is particularly useful as a transitional strategy in knitting together incompatible information resources.
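For instance, a single row from a relational table of journal holdings might be emitted as a fragment like the one below. All of the element names and values here are invented for illustration, not drawn from any particular system:

```xml
<holding>
  <journalTitle>Annals of Internal Medicine</journalTitle>
  <issn>0003-4819</issn>
  <coverage>v.100 (1984)-</coverage>
  <location>Lane Medical Library</location>
</holding>
```

Because this output is ordinary XML, it can be transformed, merged with other XML documents, or styled for display with the same tools described elsewhere in this chapter.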
XML offers a vehicle for libraries to join the web mainstream. XML has the potential to provide a unifying technical infrastructure so that library information can be available directly on the Web. XML allows content to be delineated, essentially resulting in self-describing information; for example, knowing where the price is on a web page versus just knowing that some text should be bold makes an inestimable difference. XML is inherently flexible, separating the markup of this named content from display instructions. This allows XML documents from various sources to be more easily combined without anticipation of the need. It addresses information management at a fundamental level, allowing this same technology to accommodate bibliographic data and order transmission to vendors, for example. This generality will permit the convergence of all library data to one standard, eventually supporting all components of integrated library systems, and will allow them to integrate with other XML-compliant library systems and web resources beyond the library. XML and open-source software can serve as enabling technologies that allow us to participate more effectively in defining our future. Sharing an infrastructure is the first step toward a distributed, integrated international resource, the sum of which would be far more valuable than its parts.
“The Nice Thing about Standards . . .”
Andrew Tanenbaum, in his book Computer Networks, ironically observes that “The nice thing about standards is there are so many of them to choose from” (Tanenbaum 2002). This is especially true of new technologies, like XML, that are still in the innovative stages of their development. Fortunately, there are several international organizations that attempt to provide some order to the myriad of competing XML-related technologies. These organizations release their findings as a “recommendation,” suggesting that a technology is developed enough for use in a production environment; or as a “proposal,” announcing that input on a potential recommendation is being sought. The two main organizations that govern related technologies in the XML community are the W3C, or World Wide Web Consortium (W3C 2003c), and OASIS, the Organization for the Advancement of Structured Information Standards (OASIS 2003). The W3C is the more important of the two, and manages most of the core XML technologies. OASIS is worth noting because it promotes a variety of other structured information standards and manages many of the more innovative XML-related technologies.

One of XML’s greatest strengths is that the learning curve needed to begin working with XML documents is very low. At its most basic level, XML is a simple markup language that uses elements and attributes to describe the syntactical characteristics of a document’s content. Fortunately, the fact that XML is easy to learn does not mean it sacrifices power or flexibility for that simplicity. Unfortunately, the fact that XML is easy to learn does not mean that doing absolutely anything you want with an XML document or data fragment will necessarily be easy. While the formalized structure of XML is easy to understand and use, XML-related technologies often provide more processing power at the expense of this simplicity.
This chapter attempts to introduce the concepts of these other technologies by working through several simple examples. Once the reader is ready to begin experimenting with these technologies, the tools in chapter 4 may prove useful.
GETTING STARTED

XML’s uniformly structured markup allows the contents of any given XML document to be processed independently of the specifics of that document. For example, a catalog record that is marked up in XML may be processed by the same tools, and in the same way, as a library web page or a library’s serials list. This generalization of syntactical rules makes XML flexible. XML’s related technologies extend this flexibility to process an XML document or fragment in different ways. An XML document may be processed for display using XSLT, the Extensible Stylesheet Language for Transformations, or CSS, the Cascading Style Sheets language. It may be validated through the use of a DTD (Document Type Definition), an XML Schema, or a RELAX NG schema. Documents and data fragments may be linked together by an XLink or may have their internal structure referenced with an XPointer. In addition, text marked up in XML may be manipulated programmatically using SAX, the Simple API (Application Programming Interface) for XML, or by using DOM, the W3C’s Document Object Model. This chapter does not attempt to cover every XML-related technology. Technologies are selected for this chapter because they have been proven through widespread use (e.g., XSLT) or because they are essential for the long-term success of XML (e.g., XLink). This chapter does not cover technologies, like SAX and DOM, that require programming knowledge. Though many librarians know how to program, most do not. The authors believe, given the limitations of a single chapter, that it is better to address the majority of the readers. Those librarians interested in using technologies like SAX and DOM are encouraged to find information on each of these on the Medlane website (Lane 2002).
WHAT’S IN A NAME(SPACE)?

Namespaces are an idea that XML inherited from programming languages like C++ and Java. They are not found in the current 1.0 XML specification, but are supported, and even required, by almost every XML-related technology. They will be a part of the new 1.1 XML specification once it is released as a recommendation. To use XML-related technologies, and to be prepared for the new XML specification, readers are encouraged to learn about namespaces now. The role of namespaces in XML, as in the other languages, is to provide a way to avoid the conflict of commonly used names for elements and attributes. Different XML namespaces are identified by different “xmlns” attributes. If more than one namespace is defined for a document or XML fragment, those elements and attributes that share the same namespace use a namespace’s prefix throughout the document or XML fragment. A namespace prefix, which is defined in the part of the namespace attribute’s name that follows the colon (e.g., the “xsl” in xmlns:xsl), precedes the name of any child element or attribute that belongs to that namespace (e.g., xsl:template).
The following example demonstrates what a namespace looks like, how it may be defined in the root element, and how elements that belong to a defined namespace are identified using the namespace prefix. The stylesheet and template elements in the example belong to the XSL namespace; the other tags are ordinary HTML tags that do not belong to a defined namespace. (Those familiar with XHTML, the well-formed version of HTML, might know that the listed HTML tags could belong to the XHTML namespace. In this example, they do not because the XHTML namespace is not defined.)
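The example’s markup did not survive reproduction; a reconstruction consistent with the description might look like this. The xsl:stylesheet and xsl:template elements carry the XSL prefix, while the HTML tags carry no prefix (the surviving text strings from the original example appear as the page title and paragraph):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <head><title>A Simple Namespace Example</title></head>
      <body>
        <p>XSL tags are distinguished from HTML tags.</p>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```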
Why are namespaces beneficial? Take, for instance, a “Name” element. An element named “Name” probably occurs in at least 50 percent of the XML documents on the Web, and, more than likely, it means something different in the context of each. In one document, Name might represent the name of a library patron; in another, Name might represent the name of a rare book publisher. If one library is creating both documents, the conflict can be avoided by naming one element PatronName and the other element PublisherName, but, in the more common case, where two different organizations are independently creating the potentially conflicting element names, this avoidance is not possible. It is in cases like these that namespaces prove useful.

Keep in mind that namespaces are not required in order to use XML. If a library is creating an XML document for internal use, and that document is not intended for use with XML documents developed outside of the library, it is possible, and maybe even desirable, to exclude the namespace attribute from the XML document’s root element. Choosing not to define a namespace may speed up processing XML documents and will certainly reduce the clutter in a document. Namespaces provide a way of avoiding conflicts between elements created by different institutions or organizations. The namespace specification dictates that elements and attributes be identified by a unique Uniform Resource Identifier (URI) that will, for example, distinguish between a publisher’s name in a schema published by an online book retailer and a publisher’s name in a schema published by a library. In practice, the URI used to make this distinction almost always consists of the authoring organization’s URL, or Uniform Resource Locator. This is acceptable because URLs are a subset of the URI standard; they are also unique to a particular organization. A namespace that is not unique is useless because it will not necessarily distinguish one set of elements from another.

XOBIS, the Lane Medical Library’s bibliographic and authority information schema (Lane 2002b), is a good example of namespaces in action. To ensure that the schema for bibliographic information developed at the Lane Library does not conflict with the one issued by the Library of Congress, Lane uses a Lane-specific URI to distinguish its schema elements. The comparison below shows the root element from the XOBIS schema and the root element from another library’s schema. Did you notice that the default namespace prefix is omitted? This is acceptable according to the W3C namespace specification. As long as a document has only one namespace, the namespace prefix is optional; since the example shows two separate documents, each having its own namespace, namespace prefixes are not required.
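The comparison’s markup did not survive reproduction; reconstructed from the two namespace URIs named in the surrounding discussion, the root elements presumably looked something like this (each uses a default namespace, so no prefixes appear):

```xml
<!-- root element from the XOBIS schema -->
<Record xmlns="http://medlane.stanford.edu/ns/xobis/1.0">
  <!-- . . . record content omitted to conserve space . . . -->
</Record>

<!-- root element from another library's schema -->
<Record xmlns="http://www.biglibrary.com/marcInXML">
  <!-- . . . record content omitted to conserve space . . . -->
</Record>
```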
As a result of the distinct namespaces in the example above, the Record element that is defined in the XOBIS schema will not be confused with the Record element defined in the other library’s bibliographic information schema. To an XML parser (the software used by other XML-related technologies to process XML), the XOBIS Record element looks like {http://medlane.stanford.edu/ns/xobis/1.0}Record; the other library’s Record element looks like {http://www.biglibrary.com/marcInXML}Record. It is possible, and valid, to include a namespace on each element in an XML document, as a parser would, but using namespace prefixes keeps the document cleaner and easier for a human to read. The next example shows how the namespaces from the above example can be combined in a single document; using prefixes keeps elements from different namespaces distinct without adding clutter that hampers readability. [ . . .Record omitted to conserve space. . . ]
[ . . .Record omitted to conserve space. . . ] [ . . .Record omitted to conserve space. . . ]
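A sketch of how the two namespaces might be combined in a single document follows; the wrapper element and the prefixes xobis and big are illustrative choices, not taken from the printed example:

```xml
<xobis:RecordList xmlns:xobis="http://medlane.stanford.edu/ns/xobis/1.0"
                  xmlns:big="http://www.biglibrary.com/marcInXML">
  <!-- The prefix on each element tells the parser which namespace it belongs to -->
  <xobis:Record><!-- ...XOBIS record content... --></xobis:Record>
  <big:Record><!-- ...the other library's record content... --></big:Record>
</xobis:RecordList>
```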
It is important to note that the namespace URI does not actually have to be a real URL (http://machine.server.domain/path/to/resource); nothing in the namespace specification requires this. A URI, used in an XML schema, is for identification purposes only. When an XML document with a namespace is parsed and processed, the target of the identifying URI is never retrieved or checked. XML namespaces were not defined in the 1.0 version of the XML specification itself; they are the subject of a separate W3C specification, which has been revised to accompany XML 1.1. At this time, almost every other XML-related technology makes use of, or at least supports, them. Many XML technologies actually use namespaces to identify themselves as a set of XML elements and attributes that solve a particular problem. For instance, XSL stylesheets, which are themselves expressed in XML, use an “xsl” prefix to indicate that their elements belong to the XSL specification. Since XML-related technologies are often XML-based, there is a high degree of dependency and intermingling among the various toolkits. This can be a good or a bad thing, depending on one’s perspective. On the one hand, reusing standards makes learning a new XML-related technology easier; on the other, if XML technologies are too tightly integrated, there is little room for innovation.
I NEED SOME (XML) VALIDATION! All XML documents and data must be well-formed, i.e., they must conform to XML’s rules of syntax, in order to be read by a parser; but not all XML documents or data need to be valid, i.e., conform to a particular structure as defined in a companion DTD or other XML-based schema. However, if a library wants to share its XML data with others, it is a good idea to ensure that the XML the library exports conforms to a standard XML-based schema. There are many advantages to formalizing an XML data model with an XML schema. For one, being able to check a document for validity ensures better data integrity. Data that has been validated is also easier to share with others. Perhaps the most important reason to create an XML schema, though, is that creating a schema gives the schema creators a thorough understanding of their own data structures. A better understanding of a library’s data will enable librarians to capitalize on the strengths of XML to better represent their library’s information. XML schemas come in a variety of types. The first type of XML schema was the
DTD, or Document Type Definition. DTDs were inherited from the SGML community and so are not expressed in XML. They work well, but many organizations and individuals, recognizing that storing information in XML has its advantages, began to wonder whether validation schemes would benefit from being expressed in XML. In time, the W3C decided to create its own XML-based schema language based on the ideas expressed by the XML community. It formed a committee and started work on XML Schema, a schema language that uses XML itself to express the conditions for validation. Many in the XML community believe that XML Schema is the best standard for the validation of XML. Others feel that the W3C’s XML Schema does not do the job as well as they would like and are exploring alternatives. RELAX NG is one such alternative; it is based on two earlier schema languages, TREX and RELAX. The ISO, the International Organization for Standardization, has recognized RELAX NG’s usefulness and established it as an ISO standard. Which schema language should a library use to validate its data? This is a difficult question to answer because it depends, in part, on what a library wants to do with its data. DTDs are probably the most widespread means of validating XML documents and data, and it is doubtful that they will disappear completely for a long time. There are, however, some significant weaknesses one must deal with when using a DTD. As a result, many organizations are standardizing on XML Schema, partly because the W3C recommends it. However, the authors, along with many others in the XML community, see RELAX NG as a great option for simple but powerful XML validation.
DTDs (Document Type Definitions)
DTDs are the oldest XML schema technology, having been adapted for use with XML after originating in the SGML community. While DTDs do not use an XML-based language to express the conditions for validation, they do use a format that is familiar to many in the structured information community. In addition, they can be embedded in, or exist apart from, an XML document. If a DTD is external, that is, if it exists as a file apart from the XML document(s) it validates, a DOCTYPE declaration must be added to each XML document that uses the DTD, just after the XML declaration. The DOCTYPE declaration identifies, and refers the parser to, the file containing the relevant DTD. Alternatively, if a DTD is stored within an XML document, it is embedded in that document’s DOCTYPE declaration. In either case, the DOCTYPE declaration tells the XML parser that the markup in the document can be validated against a DTD. The following examples illustrate different methods for linking a DTD to an XML document. If a library chooses to embed DTDs in its XML documents, its librarians should keep in mind that each document containing the DTD must be updated every time a change is made to the DTD. For this reason, the authors suggest that a library use external DTDs if DTDs are the form of validation its librarians select.
[ . . .internal-DTD example omitted to conserve space. . . ]
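An embedded DTD of this kind might look like the following sketch; the element names TITLE and DOCTEXT are taken from this chapter’s later examples, and the exact markup of the printed example is assumed:

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE DOCUMENT [
  <!ELEMENT DOCUMENT (TITLE, DOCTEXT)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT DOCTEXT (#PCDATA)>
]>
<DOCUMENT>
  <TITLE>Sample document 1</TITLE>
  <DOCTEXT>This is an example of an internal DTD</DOCTEXT>
</DOCUMENT>
```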
In this example, the DTD is embedded in the XML document’s DOCTYPE declaration. As a result, the XML declaration’s optional “standalone” attribute is set to “yes,” meaning that the document does not rely on an external DTD. The next example takes the other approach: [ . . .external-DTD example omitted to conserve space. . . ]
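The external approach might look like the following sketch, with the file name document.dtd and the element names drawn from the surrounding discussion:

```xml
<?xml version="1.0" standalone="no"?>
<!-- The DOCTYPE declaration points the parser at a DTD stored in a separate file -->
<!DOCTYPE DOCUMENT SYSTEM "document.dtd">
<DOCUMENT>
  <TITLE>Sample document 2</TITLE>
  <DOCTEXT>This is an example of an external DTD</DOCTEXT>
</DOCUMENT>
```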
In this example, the DTD is stored separately from the XML documents that use it. In cases like this, each document must still contain a link to the DTD so that an XML parser can use it to validate the document. This link also takes the form of a DOCTYPE declaration, which names the document’s root element, DOCUMENT, and specifies the location of the DTD file. In the example above, the referenced DTD is located in a separate file, document.dtd, in the same directory as the XML document. Since this DTD is only used on the current file system, the DOCTYPE declaration indicates that it is a SYSTEM DTD. DTDs may also be PUBLIC, meaning the DTD is available for use from any file system; a PUBLIC DTD is identified by a public identifier, usually accompanied by a full URL, so that the parser can locate the DTD from any machine on the Internet. There are several types of components within DTDs. They contain parsed character data, mixed content, child lists, sequences, and choices. DTDs use these components to describe and validate the markup that is used in XML documents. Parsed character data, or PCDATA, is text that occurs within the context of an XML element. PCDATA may contain XML content or predefined entity references. Entity references, as explained in chapter 1, allow external text, unkeyable text, or frequently repeated text to be referenced by a unique descriptor. When such a descriptor is used in an XML document, the entity descriptor appears
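A PUBLIC DOCTYPE declaration might look like the sketch below; the public identifier and URL are hypothetical, invented for illustration:

```xml
<!-- A public identifier names the DTD; the URL tells the parser where to fetch it -->
<!DOCTYPE DOCUMENT PUBLIC "-//BigLibrary//DTD Document//EN"
    "http://www.biglibrary.com/dtds/document.dtd">
```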
in the content of the document, and the entity itself is identified in the DTD. The following example demonstrates a case of parsed character data that contains content and an entity reference. This example and the examples that follow each begin with a segment of the DTD, followed by an XML instance that could be validated by the DTD segment. [ . . .example omitted to conserve space. . . ]
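Such a pairing might look like the following sketch; the entity value “Joe Speaker” and the element names come from the surrounding discussion:

```xml
<!-- DTD segment: declares the entity and two PCDATA elements -->
<!ENTITY AUTHOR "Joe Speaker">
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT DOCTEXT (#PCDATA)>

<!-- XML instance: the &AUTHOR; reference is expanded by the parser -->
<TITLE>A Presentation by &AUTHOR;</TITLE>
<DOCTEXT>The future of XML is so bright librarians need to wear shades!</DOCTEXT>
```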
When the XML above is parsed and output, the “&AUTHOR;” entity reference will be replaced with the value of the entity declaration, and the title of the presentation will read, “A Presentation by Joe Speaker.” This is valid XML because the DTD defines that the DOCTEXT element contains PCDATA. Validated XML documents may also contain elements that hold both parsed character data and other XML elements; this combination is known as “mixed content.” Mixed content is not often found in metadata, but libraries that distribute full-text documents to their patrons might need to use it; it is similar to a paragraph tag that contains some italicized words. The following example demonstrates a mixed content data model. [ . . .example omitted to conserve space. . . ]
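A mixed content model like the one described might be sketched as follows; the placement of the KEYWORD element within the sentence is assumed:

```xml
<!-- DTD segment: DOCTEXT may freely mix text with KEYWORD elements -->
<!ELEMENT DOCTEXT (#PCDATA | KEYWORD)*>
<!ELEMENT KEYWORD (#PCDATA)>

<!-- XML instance: plain text and a segregated keyword in one container -->
<DOCTEXT>The future of <KEYWORD>XML</KEYWORD> is so bright
librarians need to wear shades!</DOCTEXT>
```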
In this example, the DTD allows the DOCTEXT element to contain parsed character data and the KEYWORD element. Mixed content allows XML to contain plain text and text that, for some special reason, needs to be segregated from the rest of the text in the container. This might be used to indicate keywords in a document or to assist with the automatic classification of full-text documents. Up to this point, most of this section’s examples have contained at least one hierarchical level at which more than one element occurs. In the case above, the DOCUMENT element has two child elements, TITLE and DOCTEXT. In DTD terminology,
this is called a “child list.” In a DTD, which is a relatively flat structure, child lists model the hierarchy of elements by which a document is structured. Child lists may contain sequences and choices. If one DTD element contains a child list with other elements as the children, those elements might then also contain child lists possessing children of their own. To keep child lists organized, sequences and choices describe the relationships between the children in the list. In DTDs, sequences describe the number of times that an element may occur in an XML document. For this, DTDs use symbols that are appended to the child element’s name: the plus sign indicates that a child occurs one or more times, the asterisk indicates that a child occurs zero or more times, and the question mark indicates that a child occurs either once or not at all. [ . . .example omitted to conserve space. . . ]
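A citation-list model of this kind might look like the following sketch, with the content models taken from the discussion that follows:

```xml
<!-- DTD segment: a CITATIONLIST holds one or more CITATIONs; each CITATION
     has zero or more AUTHORs, exactly one TITLE, and an optional PUBLISHER -->
<!ELEMENT CITATIONLIST (CITATION+)>
<!ELEMENT CITATION (AUTHOR*, TITLE, PUBLISHER?)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>

<!-- XML instance -->
<CITATIONLIST>
  <CITATION>
    <TITLE>The Meaning of Life</TITLE>
    <PUBLISHER>O'Reilly</PUBLISHER>
  </CITATION>
  <CITATION>
    <AUTHOR>Caitlin Clarke</AUTHOR>
    <AUTHOR>Christy Clarke</AUTHOR>
    <TITLE>Stories of a Renegade Librarian</TITLE>
  </CITATION>
</CITATIONLIST>
```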
This example illustrates the use and flexibility of child lists in a DTD. In the example, both CITATIONLIST and CITATION contain child lists; the child list of CITATIONLIST, however, contains only one child. Since citation lists should contain at least one CITATION, the CITATION element in CITATIONLIST’s child list has a sequence symbol indicating that it must occur at least one time. There is no limit on the number of times the CITATION element may occur. Since not every bibliographic work will have an author, the AUTHOR element in CITATION’s child list contains a sequence symbol indicating it may or may not occur. Like the plus sign, the asterisk does not set a limit on the maximum number of times an element may occur. If, as in the above example, an element in a child list does not have a sequence symbol, the default is that it must occur once. Since works may or may not have a publisher, the PUBLISHER element’s sequence symbol is set to indicate that it may or may not occur, but if it does occur, it may only occur once. These three symbols give DTD creators a great deal of flexibility in defining the rules to which XML documents must adhere. There are, however, other conditions that
may also be applied to elements in a child list. These “choice” conditions, which restrict the grouping of XML elements, are controlled by Boolean operators, something with which most librarians are already familiar. For brevity, DTDs use symbols to represent the Boolean operators: a comma represents AND and a vertical bar represents OR. This gives DTD creators the ability to control both the number of times individual elements may occur and the number of times groupings of elements may occur. Elements are grouped in DTDs by putting parentheses around them. To distinguish, for instance, between different types of authors in the citation list, a group of authors is created, with a comma or vertical bar indicating how choices from that group should be made. [ . . .example omitted to conserve space. . . ]
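Such a choice might be sketched as follows; the element names come from the discussion that follows:

```xml
<!-- DTD segment: each AUTHNAME contains either PERSONAL or CORPORATE, never both -->
<!ELEMENT CITATION (AUTHNAME*, TITLE)>
<!ELEMENT AUTHNAME (PERSONAL | CORPORATE)>
<!ELEMENT PERSONAL (#PCDATA)>
<!ELEMENT CORPORATE (#PCDATA)>
<!ELEMENT TITLE (#PCDATA)>

<!-- XML instance: two AUTHNAMEs, one of each kind, in one CITATION -->
<CITATION>
  <AUTHNAME><PERSONAL>Caitlin Clarke</PERSONAL></AUTHNAME>
  <AUTHNAME><CORPORATE>Renegade Librarians, Inc.</CORPORATE></AUTHNAME>
  <TITLE>Stories of a Renegade Librarian</TITLE>
</CITATION>
```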
In this example, the elements AUTHNAME and TITLE can occur at the same level in a CITATION’s child list. The PERSONAL and CORPORATE elements cannot, however, both be children of the same AUTHNAME element. They can, as seen in the example, be children of different AUTHNAME elements. Since the AUTHNAME element is repeatable and can occur from zero to many times, this is not a problem for the document. DTDs are the foundation for validation in the XML world. There are, however, problems with using DTDs as the only form of validation. DTDs do not use XML; as a result, they require special tools and editors. Unfortunately, not many have been created, and DTD authors, for the most part, resort to simple text editors to maintain and create their DTDs. DTDs also do not specifically handle namespaces. XML documents that use namespaces can be validated by DTDs, but to do this one must make the DTD’s element names match the full name of the XML element and its namespace. This means the namespace must be repeated on every element and attribute in the DTD,
inflating the size of the document unnecessarily. This is not the case for XML-based validation schemes. Still, many people use DTDs because they are well established and not likely to change. If a library does not want to use namespaces, or any of the XML-related technologies that rely on them, DTDs are a good option.
XML Schemas
Recognizing the problems with DTDs, some in the XML community started developing alternative, XML-based schema languages. XML Schema developed in response to a need for standardization among these languages. Though standardized by the W3C, XML Schema is largely based on a schema language developed by Microsoft. After deciding to use Microsoft’s schema language as the foundation for its own, the W3C created an XML Schema working group to discuss improvements that should be made to the existing DTD validation model. The working group identified four areas of improvement that it believed should be incorporated into the new XML Schema. The first, and probably most important, was the idea of datatypes. Datatypes specify the kinds of data that are present in a document; they allow XML content, not just its structure, to be validated. If an element contains a time, datatypes allow a schema to check the content of that element to determine whether its value is actually a time value. This idea is so important that the W3C created a separate specification, under the auspices of the XML Schema working group, for XML schema datatypes. This separation allows the datatypes to be used by other XML-based schema languages as well. The second improvement was native support for namespaces. Unlike DTDs, which force schema creators to do extra work, XML Schema provides a mechanism for dealing natively with namespaces. Namespace support is essential for most XML applications because many XML-related technologies rely on namespaces to identify element and attribute sets. The third change is that the W3C expressed its schema language in XML. Doing so allows for the automation of schema processing and management, and it allows XML Schemas to take advantage of the many tools created for editing and manipulating XML documents.
The last improvement the W3C made was to create an extensible schema language. By introducing some basic object-oriented concepts into the language, XML Schemas are able to support new schema capabilities as they are needed. The unfortunate result of this was that the committee also introduced a great deal of complexity into the language. This section does not try to completely describe the XML Schema specification; there are whole books for that. Instead, it just attempts to discuss a few introductory basics. Getting started with XML Schema is much like beginning work with most other XML-based schema languages. All XML-based schemas start with the schema’s root element. The root element contains attributes that define the schema language’s namespace and the namespaces of other elements in the document. The root element of an XML Schema may also have an attribute that specifies whether the elements in the
document validated by the schema need to be qualified by their default namespace. If the value of that attribute is “qualified,” the elements in the XML document must be explicitly qualified by their namespace. The same requirement is made of attributes by setting the schema’s root attribute, “attributeFormDefault,” to “qualified.” [ . . .content omitted to conserve space. . . ]
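The root element of such a schema might look like the sketch below; the target namespace is the XOBIS URI cited earlier, and elementFormDefault is the attribute that governs element qualification:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://medlane.stanford.edu/ns/xobis/1.0"
           elementFormDefault="qualified"
           attributeFormDefault="qualified">
  <!-- ...element and type declarations... -->
</xs:schema>
```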
XML documents that use an XML Schema to validate their markup do so by referencing the “XMLSchema-instance” namespace in the root element of the document. (See the example below.) Once this has been done, the root element can carry an attribute from that namespace specifying where the document’s schema is located: the “schemaLocation” attribute. For instance, if a bibliographic record uses the schema created in the example above, its root element should contain a namespace attribute whose value is the XML Schema Instance namespace. It should also contain a schemaLocation attribute that pairs the target namespace from the example above with the schema’s location: the attribute’s value contains the schema’s target namespace, followed by white space and the physical location of the file containing the XML Schema. [ . . .content omitted to conserve space. . . ]
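A record’s root element might then look like the following sketch; the file name xobis.xsd is hypothetical:

```xml
<Record xmlns="http://medlane.stanford.edu/ns/xobis/1.0"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://medlane.stanford.edu/ns/xobis/1.0 xobis.xsd">
  <!-- ...record content... -->
</Record>
```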
The structure of XML Schema relies on elements and references to elements. There are simple type elements and complex type elements. Simple type elements cannot contain any other elements, or any attributes. Complex elements contain other elements, usually in the form of “sequences” for ordered elements or in the form of unordered groupings of elements. The following example shows a complex type element that contains a sequence; the sequence contains two references to other elements. References work by pulling the target of the reference element into the reference element’s position in the schema. In the example, “Title” and “Author” become part of the “book” sequence.
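Such a declaration might be sketched as follows; the Title and Author elements are assumed to hold simple string values:

```xml
<xs:element name="Title" type="xs:string"/>
<xs:element name="Author" type="xs:string"/>

<!-- The references pull Title and Author into the "book" sequence -->
<xs:complexType name="book">
  <xs:sequence>
    <xs:element ref="Title"/>
    <xs:element ref="Author"/>
  </xs:sequence>
</xs:complexType>
```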
While some other schema languages express all their conditions for validation as elements, XML Schema also makes use of attributes to control how a document is validated. Examples of this are the “minOccurs” and “maxOccurs” attributes. If an element is only allowed to occur ten times, it should, in an XML Schema, carry a maxOccurs="10" attribute; if there is no limit on the number of times the element may occur, it should carry a maxOccurs="unbounded" attribute. Elements that must occur at least once receive a minOccurs="1" attribute, while elements that may or may not occur receive a minOccurs="0" attribute. Though a bit more verbose, this is less confusing than DTDs’ use of symbols (the plus sign, asterisk, and question mark) to control occurrence. It also provides greater flexibility because it is possible to set the exact number of times an element should be allowed to occur. The occurrence of attributes, like that of elements, can also be controlled with XML Schema. Attribute occurrence is specified by the “use” attribute of the XML Schema element named “attribute.” Unlike elements, attributes either occur or do not occur; there are never multiple occurrences of the same attribute. As a result, if an attribute is required for a document to be valid, the “use” attribute’s value should be set to “required.” The example below is of a “PatronName” element that contains an attribute, “appellation,” and a text value. For comparison, the same structure in RELAX NG, another XML schema language, is included just beneath the XML Schema excerpt.
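The pair of excerpts might look like the following sketch; the appellation attribute is shown as required for illustration:

```xml
<!-- XML Schema: an element with a text value and an attribute -->
<xs:element name="PatronName">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="appellation" type="xs:string" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>

<!-- The same structure in RELAX NG: an unwrapped attribute pattern is required -->
<element name="PatronName" xmlns="http://relaxng.org/ns/structure/1.0">
  <attribute name="appellation"/>
  <text/>
</element>
```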
One reason the XML Schema version is more complicated than the RELAX NG version is that XML Schema was designed using object-oriented principles. This allows XML Schemas to extend schema “objects” so that complex combinations of characteristics can be built up through the inheritance of characteristics from other elements. In the example below, the root element is “LibrarySchoolPatron.” LibrarySchoolPatron inherits the characteristics of “Patron” and the other elements that it extends. The base element in the example is “PatronName”; just as in the previous example, PatronName contains a text value and an appellation that can record whether the patron is a Mrs., Ms., Jr., Mr., or Sir. There is also a Patron element. This element has a child element indicating that its content, unlike that of the PatronName element, is complex.
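The inheritance chain might be sketched as below. The Name child element is an assumption added so that the extension is legal XML Schema (a type extended with new child elements must itself have element content); the PhoneNumber and Address details come from the discussion that follows:

```xml
<!-- Base type: a name with an appellation attribute -->
<xs:complexType name="PatronName">
  <xs:sequence>
    <xs:element name="Name" type="xs:string"/>
  </xs:sequence>
  <xs:attribute name="appellation" type="xs:string"/>
</xs:complexType>

<!-- Patron extends PatronName, adding a phone number and one to five addresses -->
<xs:complexType name="Patron">
  <xs:complexContent>
    <xs:extension base="PatronName">
      <xs:sequence>
        <xs:element name="Address" type="xs:string" minOccurs="1" maxOccurs="5"/>
      </xs:sequence>
      <xs:attribute name="PhoneNumber" type="xs:string"/>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>

<!-- A specific kind of patron built from the generic Patron type -->
<xs:element name="LibrarySchoolPatron" type="Patron"/>
```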
The “extension” element, in Patron, indicates that the Patron element extends, or inherits the characteristics of, the PatronName element. In addition to the information that it inherits, the Patron element also has its own unique characteristics: it contains a “PhoneNumber” and at least one, but up to five, addresses. The Patron type can now be used to create LibrarySchoolPatrons, MedicalSchoolPatrons, PublicLibraryPatrons, or any other type of library patron. In the above example, a LibrarySchoolPatron is created from the generic Patron type. It is interesting to note that RELAX NG also allows for the reuse of schema components; unlike XML Schema, however, it does this by defining patterns that can be reused rather than by using elements whose characteristics are inherited. (For comparison, a very simple example is included in the RELAX NG subsection following this one.) One thing that RELAX NG and XML Schema do have in common is that both can use the XML Schema datatype specification to define the types of values one would expect to find in elements and attributes. Consider the example above. In it, the Patron’s phone number is represented by an attribute called PhoneNumber. PhoneNumber is an XML Schema string type, like any other untyped XML value. Using datatyping, it is possible to ensure that the text of the PhoneNumber value actually takes the form of a U.S. phone number. The following example demonstrates how XML Schema does this.
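A restriction of this kind might be sketched as follows. The type name and the exact pattern are guesses at the printed example; note that XML Schema pattern facets are implicitly anchored at both ends, so no explicit start- or end-of-string anchors are needed:

```xml
<xs:simpleType name="PhoneNumberType">
  <xs:restriction base="xs:string">
    <!-- Three-digit area code with optional parentheses, segments separated
         by hyphens: matches (650)-555-1234 and 650-555-1234 -->
    <xs:pattern value="\(?\d{3}\)?-\d{3}-\d{4}"/>
  </xs:restriction>
</xs:simpleType>
```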
A phone number is actually a string (of characters) with particular characteristics; in this case, the “restriction” element restricts the value of a phone number type to a string value with the appropriate characteristics. This is accomplished by creating a regular expression to which the string must conform. Readers unfamiliar with regular expressions should not worry too much if the pattern in the example above does not make sense. Regular expressions, essentially, describe a string’s characteristics using predefined character values. For instance, in the phone number pattern above, the ^ at the start of the pattern indicates that the phone number value must match from the start of the string. The $ at the end indicates that the phone number value must also match from the end of the string. The characters in the middle of the regular expression pattern indicate that the phone number value must have an area code in addition to the regular three-four combination of digits found in phone numbers in the United States. The question marks in the regular expression pattern above indicate that the parentheses around the area code in the phone number’s value are optional. The regular expression also ensures that hyphens are used to separate segments of the phone number, including the area code segment if parentheses are not used. Regular expressions can be used to match almost any kind of string.
XML Schema addresses specific dissatisfactions with the use of DTDs to validate XML documents and data. The validation language is XML; values may be datatyped; namespaces are supported; lastly, elements may be built on other elements, inheriting sets of characteristics in an object-oriented fashion. XML Schemas also have the advantage that they are a W3C recommendation. For many in the XML community, however, XML Schema’s emphasis on extending elements to create more complex elements introduces an unacceptable level of complexity. They believe that most of the useful functionality in XML Schema can be implemented in a simpler fashion by defining validation patterns against which an XML document should be validated.
RELAX NG Schemas
RELAX NG (Regular Language for XML Next Generation), and the two schema languages on which it is based (TREX and RELAX), are attempts to simplify XML validation without sacrificing power or flexibility. RELAX NG has many advantages over both DTDs and XML Schemas. The first and most important is that RELAX NG is easy to understand and use. HTML became the success that it is, in part, because it was accessible to a wide variety of people; this book’s authors believe an XML schema language should embrace the same goal of simplicity. RELAX NG’s second advantage over the other schema languages is that it is built on solid computer science theory (tree automata). RELAX NG provides strong support for unordered content and modular datatyping (the ability to use different datatype sets with the schema language), making it a good choice for a schema language. The last reason the authors believe libraries should consider RELAX NG is that it treats validation as a process with two separate inputs, the schema and an XML document. XML Schemas and DTDs require that a reference to the schema used to validate a document exist in the XML document itself; this document-schema pre-coordination limits processing flexibility. There are several good implementations of the RELAX NG schema language. Staff at the Lane Medical Library use the Java program Jing to validate XML. James Clark, Jing’s creator, has been outspoken about the benefits of RELAX NG. In addition, references and tutorials are not hard to find on the Web; those interested in learning more about RELAX NG should visit the XOBIS website for pointers (Lane 2002b). The first part of any RELAX NG schema is the root element, “grammar.” The “grammar” element is where the datatype library and the document’s namespaces are defined:
The RecordList element is the root element in XOBIS, the XML Organic Bibliographic Information Schema. It contains one or more Record elements. [ . . .omitted to conserve space. . . ] [ . . .other definitions omitted to conserve space. . . ]
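The opening of such a grammar might look like the sketch below; the annotation namespace URI is hypothetical, while the datatype library and XOBIS namespace follow the surrounding discussion:

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:ann="http://medlane.stanford.edu/ns/annotation"
         ns="http://medlane.stanford.edu/ns/xobis/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Foreign-namespace elements such as this are ignored by the validator -->
  <ann:definition>The RecordList element is the root element in XOBIS.
    It contains one or more Record elements.</ann:definition>
  <!-- ...start element and definitions... -->
</grammar>
```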
In this example, the namespace of the schema language is defined in the “grammar” element. This is the default namespace for all the elements in this document. So that this schema does not have to rely on XML comments to record documentation, an “annotation” namespace is also defined. This works because RELAX NG validators ignore elements whose namespace differs from that of the elements they are validating. The “definition” element above is an example of an element from the annotation namespace: it is not checked by the RELAX NG validator, but its structure can be used by other programs to output schema documentation in HTML or PDF formats. Creating an annotation namespace also allows librarians to maintain notes and annotations intended for intralibrary use. Besides the schema and annotation namespaces, the target namespace and the datatype library must also be declared. The datatype library may be the one distributed by the W3C or one created specifically for a library’s project; it is indicated by the “datatypeLibrary” attribute. The namespace of XOBIS records is set using the “ns” attribute. On the RELAX NG website (OASIS 2002), there are several RELAX NG schemas, the most notable being a schema to validate the RELAX NG schema language itself and another to validate XLinks in a document, something the XML Schema language has not yet accomplished. There are also additional schemas that others in the library community might find useful; for instance, a RELAX NG schema for the DocBook format is available. After the RELAX NG grammar element is defined, the “start” element must be added. Since RELAX NG schemas are designed to be modular, the start element informs a validator what the root element of the XML document should be.
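Following that description, the start element might be as simple as:

```xml
<start>
  <!-- The document's root element must match the pattern defined as "RecordList" -->
  <ref name="RecordList"/>
</start>
```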
In the case of an XOBIS document, the root element is “RecordList.” In the first example, the content of the start element is a reference to a RELAX NG pattern defined as “RecordList.” Creating references to defined schema elements allows discrete units of information to be reused in different places and contexts; here, references serve to keep the schema’s organization clear and comprehensible to the humans who edit it. It is also possible, however, to write the start element so that it includes the first level of the XOBIS schema inline, as in the following example:
[ . . .omitted to conserve space. . . ] [ . . .omitted to conserve space. . . ] [ . . .omitted to conserve space. . . ] [ . . .the rest omitted to conserve space. . . ]
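The inline form might look like the sketch below; the text pattern inside Record stands in for the omitted content model:

```xml
<start>
  <element name="RecordList">
    <oneOrMore>
      <ref name="Record"/>
    </oneOrMore>
  </element>
</start>

<!-- The reference above is replaced by the contents of this define -->
<define name="Record">
  <element name="Record">
    <text/>  <!-- ...actual content pattern omitted... -->
  </element>
</define>
```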
This example demonstrates how references and element definitions work. A RELAX NG schema reference (“ref”) simply refers the validation engine to another part of the schema document. Information from that location is included in place of the reference.
The information defined in a “define” element does not necessarily have to be an element like that in the above example. Any schema pattern may be placed there, as long as the result, when the reference is replaced with the data from the define, is well-formed XML. Using references and defines is a way to modularize the schema so that there is little duplication. This pattern approach to validation accomplishes many of the same things that the object orientation of XML Schema does, but without introducing as much complexity. An element from the example above that might be new to the reader is the “oneOrMore” element. Like DTDs, RELAX NG schemas allow the schema creator to control the number of times an element should occur in a valid XML document. In addition to oneOrMore, there is also a “zeroOrMore” occurrence element. These elements work just as their names suggest: anything contained within a oneOrMore element must occur, as a unit, one or more times. In combination with these occurrence constraints, RELAX NG also allows librarians to create a choice between possible schema structures. We at Lane Medical Library have found RELAX NG’s simple choice mechanism much more flexible than those of DTDs or XML Schema. Choice works by establishing a pattern; elements and attributes can be included within the options:
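Such an author pattern might be sketched as follows, mixing element and attribute options as the discussion that follows describes:

```xml
<define name="author">
  <element name="author">
    <oneOrMore>
      <element name="name">
        <optional>
          <attribute name="type"/>  <!-- pseudonym, personal, corporate, ... -->
        </optional>
        <!-- The first name may be an element or an attribute... -->
        <choice>
          <element name="first"><text/></element>
          <attribute name="first"/>
        </choice>
        <!-- ...and so may the last name, independently of the first -->
        <choice>
          <element name="last"><text/></element>
          <attribute name="last"/>
        </choice>
      </element>
    </oneOrMore>
  </element>
</define>
```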
In this example, the “author” element has a variety of possible structures. First, there may be one or more names associated with a single author element. If there is more than one name, the type attribute can be used to specify what type of name it is (pseudonym, personal name, corporate name, fictional name, etc.). Once there is at least one name, the structure of the “name” element may also vary. The name element might have child elements named “first” and “last,” or it might have attributes with those same names. According to the schema, the name element might also have a first name attribute and a last name element. The “choice” element in RELAX NG allows for a great deal of flexibility, but suppose more control is desired. In this case, the “group” element can be used to precoordinate elements that should appear together:
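The original markup is lost here; a hedged sketch of such a “group” construction, using the element and attribute names from the discussion above, might be:

```xml
<element name="name">
  <choice>
    <!-- either both names as elements... -->
    <group>
      <element name="first"><text/></element>
      <element name="last"><text/></element>
    </group>
    <!-- ...or both names as attributes, never a mixture -->
    <group>
      <attribute name="first"/>
      <attribute name="last"/>
    </group>
  </choice>
</element>
```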
This example demonstrates a “choice” element that separates the use of elements and attributes; if a first-name element is selected, the last name must also be expressed as an element. The “group” element makes this possible. Grouping requires sets of elements or attributes to appear together; if they do not, the document does not validate.
The preceding example is fine for names in the United States, but suppose some of the authors are from China, where a person’s family name appears first and the individual name last. A RELAX NG schema can handle both types of names by using the “interleave” element, which supports flexible, unordered structures:
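A sketch of the missing example, building on the “name” pattern above (names illustrative), might be:

```xml
<element name="name">
  <choice>
    <!-- interleave: first and last may appear in either order -->
    <interleave>
      <element name="first"><text/></element>
      <element name="last"><text/></element>
    </interleave>
    <!-- attributes need no interleave; they are unordered by nature -->
    <group>
      <attribute name="first"/>
      <attribute name="last"/>
    </group>
  </choice>
</element>
```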
In this example, while the “group” element groups elements and attributes together in the order that they appear in the schema, the “interleave” element allows elements to be grouped together without requiring that they appear in the order that they do in the schema. The curious reader might wonder why the “interleave” tag is not used for the “first” and “last” attributes as well. The attributes in a RELAX NG schema, just as in DTDs, do not maintain any particular order. The elements in the “interleave” grouping, on the other hand, will now be valid if the family name comes before the individual name, and vice versa. RELAX NG, like other XML schemas, offers the option of having optional elements and attributes. This is done in one of two ways. The first is to use an “optional” element; the second is to use “zeroOrMore” tags. RELAX NG’s pattern structure
allows for restrictions that are not possible when a DTD is used to validate a document. For instance, the next example demonstrates some of RELAX NG’s flexibility. Unfortunately, this flexibility sacrifices brevity:
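The book’s excerpt did not survive extraction; the following sketch reconstructs its general shape from the surrounding discussion — a primary ID followed by any number of variant IDs, each pulling in a shared “idContent” pattern (all names here are illustrative):

```xml
<define name="idList">
  <element name="ID">
    <attribute name="level"><value>primary</value></attribute>
    <ref name="idContent"/>
  </element>
  <zeroOrMore>
    <element name="ID">
      <attribute name="level"><value>variant</value></attribute>
      <ref name="idContent"/>
    </element>
  </zeroOrMore>
</define>

<!-- Nested optionals: "form" may appear only when "id" is present -->
<define name="idContent">
  <optional>
    <attribute name="id"/>
    <optional>
      <attribute name="form"/>
    </optional>
  </optional>
  <text/>
</define>
```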
This excerpt contains two sections. The main part is referenced first and contains a schema reference that pulls in the information from the “idContent” element. This is useful because the idContent information is used in the same form in two different parts of the schema, once after the first ID and then each time after that as the ID occurs within the context of the “zeroOrMore” element. RELAX NG’s ability to define a pattern and then reuse it makes schema writing much more enjoyable.
The most interesting part of this schema excerpt, however, might be that there is the potential to create multiply occurring ID elements, with the first being different from all the others. This is accomplished by assigning a “level” attribute that distinguishes between primary and variant IDs. The ability to define patterns that can be used within different choice contexts gives schema writers a great deal of flexibility. Did you notice that, in the above example, an attribute can be conditionally required based on the occurrence of another attribute? By nesting optional elements, a “form” attribute may be optionally included, but only after an ID attribute has been used. IDs may be used to signify that an element has an authority record in the catalog. If it does, the “form” attribute indicates what form of the authority record to use in a particular context. RELAX NG can also enforce many of the same restrictions that the other schema languages can. For instance, enumerations of values and datatyping are both built into the RELAX NG specification. The following example demonstrates them:
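Only the enumerated values (create, update, suppress, delete) survive from the original example; a sketch built around them, with a hypothetical “status” element carrying a datatyped value, might be:

```xml
<element name="status">
  <!-- the attribute's value must come from this enumeration -->
  <attribute name="action">
    <choice>
      <value>create</value>
      <value>update</value>
      <value>suppress</value>
      <value>delete</value>
    </choice>
  </attribute>
  <!-- the element's content must be a valid XML Schema date -->
  <data type="date"
        datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"/>
</element>
```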
Values in RELAX NG schemas can be text values, empty values, enumerated values, or data values. Most of the previous examples have used text values, but placing an “empty” element as an element’s child validates elements that are required to be empty. The last example illustrates that elements may have datatyped values and that RELAX NG schemas can require that attributes have enumerated values. By placing a “choice” in an attribute, and filling that “choice” with a list of acceptable values, the value of the attribute can be restricted to one from a particular enumeration. Overall, RELAX NG enforces very powerful restrictions while managing to keep the schema language simple. Each library must make its own choice about the validation method it uses. Whichever method is selected, validating XML ensures data integrity, makes it easier to exchange information with other libraries, and gives those creating the schema a better understanding of their organization’s information. This knowledge will come in handy when the library needs to create stylesheets and the stylesheets need to accurately display all potential elements and attributes of the defined XML schema.
WHERE DO I GO FROM HERE?
Perhaps the most important XML-related technology in this chapter is the XPath standard. It is not important because it is the most widely implemented or the most popular technology; XPath is important because it, more than any other, is the foundation on which the other XML-related recommendations build. XPath is a language for identifying the locations of particular elements, attributes, and text values within a document by means of the tags’ relative positions, sequences, and content. It provides technologies like XQL, XQuery, XSLT, and XPointer with a way to identify specific parts of an XML document and navigate within its hierarchical structures. Without XPath, it would be impossible, or at least very difficult, to create stylesheets; navigate to other XML documents and point to their contents; locate particular elements, attributes, or text values within a document; or query an XML document. The flexibility that XML and its related technologies provide would be greatly diminished if they did not have the XPath specification on which to build their own unique sets of features. Another seminal XML-related technology is the XLink specification, which furnishes a set of instructions for defining and creating links between objects in a document. XLink was one of the earliest XML-related technologies to be proposed, probably because most of the people involved with the XML standards process recognized that linking was one of the fundamental technologies that spurred the rapid growth of the World Wide Web. Though not yet fully implemented, XLinks are expected to be the next generation of web links. At its most basic level, XLinking is very similar to HTML linking. XLinks do not stop there, however; the linking recommendation builds on HTML’s limited functionality and greatly enhances a document’s linking capabilities.
Since this chapter is intended only as a simple overview of XML’s related technologies, examples will be limited to what is possible today, but XLink’s future potential and its advantages over HTML’s linking mechanisms will be discussed. Building on the XLink specification, XPointers provide a method for locating and accessing specific parts or fragments of a target XML document for the purpose of linking to them, regardless of the internal structure of the document itself. Prior to XLink, document creators had to add anchor elements with a “name” attribute if they wanted to specify which parts of their HTML documents should be accessible by links. XML and the XPointer specification improve on this by allowing any part of an
XML document to be linked to without requiring an embedded anchor tag in that document. XPointer accomplishes this because, unlike HTML, XML has a well-formed structure that can be consistently referenced using XPath expressions.
XPath
XPath is so called because it expresses paths by which a computer system can locate and move among the elements, attributes, and values within an XML document. XPath’s data model conceptualizes an XML document or fragment as a hierarchy of nodes. A document’s nodes include the root node, element nodes, text nodes, and attribute nodes. XPath performs its function by selecting particular nodes from within the hierarchy of nodes that make up a document’s structure. A simple XPath expression can refer, for example, to the first (or any other sequential) occurrence of a particular element; the position of a child element relative to the parent element; any elements with specific attributes; and so on. More complex XPath expressions may involve movement through the hierarchy, text filters, contexts, and selections. XPath expressions are used by XPointer to identify the specific locations within a document which XLink then links to. The following examples are simple XPath expressions that reference a particular part of an XML document by linking together its nodes.

/root/child/leaf
/libdb/record
/book/chapter[2]/section[5]/subsection[2]/example[1]
The first example is the full path from the document’s root node to a child element that has a leaf element. A leaf element is an element that does not have any child elements. In the example, the leaf element is also an element named “leaf.” The leaf element may either be an empty element or it may have a text value. The second example above references all the “record” elements in a document whose root element is “libdb.” The last example gets a little more complicated. Its XPath expression includes filters, or predicates, that instruct an application how to handle the occurrence of multiple elements with the same name in a given context. (A context is the current location within the hierarchical structure of an XML document, the perspective from which an XPath expression must be evaluated.) This XPath example says that, from the root element “book,” the application should find the second “chapter” element. From the second chapter element, it should find the fifth “section” element and its second “subsection” element. From that subsection, the XPath expression should find and return the first “example” element. By navigating through an XML tree, any XML node can be referenced and returned. It does not matter if that node is an element, an attribute, a text node, or the root node:

child::record
attribute::*
text()
/descendant::figure[position()=42]
XPath expressions do not have to start at the root element of an XML document. In addition to selecting the full path from the root node, it is possible to specify a path from any particular node of an XML document. This node is called the context node. In the example above, the first XPath expression selects all “record” elements that are children of the context node. The context node might be the root element, if that has previously been selected, or it might be an element deep within the hierarchical tree. XPath expressions may be used to select any node within a document. The second and third XPaths in the example above respectively select all the attribute nodes and the text nodes of the context node. If the context node is a record element, the second XPath would select all the attributes of that record element. The third XPath expression returns the text of the current, or selected, node. Probably the most complicated XPath expression yet is the last one in the example above. This XPath combines node context, predicates, and node selection to retrieve the forty-second “figure” element in the document. To understand how this works, let’s examine the expression step-by-step. The first part of the XPath expression, “/descendant::figure,” selects the “figure” elements that are descendants of the root node. The next step, the predicate, indicates that the figure element selected must be the one in the forty-second position. Plainly stated, this expression references the forty-second figure element in document order. In all the previous examples, XPaths were constructed using nodes that have an absolute path. Some elements, however, may be found in different contexts in a single XML document. Take, for instance, an example where there is an Author element under a MainEntry element and a repeatable Author element under an AddedEntries element.
For cases like this, a generic XPath must be constructed. A generic path will, for instance, find authors with a particular name regardless of their element’s specific context.

//Author[text()="Mickey Miller"]
//Author[@first_name="Dick"]
.//Author
../Author
The first XPath in the example above demonstrates this. Instead of starting the path with a single slash that represents the root node, two initial slashes are used, indicating that any Author element in the document, regardless of its context, should be considered. The next part of the first XPath applies a filter, or predicate, to limit the XPath to those authors whose text value is “Mickey Miller.” This will return /Record/MainEntry/Author and /Record/AddedEntries/Author if the value of each Author element is “Mickey Miller.”
This generic method of referencing can be used for any type of node, not just element nodes. When there is a document with an Author element that has an attribute called “first_name” and another called “last_name,” the second XPath expression in the example above will return Author elements that have a first_name attribute equal to “Dick.” Generic referencing can also start from a particular context. Suppose an XPath should only reference certain Author elements. For instance, the XPath should only reference Author elements in the current node, /Record/AddedEntries, but in the document there is an Author element node with a different context, /Record/MainEntry. If the .//Author XPath from the example above is used, the path will point to Author elements that are found within the current context, returning only those from the AddedEntries tree. XPaths can also refer to nodes that are higher in the XML hierarchy than the current node. An XPath that starts with two dots will move the current context to the parent element’s node. The last XPath in the above examples will find all the Author elements that are at the same level in the hierarchy as the current node. In summary, XPath is a flexible referencing scheme that allows other XML technologies to uniformly refer to and select parts of an XML document. Its expressions are built from nodes and use filters to limit the set of nodes returned. Using XPath, it is possible to move up or down the XML tree while evaluating an expression and to select a node from any given context. XPath expressions may be evaluated from the root node or from within other contexts in the XML hierarchy.
XLink
XLinks define how one XML document references and links to another. They are similar to HTML links in their simplest form, but differ greatly when more complex forms are used. By defining a set of linking attributes that can be applied to any XML element, the XLink specification offers more flexibility than HTML’s linking mechanisms. For instance, any XML element may contain an XLink; this is unlike HTML linking, which requires the use of a special linking element. Still, the XLink standard is considered an extension of the HTML linking standard. XML linking was built on, and designed to be backward compatible with, the HTML standard. The following example demonstrates the differences between HTML links and XLinks by contrasting an HTML link with an XML fragment containing two XLinks.
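The markup itself did not survive extraction; a sketch of what it likely showed, with hypothetical element names and URLs, is:

```xml
<!-- An ordinary HTML link -->
<a href="http://www.ibiblio.org/">iBiblio Public Library</a>

<!-- Two simple XLinks; any element name may carry the linking attributes -->
<library xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="simple"
         xlink:href="http://www.ibiblio.org/">iBiblio Public Library</library>

<library xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="simple"
         xlink:href="http://www.loc.gov/"
         xlink:title="Link to the Library of Congress">Library of Congress</library>
```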
In all the cases above, elements act as links to other pages on the Web. Clicking on “iBiblio Public Library” in the first XLink example will produce the same result as clicking on “iBiblio Public Library” in the HTML example above it. The only minor difference is that the second XLink in the example above allows the link to have a human-readable title (“Link to the Library of Congress”) that can describe the link. As a result, different information may be used for the title and for the part of the link that a patron would click. Most would agree, given the networked state of the world, that linking is an essential part of any new standard for exchanging information over the Web. Many believe the ability to easily link to other documents is what made the World Wide Web so popular. Still, if HTML linking works so well, why is an XML-specific linking standard needed? HTML links have many limitations that XLinks do not have. HTML links must point to a single document. They can only be traversed in one direction. They also cannot link into documents that are not owned by the library unless, of course, the external document has an existing anchor tag. HTML links also do not support link histories that are independent of the patron’s browser. Lastly, HTML links depend on external scripting languages, most notably JavaScript, for important functionality. With these limitations in mind, the W3C set out to define a better linking mechanism. The XLink specification is the result. To be fair, it is important to note that many of the features described in the XLink specification are not yet implemented. Most of the simple XLink linking features have been implemented, however. Simple XLinks provide functionality not found in HTML links. A link in the XLink namespace will have a type attribute whose value is either “simple” or “extended.” Both simple and extended links have hypertext references just like HTML anchor tags, but XLinks also have a variety of other attributes.
One of these other attributes is the “show” attribute.
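The example itself is missing from this text; a hedged reconstruction, with illustrative element names and a hypothetical logo.jpg graphic, might be:

```xml
<library xmlns:xlink="http://www.w3.org/1999/xlink"
         xlink:type="simple"
         xlink:href="http://www.ibiblio.org/"
         xlink:show="embed"
         xlink:actuate="onRequest">iBiblio Public Library</library>

<!-- Loads with the page, like an HTML img tag, but on any element name -->
<image xmlns:xlink="http://www.w3.org/1999/xlink"
       xlink:type="simple"
       xlink:href="logo.jpg"
       xlink:show="embed"
       xlink:actuate="onLoad"/>
```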
The “show” attribute describes how a link should be displayed. The options for show include “replace,” “new,” and “embed.” The simple XLink in the example above has the show attribute set to “embed.” This means the link’s target is embedded in the page that is displayed. If a link’s show attribute is set to “new,” a new browser window opens with the link’s target. If the show attribute is set to “replace,” the current display is replaced when the link is clicked. Those familiar with JavaScript will recognize this functionality. The XLink creators believed linking should not have to rely on external scripting languages for valuable functionality. Another XLink-specific linking feature included in the example above is the “actuate” attribute. The actuate attribute determines when an XLink should load. It can be set so that a link loads as the page loads, as in the example above, or only when requested by the patron. Traditionally, links in the HTML anchor tag have loaded “onRequest” and links in HTML image tags have loaded “onLoad.” In the above example, there is also an “image” element which has a link to the graphic that should be displayed when a patron visits the page. This behavior is similar to how an image tag in HTML behaves, but rather than having to put the image in an HTML “img” element, it can be put in an element named logo, snapshot, or thumbnail. The image could also have been put on a tag named “image,” as in the example above. To demonstrate the flexibility this gives website creators, consider some of the other possibilities for an “image” element. If the actuate attribute is set to “onLoad” and the show attribute is set to “new,” a new window would open with the image when the patron’s browser loaded the page. XLinks can also be used in conjunction with XML databases to dynamically serve the most recent information to a library’s patrons.
Suppose, for instance, a library’s reference department maintains a bibliography of resources that are popular with that library’s patrons. Rather than maintain a paper bibliography that essentially duplicates the information already in the library’s catalog, why not consider XLinks to dynamically pull bibliographic citations, marked up in XML, from the library’s catalog? When the “show” and “actuate” attributes are set to activate the link “onLoad” and display the result embedded in the currently displayed page, it is easy to see how XML could be used to pull citations from the catalog automatically.
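A sketch of such a dynamically populated bibliography entry (the element names, catalog file, and ID value are all hypothetical) might look like:

```xml
<bibliography xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- The citation marked up in the catalog is pulled in as the page loads -->
  <citation xlink:type="simple"
            xlink:href="catalog.xml#id(23)"
            xlink:show="embed"
            xlink:actuate="onLoad"/>
</bibliography>
```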
While much work remains to be done on the XLink standard, one can see that it could simplify the management of information in ways that simple HTML linking has not been able to do. Linking is of the utmost importance to the success of XML. While the XLink standard has not driven many complete implementations, there is still a great deal of interest in it. There are also many partial implementations available.
XPointer
Of all the technologies discussed in this chapter, XPointers are probably the most experimental. XPointers provide a means of locating and accessing specific parts or fragments of a document for the purpose of linking to them. The XPointer specification was designed for use with XML’s linking technology, but it can also be used independently of the XLink specification. XPointers are built on the foundations laid by the XPath specification; they locate portions of a document by using XPath to traverse that document’s hierarchical structure. XPointers, for those librarians familiar with the mechanics of the World Wide Web, are expressed as URI (Uniform Resource Identifier) fragments. However, unlike HTML’s use of URL (Uniform Resource Locator) fragments, XPointers do not require an anchor tag to be placed in the document into which the XPointer links. Instead, XPointers rely on XML’s uniform structure to reference nodes and ranges of nodes within the XML structure of the target document. The following examples consist of one HTML link and three XPointer links:

http://lane.stanford.edu/bibliography.html#section1
http://lane.stanford.edu/bibliog.xml#root().child(1,reference)
http://lane.stanford.edu/bibliog.xml#root().child(1,reference).child(2,author)
http://lane.stanford.edu/bibliog.xml#id(23).child(3,author)
Since XPointers are expressed as URI fragments, in a URL they follow the pound sign. Those familiar with HTML will recognize URI fragments as the place where HTML anchor information is also stored. The HTML link above requires that a “section1” anchor tag be placed in the “bibliography.html” document. This is not true of the XPointer links in the example because they rely on the structure of the XML document. For instance, in the first XPointer link, the path starts with the root element. From there, the XPointer points to the first “reference” element that is a child of that root element. This is the target of the pointer. The second XPointer link is only slightly more complicated. After finding the same “reference” found in the first XPointer link, this path indicates the XPointer should point to the second “author” element of the selected reference element. The full path of any XML document can be traversed in this manner. XPointers, like XPaths, do not have to start with the root element. Instead, they can reference an element that has been identified by an ID attribute. The last XPointer link in the example above demonstrates a case of pointing into an XML document using the ID attribute. This XPointer finds an element whose ID equals 23. Once this has been located, the XPointer points to the third author element that is a child of the element whose ID equals 23.
For backward compatibility with HTML, the XPointer specification also allows XPointers to start with an “html(anchor)” root that will look for HTML anchors that have already been defined in an HTML document. The first link in the following example is an XPointer with this functionality. The XPointers below it demonstrate the other relative locations that can be used to point to particular parts of an XML document.

http://lane.stanford.edu/bibliography.html#html(citation1)
http://lane.stanford.edu/bibliog.xml#root().descendant(4,title).psibling(2,author)
http://lane.stanford.edu/bibliog.xml#root().descendant(3,title).fsibling(2,author)
http://lane.stanford.edu/bibliog.xml#id(14).ancestor(2).preceding(1)
http://lane.stanford.edu/bibliog.xml#root().descendant(1).following(1)
Descendant, ancestor, preceding, following, “psibling,” and “fsibling” are all navigational methods that can be used, along with “child,” to link to different relative points within an XML document. The “descendant” instruction selects from any of the content or child elements of the currently selected element. The “ancestor” instruction selects from elements that are above the current element in the XML hierarchy. “Psibling” selects from elements that precede the current element in the context of the same parent element. “Fsibling” selects elements, within the same parent element, that occur after the currently selected element. “Preceding” and “following” search through all the elements in the XML document that are found before or after the currently selected element. For instance, the second XPointer in the example above references an XML document by locating the root element. From there, the “title” element, which occurs in any of the children of the root element, is selected. This will find title elements that occur right beneath the root and those that occur several layers beneath the root. The XPointer stops when it finds the fourth occurrence of a title element. From there, the XPointer references the second author element that occurs at the same level in the hierarchy. The “author” found must occur prior to the selected title element. If there is no second author element that occurs before, and at the same level in the hierarchy as, the currently selected title element, this XPointer will fail to point to anything. The following example shows an XML document that does have an element to which this XPointer can point.

http://lane.stanford.edu/bibliog.xml#root().descendant(4,title).psibling(2,author)

<bibliography>
  <reference>
    <author>Clarke, Caitlin</author>
    <title>Stories of a Renegade Librarian</title>
  </reference>
  <reference>
    <author>Miller, Dick</author>
    <title>XML: Strategic Opportunities for Libraries</title>
  </reference>
  <reference>
    <author>McCormick, Robert</author>
    <title>XML Case Studies</title>
  </reference>
  <reference>
    <author>Buttner, Mary</author>
    <author>Yates, Charles</author>
    <title>XML Tools for the Library</title>
  </reference>
</bibliography>
Given this XML document, the XPointer listed would point to the author element with “Buttner, Mary” as its value. As demonstrated, the “descendant” and “ancestor” operators move through the hierarchy based solely on their position in relation to the currently selected element. “Psibling” and “fsibling” move out from elements that are within the same parent element as the selected element; “preceding” and “following” move through the document without regard to hierarchy. If “preceding” and “following” are used in an XPointer, imagine the XML document as if it were on a sheet of paper. Anything above the current element would be accessible by using the “preceding” operation, and anything beneath the current element would be accessible through “following.” To summarize, XPointer allows us to point into a document even if an element’s name is unknown. XPointers may point to a particular named element based on its relative position from the root element, from an element with an ID attribute, or from an existing HTML anchor tag. Nodes can also be selected by their type: element, comment, attribute, etc. XPointers can also be used to conduct string searches in order to link to a particular string value in a document. A location relative to a search term can be returned, or an XPointer can return all the occurrences of a term. The following example provides a sampling of these features. In the first pointer, the first occurrence of Caitlin in the document is referenced. In the second, the first occurrence of Miller is found, but the pointer is set six characters after that; since Miller is six characters long, the pointer would be set to the space or word following Miller. The last XPointer would find all occurrences of the word Bessie in the XML document.

http://lane.stanford.edu/bibliography.xml#string(1, "Caitlin")
http://lane.stanford.edu/bibliography.xml#string(1, "Miller", 6)
http://lane.stanford.edu/bibliography.xml#string(all, "Bessie")
Using XPointers to link to particular string values comes in handy if a document has large amounts of CDATA or contains XML that is not well-formed. In conclusion, many XML-related technologies are built on the foundation that XPath provides. Being able to reference parts of an XML document allows XPointers to locate specific nodes and node ranges, and allows XLinks to make use of XPointers to link flexibly into documents that are not owned by the library. XPath provides the foundation for an XML query language and the referencing scheme needed by technologies that display and transform XML into other formats. Of all the XML-related technologies, perhaps none is as useful as XPath. However, XPath is a supporting technology; it is rare that XML users use it directly. Yet to create an XPointer or write a stylesheet that turns XML into XHTML requires familiarity with the XPath standard.
DOING IT WITH STYLE(SHEETS)
Despite all of XML’s wonderful characteristics and dazzling technologies, very few library patrons would want to read raw XML to determine where their book is located. For this reason, something must be done with XML to make it more visually accessible to library patrons. Fortunately for all, this is one of XML’s greatest strengths. By separating the structure of information’s content from its presentation or display, XML enables librarians to repurpose library information for a variety of contexts. Suppose, for instance, that a library’s bibliographic catalog could output XML. Information about the serials that a library holds can be extracted in XML and then used in a variety of ways. PDF (Portable Document Format) files of serial holdings can be printed and handed out at the reference desk by the library’s reference librarians; the library webmaster can generate static HTML pages with the digital serials’ 856 fields converted into hypertext links for display on the library’s website; or, with a little assistance from the library’s JavaScript guru, a simple search engine can be created using an XSLT stylesheet to make the serials list available to patrons who do not wish to search the catalog. There are several ways that XML can be presented or “transformed” to make its content more accessible. The first way is to change XML into something that can be displayed in a library patron’s web browser; for instance, an XML fragment can be transformed into a PDF or HTML document using XSL. The second possibility is to give the patron’s browser a set of instructions for displaying the raw XML in a user-friendly manner; this set of instructions usually takes the form of an XSL (Extensible Stylesheet Language) or CSS (Cascading Style Sheets) stylesheet. The first method of XML presentation is often referred to as “server-side” because the XML transformation usually takes place on a web server.
The second is known as “client-side” because the processing usually takes place in the patron’s web browser. These distinctions are somewhat flexible, though, since an XSLT stylesheet, which is usually used for server-side display, can also be used on a client that has no connection to the
“The Nice Thing about Standards . . .”
Internet. The main distinction is that, with the first, the XML document is transformed and, in the second, it is only formatted for display.
Getting Started

XML is transformed into HTML or XHTML by means of a stylesheet. Using this set of rules, documents created under one schema or DTD can be converted into documents that use a second schema or DTD, with each element, attribute, etc., of the original schema being converted to an element or attribute specified by the second schema.

This section discusses three types of XML stylesheets: XSLT, CSS, and XSL FO stylesheets. XSLT (XSL Transformations) and XSL FO (XSL Formatting Objects) are actually subsets of XSL (Extensible Stylesheet Language), which defines how XML should be transformed or prepared for presentation. CSS (Cascading Style Sheets), which was originally designed for HTML, works with XML because XML and HTML share a similar syntax and history. Generally, CSS and XSL are used for the client-side display of XML, and XSLT is used for server-side transformation of it. However, XSLT can also be used on a client machine to generate a static HTML result from an XML source, or by a librarian or library application developer to experiment with what a server-side transformation would look like without actually having to load the document and its stylesheet onto the server. Again, this distinction is somewhat arbitrary, so choose the best stylesheet option for the task at hand; in most cases this will depend on the type of data or information that needs presentation.

Most XML that is conveyed over the Web is displayed after being transformed by an XSLT stylesheet. There are, of course, plenty of examples on the World Wide Web that use XSL or CSS stylesheets to display XML natively in the patron’s browser, but since support for the client-side display of XML is less common, transforming XML into XHTML on the server is currently the best way to format XML for presentation; XHTML, a well-formed version of HTML, can be displayed in almost any browser available today.
If client-side display is desired, as in the case of full-text documents that patrons download and view while disconnected from the Internet, CSS stylesheets are the better choice because most browsers support them. As XML becomes more widely accepted, XSL will, one hopes, be used more frequently as a presentation format. For the present, unless the patron is downloading a large XML document with a relatively small stylesheet, preparing an XML fragment for display in a patron’s browser is probably best done with XSLT; the result can then be served to patrons from the library’s web server.
XSLT Stylesheets

Fortunately, working with XSLT is only slightly more complicated than working with XML. This is because XSLT, like most XML-related technologies, is itself implemented in XML: a set of predefined elements and attributes is used in XSLT stylesheets to perform particular transformations on XML documents. One of the most common XSLT transformations is turning XML into HTML or XHTML.
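A minimal sketch of that idea: an XSLT stylesheet is just an XML document whose instructions are elements in the xsl namespace. The page title and output shown here are illustrative placeholders:

```xml
<?xml version="1.0"?>
<!-- A skeletal XSLT stylesheet: every instruction is itself an XML
     element in the xsl namespace. The template matches the document
     root and emits a bare HTML page. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <head><title>A Serials List</title></head>
      <body><!-- transformed content goes here --></body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```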
To illustrate this, let’s start with information that has been marked up in XML. In the following example, a serials list has been extracted from a library’s catalog for inclusion on the library’s website.

```xml
<?xml version="1.0"?>
<SerialList>
  <Serial>
    <Title>Abdominal imaging</Title>
    <Name>Springer</Name>
    <Street>175 Fifth Avenue</Street>
    <City>New York</City>
    <State>NY</State>
    <Enumeration>
      <StartingVolume>1</StartingVolume>
      <StartingDate>1996</StartingDate>
    </Enumeration>
  </Serial>
  <Serial>
    <Title>Differentiation</Title>
    <Name>Springer</Name>
    <Street>175 Fifth Avenue</Street>
    <City>New York</City>
    <State>NY</State>
    <Enumeration>
      <StartingVolume>60</StartingVolume>
      <EndingVolume>65</EndingVolume>
      <StartingDate>1996</StartingDate>
      <EndingDate>1999</EndingDate>
    </Enumeration>
  </Serial>
  <Serial>
    <Title>Disease-a-month</Title>
    <Name>Mosby</Name>
    <Street>11830 Westline Industrial Dr.</Street>
    <City>St. Louis</City>
    <State>MO</State>
    <Enumeration>
      <StartingVolume>47</StartingVolume>
      <StartingDate>2001</StartingDate>
    </Enumeration>
  </Serial>
</SerialList>
```
The first step in transforming this XML data into HTML is to identify units of information that need to be displayed together. Deciding how to display the information is a matter of determining the “granularity,” or size, of the data chunks. Some information may be finely grained, while other types may be more coarsely grained. For example, when getting an overdue list from a circulation system that can export XML, a librarian will probably be interested in small chunks of data: the patron’s name, a patron ID, and a list of items the patron has charged out. When pulling bibliographic information from a cataloging system that supports XML, a librarian will probably be interested in larger chunks of information: bibliographic records containing, among other things, authors, titles, links to holdings records, descriptions, publishers’ names, content notes, and possibly links to other related bibliographic records.

In the above example, the main unit of information is the serial record. Since the serials are grouped in a container element called SerialList, the XSLT stylesheet should pull information at the Serial element level. In XSLT, for repetitive data, this is accomplished with an XSL “for-each” statement. All XSLT elements and attributes carry the “xsl:” prefix because they belong to the xsl namespace. Since XSLT is a subset of XSL, any of the XSLT examples presented in this chapter should also work with browsers that support XSL. Most libraries, though, will probably want to generate a static HTML page from the XML; that page can be periodically updated by running a new SerialList XML fragment, from the database, through the stylesheet. [ . . .omitted to conserve space. . . ]
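The omitted statement presumably resembles the following sketch, which selects each Serial element beneath the root SerialList element; the element names come from the serials list example above:

```xml
<!-- A sketch of an xsl:for-each statement: select every Serial
     element that sits directly beneath the SerialList root. -->
<xsl:for-each select="/SerialList/Serial">
  <!-- processing for one serial record goes here -->
</xsl:for-each>
```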
The XSLT statement in this example tells the XSLT processor to select the Serial element that occurs beneath the root SerialList element and to pull information from that level of the XML hierarchy. The XSLT processor is also instructed to select this position for each Serial element that occurs in the document. So, in the case where a document has many serials, like a serials list would, each is selected one after the other for processing by the XSLT processor. Once this is done, processing can be performed on this fragment of XML data. Returning to the example above of the serials list, all Serial elements will have the same structure, even though not all child elements are required in each occurrence of
the Serial element. In the example, an Enumeration may have a starting volume and date but need not have an ending volume and date, because most of the serials displayed on the list are currently being received by the library. This is controlled by the serials list DTD or RELAX NG schema. Since the schema formalizes the structure of the serials list, documents can be tested for the occurrence of elements, and the content of those elements can then be inserted between or into HTML tags. XML allows us to use the document’s formalized markup to generalize the data and present it for display:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <head><title>A Serials List</title></head>
      <body bgcolor="white">
        <xsl:for-each select="/SerialList/Serial">
          <p><xsl:value-of select="Title"/></p>
          <xsl:for-each select="Enumeration">
            Vol. <xsl:value-of select="StartingVolume"/>
            (<xsl:value-of select="StartingDate"/>)-
            <xsl:if test="EndingVolume">
              vol. <xsl:value-of select="EndingVolume"/>
              (<xsl:value-of select="EndingDate"/>)
            </xsl:if>.
          </xsl:for-each>
        </xsl:for-each>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```
The first tag in this XSLT stylesheet indicates that the document contains tags belonging to the XSLT namespace. Since all such tags in the document belong to the XSLT namespace, there are no other namespace definitions in the root element. The second element in the document is the “xsl:template” tag. This tag marks the beginning of an XSLT action; there may be many templates in a single XSLT document. Another way of constructing the same result is to use XSLT’s ability to create elements and attributes. The following example demonstrates this approach:
```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:element name="html">
      <xsl:element name="head">
        <xsl:element name="title">A Serials List</xsl:element>
      </xsl:element>
      <xsl:element name="body">
        <xsl:attribute name="bgcolor">white</xsl:attribute>
      </xsl:element>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
```
This example should look familiar to readers with HTML experience. In it we demonstrate creating HTML using both methods (i.e., including HTML tags and creating them from XSLT statements) to show that the stylesheet’s author has a choice. Using HTML tags directly is more concise, but using the XSLT representation of an element or attribute allows dynamically assigned names and values to be used if needed. In the XSLT example above, HTML elements are created with the “xsl:element” element, and attributes for the HTML elements are created with the “xsl:attribute” element. A stylesheet that creates HTML from XSLT statements is considerably longer than stylesheets in other languages the reader might have seen, CSS for instance, but it is not too difficult to understand.

Once the basic structure of the HTML document is built from the XML source, it is time to insert the XML data elements that will convey the important information of what the library owns to its patrons. To start, identify the desired level of granularity. In the following example, the Serial element is the main unit of information.
[ . . .omitted to conserve space. . . ]
This XSLT example finds every Serial element that is a child of the root SerialList element and creates an HTML tag containing the title of each occurrence. The “p” tag in HTML is the same as saying, “Start a new paragraph here.” An HTML table could be created instead of simple paragraphs, but since not all readers will be familiar with HTML, the example is kept as simple as possible.

Once the basic HTML structure and the title of each serial are output to the HTML page, information about which issues the library holds needs to be added. Observant readers may have noticed that the serials list in XML form depicted earlier also contains information about the journals’ publishers. While this information could be included on the serials list, it is not necessary. XSLT stylesheets are also good for extracting only the relevant information from an XML source; so if, as in this case, not all the original data is relevant, parts of it can be omitted from the XSLT transformation. One thing most libraries would want to include is each serial’s holdings enumeration. It is also important to be able to conditionally display ending dates and volumes if the journals have them. The following part of the stylesheet does this:
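The fragment under discussion presumably resembles the following sketch, which emits an HTML paragraph containing each serial’s title; the element names come from the serials list example:

```xml
<!-- A sketch of the title-extraction step: one HTML paragraph
     per serial, containing only the Title element's content. -->
<xsl:for-each select="/SerialList/Serial">
  <p><xsl:value-of select="Title"/></p>
</xsl:for-each>
```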
```xml
<xsl:for-each select="Enumeration">
  Vol. <xsl:value-of select="StartingVolume"/>
  (<xsl:value-of select="StartingDate"/>)-
  <xsl:if test="EndingVolume">
    vol. <xsl:value-of select="EndingVolume"/>
    (<xsl:value-of select="EndingDate"/>)
  </xsl:if>.
</xsl:for-each>
```
By selecting the enumeration data with an “xsl:for-each,” information from before and after a range of dates that includes a gap in coverage will be handled correctly. Each time an Enumeration is found in the XML data, the HTML will display the volume and date range on its own line in the client’s browser. By inserting parentheses around the StartingDate and EndingDate and a hyphen after the StartingDate, the result looks like the structure seen in many library catalogs: Vol. 1 (1965)-v. 24 (1971). Unlike the information in the catalog, however, there is a clean separation between style and content.
XSL FO Stylesheets

In addition to XSLT, the Extensible Stylesheet Language (XSL) specification describes a type of stylesheet that can tell an XML-aware application how to display a native XML document. This differs from XSLT in that the XML is not transformed into another format; the XSL stylesheet simply instructs the displaying application how to present the data. When an XML fragment is processed by an XSLT stylesheet, both the stylesheet and the XML document are given to the XSLT processor. When an XML document should be displayed with a certain XSL stylesheet, the XSL processor or browser needs to be told which stylesheet to use. This is done with an XML processing instruction named “xml-stylesheet.” An xml-stylesheet instruction also has attributes that tell the browser what type of stylesheet to use and where that stylesheet is located; the location is indicated with a hypertext reference, as in HTML.

```xml
<?xml version="1.0"?>
<!-- The href value naming the stylesheet file is illustrative. -->
<?xml-stylesheet type="text/xsl" href="serials.xsl"?>
<SerialList>
  <Serial>
    <Title>Abdominal imaging</Title>
    <Name>Springer</Name>
    <Street>175 Fifth Avenue</Street>
    <City>New York</City>
    <State>NY</State>
    <Enumeration>
      <StartingVolume>1</StartingVolume>
      <StartingDate>1996</StartingDate>
    </Enumeration>
  </Serial>
  [ . . .other serials omitted to conserve space. . . ]
</SerialList>
```
While XSLT should be used for information that is viewed online, XSL FO (XSL Formatting Objects) has significance for libraries that want to distribute full-text documents marked up in XML. An XSL FO stylesheet can tell an XML-aware application how to display an XML document on a computer that may or may not, at the time, be connected to the Internet. Though XSL FO is currently not supported by many applications, this may change as XML continues to grow in popularity. Now that an instruction to use an XSL stylesheet has been added to the list of serials, an XSL FO stylesheet to display the list should be created:
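The stylesheet presumably resembled the following sketch, which combines XSLT instructions with XSL FO display instructions; the page geometry values and the master name “simple” are illustrative assumptions:

```xml
<?xml version="1.0"?>
<!-- A sketch combining XSLT and XSL FO: page-height, page-width,
     margins, and the master name "simple" are illustrative. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <xsl:template match="/">
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name="simple"
            page-height="11in" page-width="8.5in" margin="1in">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="simple">
        <fo:flow flow-name="xsl-region-body">
          <xsl:for-each select="/SerialList/Serial">
            <fo:block font-size="14pt" font-family="serif">
              <xsl:value-of select="Title"/>
            </fo:block>
          </xsl:for-each>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>
</xsl:stylesheet>
```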
The first element in this stylesheet again identifies the namespaces used in the document. Since this document uses both XSLT and XSL FO tags, both namespaces are included in the root element. Also present in the root element is an attribute from the XSLT namespace that tells the XSL processor to output the result in the XSL FO (“fo”) namespace, because display instructions for an XML-aware application are being created. Just as in the XSLT example, the next element is the template element, indicating that a stylesheet action is about to begin.

The next section of the stylesheet, the first set of elements from the XSL FO namespace, is new. In this set of elements there is a “fo:root,” indicating the root element of the result. Next, there are “layout-master-set” and “page-sequence” sections. The layout-master-set, as its name suggests, sets the master layout for the display. In this case the layout is a simple page, but more complex documents can have different master pages for first, left, and right pages, front matter, back matter, and more. Next, the XML content is placed on copies of the generic master page using the page-sequence container. While these concepts may be new to many, those familiar with publication software will recognize the basic framework of ideas:
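A sketch of the page-sequence container under discussion; the master name “simple” is an illustrative assumption, and the flow name follows the XSL FO specification:

```xml
<fo:page-sequence master-reference="simple">
  <fo:flow flow-name="xsl-region-body">
    <xsl:for-each select="/SerialList/Serial">
      <fo:block font-size="14pt" font-family="serif">
        <xsl:value-of select="Title"/>
      </fo:block>
    </xsl:for-each>
  </fo:flow>
</fo:page-sequence>
```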
This page-sequence container (repeated here from the preceding example) contains a sequence specification that determines the order in which different page masters should be used. The next segment, the “flow” element, contains the actual content to be displayed. Within the flow element, the XSL tags loop through the serials list document and extract information, in this case just the serial titles. Just as in the XSLT example, the “for-each” element finds the Serial level in the SerialList hierarchy and then selects each serial’s title by using a “value-of” element with an attribute that selects the Title element. What is unique to this stylesheet is that once the serial’s title is selected, display instructions for an XML-aware application are created.
The XSL stylesheet says that each title should be treated as a single block of information. This information has characteristics that are defined as attributes of the “block” element. As was stated in the first chapter of this book, attributes are best used to describe metadata about an element. The attributes on an element in the “fo” namespace are also part of the XSL FO specification. By adding these attributes, the XML-aware application is instructed to display the title in a fourteen-point font from the serif family. Since this is encapsulated in a “for-each” loop, each title gets the same treatment.

XSL FO stylesheets hold great promise, but for the time being their most practical use is to describe how an XML document should be turned, with the assistance of an XSL FO-aware application, into a PDF file. Unfortunately, support for XSL FO in the browser market is incomplete. If a library needs to display XML documents on the Web using a client-side solution, its best bet is to use a Cascading Style Sheet; CSS is supported in all modern browsers and in many older versions of the most popular browsers as well. Probably the best solution for displaying XML on the Web, though, is to use XSLT to transform it into XHTML.
CSS Stylesheets

CSS (Cascading Style Sheets) was created as a way to add style to HTML documents. It was only moderately successful at this because HTML already includes, among its tags, elements that tell a browser how to display content. XML, on the other hand, has no elements that dictate how content should be displayed. As a result, CSS can control display preferences without conflicting with any tag in the marked-up content: any CSS style can be applied to any XML tag.

Take, for instance, the serials list from the previous two subsections. To display that list in a patron’s browser using a CSS stylesheet, an XML processing instruction must be included at the beginning of the file to indicate that the file should be displayed with the selected stylesheet. The processing instruction looks just like the one added in the XSL stylesheet example, except that its type is “text/css” instead of “text/xsl.”

```xml
<?xml version="1.0"?>
<!-- The href value naming the stylesheet file is illustrative. -->
<?xml-stylesheet type="text/css" href="serials.css"?>
<SerialList>
  <Serial>
    <Title>Abdominal imaging</Title>
    <Name>Springer</Name>
    <Street>175 Fifth Avenue</Street>
    <City>New York</City>
    <State>NY</State>
    <Enumeration>
      <StartingVolume>1</StartingVolume>
      <StartingYear>1996</StartingYear>
    </Enumeration>
  </Serial>
  [ . . .other serials omitted to conserve space. . . ]
</SerialList>
```
Unlike XSL FO and XSLT, CSS is not expressed in XML, because CSS was invented before XML. Unfortunately, this means that if CSS stylesheets are to be used with a library’s XML documents, someone in the library will need to learn a different format. Fortunately, since CSS stylesheets are used with HTML, the library’s webmaster may already have experience with them. Though CSS is better supported in browsers than XSL FO, some browsers do not fully support the CSS specification; in fact, frequently one browser supports some CSS styles and another browser supports others. Before applying a CSS stylesheet in production, be sure to check the results of the display in a variety of web browsers and browser versions.

Since displaying XML with a CSS stylesheet is similar, in many ways, to displaying XML with an XSL FO stylesheet, many of the same ideas appear here. Unlike an XSL stylesheet, however, a CSS stylesheet is very condensed: there are not a lot of structural tags, just the rules for formatting the content. This has advantages and disadvantages. Concise stylesheets travel over the Internet more quickly, but because they lack the uniform structure of an XML document, changes to them must often be made by hand rather than with tools that can automate the process based on the stylesheet’s structure. The first CSS stylesheet displays the serials list in a library patron’s browser:

```css
SerialList {
  display: block;
  border-style: groove;
  border-width: 1px;
  padding: 1em;
  width: 4in;
  height: 3in
}
Serial {
  display: block;
  margin-bottom: 10px;
  background-color: #CCCCCC;
  position: relative;
  padding: 1em
}
Title {
  display: block;
  font-size: 14pt;
  font-weight: bold
}
Name {
  display: block;
  text-indent: 0.5in;
  text-decoration: underline
}
Street, City, State {
  display: block;
  text-indent: 0.5in
}
Enumeration { display: block }
StartingVolume, StartingYear { color: green }
EndingVolume, EndingYear { color: red }
```
The first thing one might notice about this CSS stylesheet, other than the fact that it does not use XML, is that style is assigned based on the names of the elements in the XML document. Some elements, like Name, have their own style notations, while others, like Street, City, and State, share their styles. This is a convenience so that stylistic information does not need to be repeated for each element to which it applies. The next thing that should be apparent is that almost every element has a “display: block” style. This tells a browser to display that unit of information together with its child elements; each block of information appears on its own line when viewed in the patron’s browser.

There are, of course, many more types of CSS styles than can adequately be explained in this section or even in a single chapter. There are styles that can be applied to pages, backgrounds, and most any other aspect of page layout, but they will not be discussed here; to learn more about these options, visit some of the websites listed in this book’s bibliography. The two types of styles discussed here pertain to blocks of information and to textual data within those blocks. The first example demonstrates a few of the CSS styles that can be applied to textual data in an XML document:

```css
Title {
  font-size: 14pt;
  font-weight: bold
}
Name {
  text-indent: 0.5in;
  text-decoration: underline
}
StartingVolume, StartingYear { color: green }
EndingVolume, EndingYear { color: red }
```
In this example, the Title element contains text that needs to be emphasized so that it will stand out from other information on the page. Since the title is the main unit of information on a serials list page, this makes it easier for patrons to scan the list. To do this, CSS provides font characteristics that can be set; in the example above, the font’s size is increased and its weight is set to bold. To further highlight the title, the text that immediately follows can be indented, making it easier to scan down the list and look only at titles. For this, the CSS “text-indent” property is used; the amount of space that a browser should indent the text follows the colon. In this case, half an inch is enough to draw attention to the title.

XML can be formatted in many of the same ways that text can be formatted in HTML. In HTML, there is an underline tag that indicates its content should be underlined; with XML, the content of any element can be underlined by assigning a CSS property to that element. In the example above, the text-decoration of the publisher’s Name element indicates that its content should be underlined. Colors can be assigned to text, much as one would do with the “font” HTML element. By grouping the StartingVolume and StartingYear elements, a green color can be assigned to them; another color can be assigned to the grouped EndingVolume and EndingYear elements.

Did you notice that, in the earlier XSLT example, parentheses and a hyphen were added to indicate which numbers were volumes and which were years? Since XSL FO and CSS stylesheets only assign style to the value of an XML element, this is not possible with CSS or XSL FO alone. XSL FO does have the advantage, as seen in the XSL FO examples, that it can be combined with XSLT instructions, which gives it a slight edge over CSS. The other type of style assigned with the CSS stylesheet in these examples is the type applied to blocks of information.
Blocks of data are set off as units of information by line breaks. Even if an element has no stylistic characteristics apart from those of its child elements, it may still receive a “display: block” value to indicate that all of its children should be displayed together on their own line. This is the case with the Enumeration element in the example below: it is a container element that serves to group its child elements, the starting and ending values, together. Applying a CSS “display: block” value to the Enumeration element will display the dates and volumes as intended.

```css
SerialList {
  display: block;
  border-style: groove;
  border-width: 1px;
  padding: 1em;
  width: 4in;
  height: 3in
}
Serial {
  display: block;
  margin-bottom: 10px;
  background-color: #CCCCCC;
  position: relative;
  padding: 1em
}
Enumeration { display: block }
```
This example illustrates that blocks of information can be formatted based on their position relative to each other and to the page as a whole. The Serial element has a position that is relative to each Serial element displayed before it. This means that, as each serial unit is prepared for presentation, a margin can be set to apply to the Serial that follows. By giving the Serial elements a bottom margin of ten pixels (a display unit used by computer monitors), a space is created between each one, which makes the serials list easier to read. A Serial block can also be highlighted by setting its background to a color different from the page’s background color. This emphasizes the blocks of information and makes individual serial units easier to read. If this is done, however, a layer of padding needs to be added between the text in the block and the shaded box created by this effect.

In addition to the border effect created by shading the background of each serial, it is also possible to create actual boxes that can help distinguish the list from other things on an HTML page. By assigning the root element a “border-style” and “border-width,” a line can be drawn around the serials list. In the example above, a grooved box was created, but a box with a solid or dotted line could have been created instead. As with the “virtual” box created by changing background colors, this real box needs padding to make its presentation a little more pleasing to the eye. Once this is done, the last thing to set is the size of the root block, which is now enclosed in a grooved box. This simple example of how to display an XML document with a CSS stylesheet is complete once the width and height of the SerialList element have been assigned. Unfortunately, the dates and volumes do not look as nice as they did in the XSLT example.
This is because neither XSL FO nor CSS alone adds the hyphens and parentheses to the data; adding such display components to the data itself would sacrifice the ability to reference content and style independently of each other. More browsers support CSS than XSL, so if a library needs to display XML in a patron’s browser in an environment where the patron would not have access to a web server, CSS might be the better choice. On the other hand, the library’s XML could be turned into HTML, or a PDF, with the assistance of an XSLT stylesheet. The choice depends on the needs and resources of the library’s patrons.

To conclude, there is a wide variety of XML-related technologies. These technologies help programmers work with XML and help patrons make sense of the information in their web browsers. They assist web developers who want to link
into documents they do not own; they facilitate libraries’ ability to share their formalized data models with other libraries and Internet-accessible information/data centers; and they provide a consistent referencing scheme that can be used as the foundation for an advanced query language, or any number of other XML technologies. They do this by building on the flexibility and strengths of XML.
In the Scheme of Things
While many projects may rely on ad hoc XML document structures, a formal schema is paramount for more serious or longer-term endeavors. A schema identifies the elements and attributes that are permitted in a document and specifies their allowed sequences and nesting; it thus defines the rules by which an XML document is structured. Schemas are particularly useful for validating documents that contain large amounts of data, since they ensure that the data is in the correct formats and that all of the required data is present. Even temporary projects might benefit from the exercise of schema development, since the schema’s definition of elements and its review of their relationships result in a more considered and potentially more stable document structure. Moreover, such documents may eventually need to be extended or integrated into other XML structures; preliminary planning will go a long way toward making schemas, united via XML namespaces, function together effectively.

More formal efforts in a field such as library science must carefully consider the ramifications of the many decisions made during the development of a schema. Greater or lesser flexibility and extensibility are inherent in any schema’s design. Furthermore, the context in which a schema exists is fundamental: existing or potential schemas and inter-schema relationships need to be considered, and careful schematic decisions may permit the reuse of information in novel ways. XML transformation tools provide a degree of flexibility in that they make it relatively easy to transform one XML structure into another, within the limits of granularity and coherence of the initial structures. However, the definition of the same, or similar, elements in related schemas remains critically important. Ideally, a profession would take responsibility for the development and coordination of a suite of schemas relevant to its own particular field.
At this early stage in a transition to XML-based information systems, it is likely and desirable that various alternative schemas will develop within any one field. This is a healthy step in the exploration of the range of possibilities, and pitfalls, in realizing a truly robust suite of
schemas. Unforeseen products and services may be developed more readily, or with greater difficulty, depending on the decisions made. Rather than just attempting to translate existing knowledge structures directly into XML, we have a strategic opportunity to redefine these structures in order to support future information systems. Combining the best of what we have learned from years of data definition and library automation with XML’s advantages could result in a more uniform infrastructure upon which to build systems now and for years to come. The proverbial ounce of prevention in the up-front effort of schema development may be worth many pounds of cure for information systems down the road. Because XML’s use is growing exponentially, information structures based on today’s decisions will inevitably encounter, and necessarily have to interact with, larger XML structures in the future. It is ultimately a matter of infrastructure design, a design in which librarians need to take a more expansive view and a more involved role.

Schema development is an intellectual exercise and requires a knowledge of content and of the functional relationships between data structures. It is commonly carried out by technical staff who are familiar with either libraries or application programming, but it would ideally involve librarians with a technical bent; alternatively, librarians can work closely with technical staff to achieve a balance. Knowing the library field is more important than knowing XML. Librarians with a basic knowledge of XML can play significant roles in schema development, even though librarian-programmers are likely to play the most influential roles. The main task is to get under way. The rapid prototyping of a schema often reveals whether an idea will work, and problems look different once partial solutions are in place. Creating smaller, and thus easier, schemas can build the confidence needed to develop larger ones.
One schema leads to another and raises the issue of how multiple schemas interrelate. As XML begins to pervade an operation, the need to coordinate efforts with similar endeavors, both within and between institutions, will arise. Even developing a modest schema leads inevitably to the issue of standards.

Reaching agreement on standards is often an arduous process. Existing library standards have been honed over time. These standards affected the development of online systems, which in turn influenced the standards once practical applications of the systems were better understood. Now the circumstances have changed again: with the advent of the Web, this process is bringing long-separate systems into virtual proximity. Each step forward has made the need for tighter integration and closer cooperation more concrete. This pattern of change in response to system capabilities suggests that it should be possible to agree upon standard schemas more readily in the future. These need not be rigid schemas that limit our possibilities. Instead, new schemas can be designed that promote a common infrastructure yet have sufficient flexibility to allow, and even encourage, extensions to meet new needs as they arise. Successful strategies will be incorporated into standard schemas, in much the same way that web technologies continue to evolve.

The aim of this chapter is not to explain the nuts and bolts of any particular schema language, as these can far exceed the complexity of XML itself. Instead, it briefly sketches what is involved in schema development and tackles the more complex issues
In the Scheme of Things
93
of context. These issues focus on the library environment and stress the importance of interworking schemas. Many schemas are needed, but the core one, relating to bibliographic resources, is the one most critical to the effective communication of library information on the Web. Because current practices are entrenched, we devote considerable space in this chapter to an analysis of MARC and AACR, and we identify various problems in these cataloging standards. We contrast these with XML-based solutions, and we advocate the need for fundamental change in order to achieve a viable replacement schema. The chapter concludes with an overview of XOBIS, an exploratory schema developed using the RELAX NG schema language. XOBIS takes a fresh look at the issues concerning web-oriented management of bibliographic and authority information, and is illustrative of schema development in general.
SCHEMA DEVELOPMENT Creating a schema is a lot like solving a puzzle. Puzzles have many similarities, but because each puzzle is unique, there is no one solution that fits them all. An initial, crucial step is to establish a schema’s outside boundaries or constraints. This determines what pieces will fit and makes it easier to find the best fit for the remaining parts. As with some puzzles, there may be more than one solution that works. Others may be like brainteasers, where the apparent solution to one part of a problem is incompatible with the solution to a dependent problem. Scrapping both approaches and devising a solution from a new angle may be in order. Determining which solution is the “best” one for a real-life puzzle is more of an art than a science. Like solving some puzzles, schema development can be frustrating at times, but also fun and rewarding if approached with the right attitude. The structure of XML documents can become quite complex, although they are composed of a limited set of simple XML building blocks. XML schema languages work in much the same way. Achieving a workable schema is somewhat of a balancing act between structural and functional demands. Adhering to XML rules and good practices results in sufficient structural cohesion needed to validate documents associated with the schema. After all, validation is the primary function of a schema. However, unless the content’s structure is optimized to meet the functional requirements of a given context, the schema may not adequately address the problems it was designed to solve. Creating a workable schema can be extremely challenging and requires a specific knowledge of one’s field, a detailed familiarity with data relationships, and a thorough understanding of associated applications. To give the reader some idea of what is involved in schema development, the following steps in the process offer some general guidelines based on the authors’ experience. 
Depending on the scope of a schema, there can be many false starts, blind alleys, and redesigns. To avoid disappointment, it is best to start off squishy and not try to firm things up too early in the process. The steps listed below are closely intertwined and cannot be considered too independently of one another. Schema development is seldom linear.
The Development Process Step-by-Step

Review
1. Review the functional requirements and the planned purpose(s) of the data and documents that the schema will validate. Questions will inevitably arise that cannot be answered effectively based on technical considerations alone. Understanding environmental factors helps focus issues as the schema evolves. Is the solution a quick and dirty one, or is it part of a larger effort? XML structures can be reused. It is never too early to think about their modularity.

2. Review existing schemas, DTDs, and alternate methods of encoding the data/documents. A simple database or even a spreadsheet might suffice in some circumstances. Others may have already developed a schema for the same purpose. Although related schemas may not provide a precise solution, they may contain structures that can be emulated without having to completely reinvent the wheel. Currently, there are many document models available, some more established than others. There will necessarily be a period of exploration. At this early stage of XML adoption in libraries, do not be afraid to try different approaches. XML is a new medium and offers librarians opportunities to help in the development of many needed schemas. XSLT transformations should permit the mapping of data from interim schemas into more robust ones as they emerge.

Design
3. Choose a schema tool. Document Type Definitions and two XML-based schema languages—XML Schema and RELAX NG—are covered in chapter 2. The authors recommend the RELAX NG schema language because it is the least complicated of the three.

4. Determine the contextual boundaries that the proposed schema will encompass. What at first appears to be a single schema may be better rendered as two or three separate schemas interrelated by namespaces. Divide and conquer is a good rule of thumb, particularly when a schema construct may be reused. It is easier to combine schemas for another purpose than to split an existing one after it is in use. On the other hand, choosing too small a context for a schema can lead to trouble by making it easy to ignore relationships to other areas; inevitably another project arises that overlaps and disrupts the original schema. Little information in libraries exists out of context.

5. Analyze a wide variety of sample data and existing documentation to identify core elements and groupings of elements within them. Think atomically. Discrete elements and groups of elements are easier to assemble into larger structures. They are also easier to address when the resulting XML documents are processed. Clarity is paramount. Defining the same or similar elements differently in different parts of a schema will lead to trouble. Do not overlook implied elements that may not occur in sample data, or that may be “understood” but not well documented, if at all. This step can be thought of as an inventory of the elements to be defined in the schema.

6. Analyze the relationships and interplay between the element groups identified. Consider both the known and potential interactions and relationships between the
elements. This is easier when looking at container elements, where the substructure gets carried along for the ride. This process can help identify overlapping constructs, potential conflicts, and opportunities for improved clarity.

7. Analyze the functional implications of decisions in order to provide as much flexibility as possible. How will the constructs being devised support a particular envisioned functionality using the data or documents? A different combination of elements and attributes may support more than one function just as easily as two separate constructs. Logical organization, clear boundaries, and reasonable granularity go a long way toward supporting functionality even when the potential uses of the documents are not known.

8. Identify patterns and principles in the process and try to continually achieve economy and coherence of expression. Focus on key components first. Do not be too quick in fixing structural decisions; especially when dealing with larger structures, it may take a while for patterns to emerge. Systematically working through all the parts of a schema before trying to finalize one’s decisions permits greater flexibility in reorganizing the structure as new relationships are identified. Focusing on each part will bring new factors to the forefront. Principles may emerge that can inform apparently unrelated decisions and make for a crisper design. Check to see if a pattern recognized in one part applies in others. Be patient. The number of components is usually limited, and with a little doggedness things eventually seem to click into place using such an iterative process.

Test
9. Mark up examples of the data or documents as early as practicable. Sometimes ideas do not look as good in practice as they seem in the abstract. Actually seeing how different approaches look as markup may change decisions. Because an XML document has an implicit document model, it is possible to reverse-engineer a schema based on comprehensive sample documents. Alternating these pragmatic and abstract approaches can help hone a schema. This might also be a good time to begin exploring stylesheets.

10. Test the “completed” schema and consider making it available to the XML and library communities for review and feedback. The real test of a schema will come as XML documents are created using it. Are all the required elements and attributes reasonable? Has anything been omitted? It is not uncommon to continue to tweak a schema well into the implementation stage. After internal review, sharing the schema widely will allow others to benefit from your experience and can provide valuable feedback from different perspectives.

Implement
11. Once a schema appears stable, focus on populating the structure with data or text. Editors may be developed for specialized data entry. If data exists in another format, it can be mapped to the new structure; this experience is another opportunity to shake down the schema. However, these processes need to conclude for a stable
implementation. Editors and maps should be designed flexibly, but stability should be the goal at this stage.

12. Use identified software or develop interfaces to retrieve and present documents. Once documents are in a stable XML format, there are myriad possibilities for what can be done with them. Different schemas may require different approaches.

Individuals’ preferences will vary in tackling schema development. Try out different ideas and see what works in your situation. It is important to get your feet wet. A little experience goes a long way toward building confidence and changing your impressions of what a problem looks like. Appearances, especially to the novice, are deceiving. To get under way, talk to those who have some experience. Advocates of XML in libraries are always interested in sharing their expertise. Even those with considerable experience can benefit from checking to see if someone else has already solved a problem. The XML4Lib electronic discussion list is an excellent place to get started and to find answers if you are stuck (XML4Lib 2001– ). Schema decisions will be colored by how permanent a solution is intended to be. A modular structure with reusable components will pay off over time. Developing an overall strategy for gradually migrating to XML will require a greater consideration of contextual issues.
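The reverse engineering mentioned in step 9 can be sketched in a few lines of Python. The sample record and its element names below are invented for illustration, not drawn from any published schema; the point is only that a document's implicit model can be collected mechanically:

```python
# A minimal sketch of inferring a rough content model from a sample
# document (step 9). The sample record and element names are invented.
import xml.etree.ElementTree as ET

sample = """<record>
  <title>Putting XML to Work</title>
  <creator role="author">Miller, Dick R.</creator>
  <creator role="author">Clarke, Kevin S.</creator>
  <subject>XML (Document markup language)</subject>
</record>"""

def infer_model(xml_text):
    """Record each element's observed child elements and attributes."""
    model = {}
    for elem in ET.fromstring(xml_text).iter():
        entry = model.setdefault(elem.tag, {"children": set(), "attributes": set()})
        entry["attributes"].update(elem.attrib)
        entry["children"].update(child.tag for child in elem)
    return model

model = infer_model(sample)
print(sorted(model["record"]["children"]))     # ['creator', 'subject', 'title']
print(sorted(model["creator"]["attributes"]))  # ['role']
```

Such a skeleton is only a starting point: it says nothing about repeatability, order, or optionality, which still require human judgment.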
LIBRARY INFORMATION IN CONTEXT A context tends to be complex. Much of this complexity derives from overlap— differing factors, views, interests, coverage, etc., relating to the same entities. The many different ways of slicing the pie invite collision or conflict. Rather than trying to reconcile troublesome issues, problems are typically avoided by carefully selecting, narrowing focus, isolating, postponing, or otherwise ignoring the problematic realities of a given situation. Such strategies may permit considerable progress to occur. When issues cannot be ignored, resorting to compromise can serve to diffuse conflict, especially when perennially popular political, economic, and turf issues are involved. However, there are no truly safe havens. Previously non-controversial endeavors can be thrust into the limelight when conditions change.
Open Content Conditions for libraries have changed! The introduction of the World Wide Web has dramatically and irrevocably altered the information environment. Once-comfortable methods and arrangements and seemingly clear boundaries have vanished or necessarily must be questioned. The library world is no longer so discrete. However, the new threats can be looked at as new opportunities. With instant information everywhere, libraries need to reassess their role and focus on strategies for thriving under the new circumstances.
The outlook can sometimes appear bleak. The current, and still unresolved, crisis in scholarly journal publishing is a vivid example of the clash between nonprofit and for-profit interests over access to information. A bitter controversy has pitted academic libraries against commercial publishers over the continually rising prices charged by the latter for subscriptions to scholarly journals, especially those in science, technology, and medicine. Libraries have found it increasingly difficult to accommodate the ever-rising cost of these journals in their budgets. In response, scholars and librarians have banded together in several cooperative ventures to publish reasonably priced, high-quality journals on a nonprofit basis. These alternative publishing outlets aim to cut out the commercial publishers that, in effect, acquire the copyrights to scholarly authors’ work and then sell it back to university and college libraries in the form of scholarly journals at exorbitant markups in price. Perhaps the most ambitious of these open-access initiatives is the Public Library of Science (PLoS), an organization of scientists committed to making the world’s scientific and medical literature a public resource. PLoS is poised to challenge the traditional publishing model by charging modest fees to authors after peer-reviewed papers are accepted for publication. PLoS has received $9 million from the Gordon and Betty Moore Foundation and endorsements of open-access publishing (with financial implications) from the Howard Hughes Medical Institute and the Information Program of George Soros’s Open Society Institute. PLoS is prepared to waive the fees when necessary as an operational expense. The published papers will be freely available for all legal purposes, with copyright used as a tool to provide authors with control over the integrity of their work and to ensure proper acknowledgment and citation. Look forward to PLoS publications appearing in the second half of 2003.
Peter Suber, a champion of open access, maintains the Free Online Scholarship website (Suber 2003) and has written a recommended overview for librarians (Suber 2003b). Other open-access efforts involving scholarly communication include the British open-access publisher BioMed Central, the National Library of Medicine’s PubMed Central open archive, the Budapest Open Access Initiative, the Scholarly Publishing and Academic Resources Coalition (SPARC), the Open Archives Initiative, the Coalition for Networked Information, and the Digital Library Federation. Stanford University’s HighWire Press has encouraged society publishers to make their retrospective content freely available.
Open Libraries? The fragmentation of access to content that open-access initiatives target extends into the fabric of libraries, especially regarding metadata. Library catalogs are usually separate from periodical indexes, which are separate from one another. Individual libraries maintain separate catalogs. Integrated library systems (ILS) have attempted to provide unified views of locally available resources, but each proprietary system has done so in different ways. As a result, various separate licensed resources shadow the ILS, often having their own search systems, discrete user files, and notification services. Sometimes valiant local efforts have succeeded in providing additional routes to
unique content via websites or specialized databases. Despite these laudable individual efforts, the aggregate variations are far less than optimal. The overall effect tends to be one of duplicated effort, incompatibilities, increasing fragmentation, apparently arbitrary restrictions, and a confusing degree of overlap in the various approaches. Given the realities of highly mobile populations, an emphasis on interdisciplinary studies, the deemphasis on resource location inherent in digital resources, and the competition from other fields and the commercial sector, libraries need to consider the future ramifications of following this “information silo” model. In struggling to address many of these limitations, libraries have variously joined bibliographic utilities, formed consortia, created union catalogs, merged databases, loaded records from resource files, licensed content, and developed various web interfaces in efforts to create broader and more coherent offerings. But coherence has remained elusive. Economic factors rather than user satisfaction continue to drive many decisions. Some silos are free, some charge fees, and some are subsidized. A new twist in this scenario is free and direct access to the bibliographic resources of national libraries, although barriers to copying these records remain. An alien visiting Earth might wonder why we have free access to metadata for paid resources, paid access to metadata for free resources, and numerous permutations relating to complicated fee structures, partial access (e.g., abstracts only), date of material (either current or retrospective free), etc. Are we trying to facilitate and stifle access to information simultaneously? Do the old access models work in a significantly different digital environment? Could the same effort and expense be leveraged to produce more effective bibliographic access? What will work best for libraries in the open digital environment of the Web remains to unfold. 
It may help to think of libraries collectively as needing enterprisewide solutions. Centralized solutions building on the success of the bibliographic utilities, or perhaps a less rigid distributed database model, might flourish. Customized topical portals or interfaces for subsets of users might also be effective. In any case, a more homogeneous substrate for managing information could support various subset views of a more coherent whole. Such individual views could continue to address the aims for which the individual products were separately developed. Although the high-level context can be bewildering, there are three major areas where XML as the lingua franca of the Web can help with interim strategies. Investing in XML solutions at each of these complementary levels enhances the likelihood of their supporting more integrated solutions in the future.

Content. Document-centric XML formats promote the consistent markup of full-text articles, archival finding aids, books, etc., in digital repositories or local websites. Standards in this area are beginning to gel, so there is no need to reinvent the wheel. To avoid problems of impermanence, metadata embedded as content may best represent a source of uniform and reliable content by which to support the automated extraction/harvesting of metadata maintained separately.

Metadata. Data-centric XML formats provide added value for the bibliographic control of digital resources, local and remote, and traditional library materials. This metadata differs from embedded metadata in being more readily subject to ongoing
management. However, the scope should be more inclusive than traditional cataloging. New standards are needed. Debate continues over whether this type of XML should serve as a storage format or as an intermediate format generated on-the-fly from relational databases.

Access. XML-associated technologies make it easier to build digital libraries, information portals, web interfaces, etc. Most of these tools also use XML to leverage the homogeneity afforded by content and metadata that share the same syntax. Software applications are easier to design, alter, and redesign for new purposes without necessarily having to change existing data structures. It is likely that XML access and processing technologies will continue to improve, while underlying data and document structures will be honed and gradually stabilize. This will protect investments in XML markup. In the meantime, librarians should work toward the coordination of content, metadata, and access beyond individual libraries. Forging ahead independently is not unduly risky, however, since XML’s design facilitates transformations based on the predictable characteristics inherent in XML, regardless of the source.
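A small sketch suggests why forging ahead is relatively safe. The following stand-in (written in Python rather than the XSLT that would typically be used) maps one element vocabulary into another; both vocabularies are hypothetical, chosen only to show the mechanics:

```python
# Stand-in for an XSLT mapping of an interim metadata vocabulary into a
# later, more robust one. Both element vocabularies are hypothetical.
import xml.etree.ElementTree as ET

MAPPING = {"title": "Title", "creator": "Name", "date": "DateIssued"}

def transform(source_xml):
    """Copy mapped child elements of a record into the new vocabulary."""
    target = ET.Element("resource")
    for elem in ET.fromstring(source_xml):
        if elem.tag in MAPPING:
            ET.SubElement(target, MAPPING[elem.tag]).text = elem.text
    return ET.tostring(target, encoding="unicode")

print(transform("<item><title>Annual Report</title><date>2003</date></item>"))
# <resource><Title>Annual Report</Title><DateIssued>2003</DateIssued></resource>
```

Because both sides of such a mapping are predictable, well-formed XML, markup created under an interim schema is not a dead end.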
Think Globally, Act Locally The issues are much the same on a smaller scale, and even within the microcosm of a single library. What appropriately belongs within the scope of a single schema can vary considerably depending on the treatment of related information. Many libraries are at an awkward stage of transition to digital resources and face the more difficult proposition of managing traditional operations while trying to launch new initiatives. Figure 3-1 identifies some potential library schemas and the functional relationships that might exist between them in such a transitional environment. Each local environment has its own characteristics. Understanding how the existing components are organized and interrelate will make it easier to plan new initiatives. Redundancy between existing data and new web solutions may be unavoidable, but maintaining separate troves of information in different formats can lead to maintenance problems. This is especially true of volatile information, such as addresses. Consider an academic library that wants to develop a faculty profiles website. Records for this population can be siphoned from a data flow from the university, supporting the library’s Circulation module, to form a base with ongoing updates. Salient information may reside in an authority file. Linking between authorities and the profiles could leverage the investment. Another such example relates to a Circulation Reserves module. Its academic course listings could form the basis for linking additional resources on a website. Other challenges relate to listing web resources separately from the catalog, maintaining web authentication separately from user files, and storing customized user profiles for recurrent searches. These may be derived or coordinated to avoid confusing users. The structured data in all these cases can be converted to XML rather easily. But this kind of coordination can be a challenge, especially when considering integrated library systems.
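As an illustration of how easily such structured data converts, a row from a hypothetical faculty-profile feed (the field names are invented for this sketch) can be rendered as XML generically:

```python
# Converting a row of structured data to XML. The record and field
# names here are invented for illustration.
import xml.etree.ElementTree as ET

profile_row = {"name": "Chen, Lee", "department": "History", "email": "lchen@example.edu"}

def row_to_xml(row, record_name="profile"):
    """Wrap each field of a flat record in an element named after it."""
    record = ET.Element(record_name)
    for field, value in row.items():
        ET.SubElement(record, field).text = value
    return ET.tostring(record, encoding="unicode")

print(row_to_xml(profile_row))
# <profile><name>Chen, Lee</name><department>History</department><email>lchen@example.edu</email></profile>
```

The conversion itself is mechanical; the coordination problems discussed above lie in deciding which system owns the data and how updates flow.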
Many library web projects overlap with areas covered by ILS systems. Each ILS, with variable degrees of integration, tends to function in isolation. There is uneven support for ad hoc interfacing; Z39.50 connectivity for searching falls short of its unifying promise; and proprietary interfaces for record import/export, etc., vary in effectiveness. ILS vendors have had to scramble to cope with the introduction of the Web, particularly in the areas of licensing, authentication, and statistics relating to digital usage. More flexibility may materialize as vendors begin to utilize XML. Future ILS systems would benefit from defining clearer, open communications between components, rather than tightly integrating them as inseparable ones. Ideally, a suite of schemas dealing with the common threads in all kinds of library information would foster better management of the underlying data and documents. Otherwise, the array of solutions resulting from different integrated library systems, various stand-alone licensed aggregated resource suites, and local and cooperative web initiatives portends increased fragmentation and confusion. Better integration and more
Figure 3-1 Possible Library Schemas in a Transitional Environment © Lane Medical Library, Stanford University
coherent solutions could take advantage of XML’s reusability. This would facilitate the creation of customized interfaces for selected subsets of resources. Each could be tailored to meet the different and changing needs of various audiences and projects. Despite appearances, the underlying information is more alike than not. Bibliographic and authority control issues, which are usually at the core of the problem, are discussed next.
BIBLIOGRAPHIC AND AUTHORITY RECORDS The Web brings traditionally discrete information into virtual proximity. Much of this information is bibliographic and authority data. Because of its centrality, the remainder of this chapter focuses on this critical core. First, we include a brief overview of selected schemas. Then, because MARC and AACR widely prevail in governing the creation and encoding of such data in libraries, we include extensive coverage of their implications for XML schema development. Lastly, an introduction to the authors’ experimental schema, XOBIS, explores the potential for achieving a new XML schema, better suited to the digital environment of the twenty-first century. The National Library of Medicine (NLM) has demonstrated international leadership in its early, extensive, and rapid adoption of XML, which it uses as a format for the dissemination of its millions of records and for some internal communications. The MEDLINE Document Type Definitions (DTDs), begun in 1999, were followed by one for Medical Subject Headings (MeSH). In the transition, the MEDLINE DTDs were enhanced for improved inter-record relationships. These stunning developments emphasized indexing records and focused on a unique library’s special needs. While informative, the result was thus not directly adoptable by other libraries. However, the NLM’s latest effort, the Archiving and Interchange DTD, promises to become a standard format in which publishers and digital archives can exchange full-text journal content (NLM 2003b). The Library of Congress (LC) has a long history of work with markup languages. Its continuing XML activity further recognizes XML’s growing importance in the library community. The LC explored SGML’s possibilities from 1995 to 1998, and then developed an XML schema that consisted of a literal mapping of each field in MARC to a counterpart element in XML. Each indicator became an attribute of that element. 
The Metadata Object Description Schema (MODS) appeared in 2002 (LC Network 2002c) and regrouped some MARC fields. It also burst many encoded fixed-field values into explicit terms. MODS is similar to the Dublin Core, ONIX from the publishing industry, and the Open Archives Initiative schema in covering only a subset of MARC tags. Some of these are lumped into single elements with a loss of subfielding. MODS is promoted for descriptive metadata in conjunction with the Metadata Encoding and Transmission Standard (METS), which is designed for other metadata involving the use and preservation of digital resources (Guenther and McCallum 2003). In June 2002, the MARCXML format appeared (LC Network 2003). It is remarkably similar to the initial LC effort. But literally encasing MARC in XML fails to take full advantage of XML, limiting its flexibility in manipulating the resultant markup.
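The literal mapping just described, in the style of LC's MARCXML design, turns each field into an element, each indicator into an attribute, and each subfield into a child element. A sketch (the sample content is this book's own title statement; namespace declarations are omitted for brevity):

```python
# Sketch of a MARC title field rendered in the style of LC's MARCXML
# design: field -> element, indicators -> attributes, subfields ->
# child elements. Namespace declarations are omitted for brevity.
import xml.etree.ElementTree as ET

record = ET.Element("record")
field = ET.SubElement(record, "datafield", {"tag": "245", "ind1": "1", "ind2": "0"})
ET.SubElement(field, "subfield", {"code": "a"}).text = "Putting XML to work in the library :"
ET.SubElement(field, "subfield", {"code": "b"}).text = "tools for improving access and management /"
ET.SubElement(field, "subfield", {"code": "c"}).text = "by Dick R. Miller and Kevin S. Clarke."

print(ET.tostring(record, encoding="unicode"))
```

The result is faithful to MARC, which is exactly the criticism: the tags, indicators, and subfield codes remain opaque numeric and alphabetic codes rather than self-describing XML names.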
International interest in bibliographic schemas using XML has remained high since Lam’s work in 1998 in Hong Kong (Lam 2001) and the French BiblioML, released in 1999. South Korean acceptance of XML has been phenomenal, especially due to its selection as the standard for electronic documentation in an e-government project in 1999. Related papers on XML schemas for bibliographic data continue to appear from Italy, the Netherlands, Portugal, Spain, and elsewhere. The Dublin Core, spearheaded by OCLC, is useful in emphasizing that documents should contain basic, discrete metadata (e.g., date, title), but it has limitations. The overlap in its Creator, Contributor, and Publisher elements is problematic; each appears to represent the relationships of people or organizations to works. The late addition of attributes has also caused some confusion. While useful for some purposes, the Dublin Core lacks sufficient detail to adequately accommodate ordinary complexities of bibliographic information. The schemas resulting from many community-specific efforts, such as that of the Health Education Assets Library, tend to be enumerative and prescriptive. The Visual Resources Association’s schema is noteworthy in combining simplicity and the crisp delineation of fundamental elements, as well as better provision for relationships than is usual (VRA 2002). References to schemas and related efforts not cited here appear in the Medlane Project bibliography (Lane 2002). It is also telling that many related schema development efforts are not occurring in libraries. After a period of exploration and implementational analysis, we anticipate that the emphasis will shift to coordination of the many related, exploratory efforts. MARC and cataloging rules have recently been under scrutiny, particularly in the Functional Requirements for Bibliographic Records, or FRBR (Delsey 2002; IFLA 1998) and an authorities extension (FRANAR). 
These admirable efforts provide an immense amount of information regarding the complex structure of MARC and cataloging rules. They identify many core concepts and issues, but while exhaustive and very informative, they take a more traditional approach than might be warranted in the coming era of digital libraries. Significant efforts in the archival and museum communities highlight the need for the coordination of all types of cultural heritage information. The Encoded Archival Description (EAD 2002) and companion Context (EAC 2003) and the Conceptual Reference Model (CIDOC 2003) parallel library efforts.
MARC: The Ultimate Crazy Quilt? MARC’s economy of expression stems from its origin in the 1960s as a vehicle to automate catalog card printing. At the time, data space was precious. This pioneering effort to develop a shared data format allowed libraries to be among the first to participate in distributed database development, beginning with Frederick Kilgour’s efforts at the Ohio College Library Center. With a history approaching forty years, MARC has necessarily grown to accommodate a broad array of data—much more than was originally considered in its design as a communications format for bibliographic data. Having been honed over time, it continues to serve diverse purposes and has a cadre of systems and technical-services
librarians familiar with many of its nuances. However, MARC’s broadened scope and gradual accretion of changes have introduced inconsistencies and complexities, which remain despite efforts at format integration and harmonization. MARC has continued to absorb changes within its idiosyncratic structure, but the effective limits of this process have been reached. MARC is becoming less coherent and is showing signs of entropy. MARC might be characterized as a big, old, rambling, yet comfortable house. Despite many additions and much remodeling, its age is increasingly apparent. New technologies relating to the World Wide Web have introduced fundamental changes in computing that cannot be ignored. Recent efforts to modernize MARC have mostly involved a literal, uncritical translation of its fields, indicators, and subfields into XML’s elements and attributes. Unfortunately, this incorporates MARC’s inherent problems into the new medium and prevents taking real advantage of XML. For libraries to remain competitive, librarians must expand their comfort zone. The time is overdue to stop remodeling and design a new edifice—one that can incorporate the best of what has been learned over the past forty years about data definition and functional relationships into a new, more sustainable framework designed for the World Wide Web. As with self-help programs, the first step is to admit that we have a problem. To help clarify the nature of this problem, the following sections illustrate some problematic features of MARC and contrast these with potential XML solutions. Discussing the particulars of MARC is challenging due to many conditional meanings, dependent on other data in a record, and the relatively low use of the majority of the fields. It is also difficult to discuss MARC without drifting into AACR2 cataloging rules, as coding and content are so intimately related. 
The library community has thus far tried to keep the two distinct: witness the American Library Association’s MARBI and CC:DA committees. MARC seems to have taken on a life of its own, far beyond encoding information expressed in a record constructed according to AACR.

Excessive Encoding and Control Fields
Until October 2001, MARC bibliographic records belonged to one of seven “formats” within the “integrated” bibliographic format dating from 1988: books, serials, computer files, maps, music, visual materials, and mixed materials. Some bibliographic utilities further distinguished sound recordings from music, since not all of the former are musical. Determining which of these formats applies to a particular record requires conditional logic involving MARC’s Type of Record (000/06) and Bibliographic Level (000/07). Integrated library systems categorize bibliographic records using logic along the lines of “if the value of the Type code represents language material, then the value of the Level code determines the format; else the value of the Type code determines it.” Accordingly, the value “am” represents language material that is monographic, and thus belongs to the Books format, even if it is a three-page pamphlet. The value “ib” represents a nonmusical sound recording that is a component-part serial. Currently, “integrating resources” are being introduced into this milieu to cover “updating” loose-leaf services, websites, etc. New types of resources, e.g., virtual reality products using haptic feedback, and combinations of types will continue to emerge, further stressing this arrangement.
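The conditional logic just described can be sketched in a few lines. This is an illustrative simplification only: the code mappings below are partial, and the function name is ours, not drawn from any integrated library system.

```python
# Illustrative sketch of deriving a MARC "format" from the Leader's
# Type of Record (000/06) and Bibliographic Level (000/07).
# The mappings below are partial; real systems cover more codes.

TYPE_FORMATS = {
    "c": "Music", "d": "Music", "j": "Music",
    "i": "Music",  # nonmusical sound recording; some utilities split this out
    "e": "Maps", "f": "Maps",
    "g": "Visual Materials", "k": "Visual Materials",
    "m": "Computer Files",
    "p": "Mixed Materials",
}

def marc_format(type_code: str, level_code: str) -> str:
    """If Type is language material, Level decides; else Type decides."""
    if type_code == "a":  # language material
        return "Serials" if level_code in ("b", "s") else "Books"
    return TYPE_FORMATS.get(type_code, "Unknown")

# "am": monographic language material -> Books, even a three-page pamphlet
assert marc_format("a", "m") == "Books"
# "ib": nonmusical sound recording, component-part serial
assert marc_format("i", "b") == "Music"
```

Even in toy form, the sketch shows why the arrangement is brittle: every new type of resource requires new branches in code like this rather than simply a new category value.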
Encoding basic record types with two byte codes is the tip of the iceberg. In order to handle materials such as video serials or digital monographs, their “additional material characteristics” are encrypted in a control field (006). As new code values are defined, increasingly esoteric combinations of values are possible. Typically, years pass before bibliographic utilities and integrated library system vendors complete the introduction of such changes. Not only does MARC accommodate complexity, it invites it.

The crux of the matter here is the repeatability of categories representing the form/genre of resources. Defining those categories should be the primary task. Clearly indicating that a video serial belongs to two broad categories should be fairly straightforward using repeatable variable-length values:

Category (Primary):     Visual Material
Category (Secondary):   Serial
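In XML, those repeatable categories could be ordinary elements rather than fixed bytes. A minimal sketch using Python’s standard library follows; the element and attribute names are hypothetical, not drawn from any published schema.

```python
import xml.etree.ElementTree as ET

# Repeatable, variable-length category elements in place of the
# Type/Level byte codes (hypothetical element/attribute names).
record = ET.Element("record")
for order, term in (("primary", "Visual Material"), ("secondary", "Serial")):
    category = ET.SubElement(record, "category", order=order)
    category.text = term

xml = ET.tostring(record, encoding="unicode")
assert xml == (
    '<record>'
    '<category order="primary">Visual Material</category>'
    '<category order="secondary">Serial</category>'
    '</record>'
)
```

A new type of material then means adding a term, not redefining byte positions and waiting years for vendors to catch up.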
Graphic interfaces for cataloging translate MARC codes into eye-readable values for data entry, but online catalogs must retranslate the codes for display and the limiting of searches. Typically, the values are not directly searchable. Bibliographic Level and Type represent the top levels of polyhierarchies that should seamlessly connect to the existing hierarchies of specific form/genre terms covered elsewhere in MARC. The more salient question may regard the extent to which we can rely on associated hierarchies of authority records versus recording both general categories and specific terms in individual bibliographic records.

Format integration, which astronomically inflated the possible codings for control fields, was only partially successful in eliminating existing inconsistencies in the definition of the same byte position in the overlaid formats. There are also instances of the same content coded differently by format:

008/34      Books     Biography (e.g., collective biography)
008/34      Serials   Entry convention (e.g., successive entry)
008/34      Visual    Technique (e.g., animation)
008/24-29   Music     Accompanying matter (e.g., biography)
Control fields predetermine that decisions must be made for each discrete byte sequence for a given “format.” For example, the record for each monograph states whether or not it is a Festschrift. Routinely having to code such facts gives disproportionate prominence to all “fixed-length data elements.” “Festschrift” is a form/genre term and would better be recorded only when applicable. Software could remind catalogers to consider such options, avoiding the phantom “Not a Festschrift” or worse “No attempt to code” values. Such historical precedent is simply unjustified and should not be perpetuated.

006/13   Books   Festschrift
008/30   Books   Festschrift
There are many other coded control fields, and the practice spills over into the variable-length fields (including their indicators) in several cases. A sampling of these follows:

008/15-17    Place of publication, etc. (also 044)
008/28       Government publication
008/24-27    Books: Nature of contents (up to 4 codes)
008/35-37    Language (also 041)
subfield 4   3-byte relator (relating a name to a work)
The values represented by the codes are not correlated with values recorded in “subject” fields (650, 651, 655), since many schemes may be used. For example, MARC (008/28) indicates that gazetteers should be coded as dictionaries, while LCSH (Library of Congress Subject Headings) treats them separately. Furthermore, the geographic area code (043) uses different codes from the place of publication code noted. Language codes, geographic codes, and relator codes (^4) are governed by lists of values rather than by authority records (LC Network 2002). This type of disjunction would have an adverse effect on retrieval if the values represented by the codes were searchable.

For most purposes, the practice of using coded forms of information is no longer justified as a space-saving or efficient keying strategy. This false economy impedes searching and adds overhead to presentation. The codes are largely redundant with entries under authority control. Brief forms or abbreviations could be included as variants on authorities to improve both data entry and retrieval.

XML schema implications: To be justified, data elements should be directly traceable to user retrieval. Considerable simplification of MARC without loss of content is possible by eliminating the redundancy between coding and explicit access. Recording the same information twice increases work and invites discrepancies. Cryptic coding in fixed and variable fields generally does not improve user access. The MODS schema recognizes this in its mapping of form/genre codes to terms (LC Network 2002c). Repeatability of terms eliminates the problem of limited code positions. Techniques to require a choice from among a set of required values are possible in an XML editor, perhaps as part of a generalized authority control mechanism. MARC’s economy of expression is counterproductive. XML has economy of structure but is verbose in conveying content. This matches the needs of today’s web environment.

Date Inconsistency
Dates represent an especially confusing area in MARC. There is insufficient space for a Y2K-compliant entry date (pattern yyyymmdd) in the 008 control field without redefining all subsequent bytes:

008:00-05   Date entered on file (pattern yymmdd)
008:06      Type of date/publication status
There are, however, multiple codings and formats for various kinds of dates. Such unnecessary variety challenges the processing of chronological information carried in MARC records. The following examples illustrate the range of formats and usages:

0. 005         20000102151047.0               yyyymmddhhmmss.f
1. 008:00-05   000102                         yymmdd
2. 008:07-10   1999                           yyyy
   008:11-14                                  yyyy
3. 008:06      e                              (detailed date code)
   008:07-10   1999                           yyyy
   008:11-14   1231                           mmdd
4. 260         ^c Dec. 31, 1999.              (transcribed date)
               ^c [2002]                      (supplied date)
               ^c c1956.                      (copyright date)
5. 263         ^a 200301                      yyyymm
6. x00         ^d d. 1632.                    (format varies)
               ^d 1948-2001.
               ^d 1240 or 41-ca. 1316.
               ^d 1942 Apr. 7.
7. x30         ^a . . . (1933)                (unsubfielded parenthetic)
               ^d 1963 Sept. 16.              (date of treaty signing)
8. 245         ^f 1903 Sept. 2-1907 Oct. 5.   (archival inclusive dates)
9. 246         ^f Jan. 1970-Apr. 1974         (date/sequential designation)
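Items 0 and 1 alone show the oddity: two adjacent processing dates in different patterns. Normalizing both to ISO 8601 is mechanical once the patterns are known; here is a sketch, in which the two-digit-year pivot is our own assumption, since MARC itself leaves the century of 008/00-05 ambiguous.

```python
from datetime import datetime

def iso_from_005(stamp: str) -> str:
    """Normalize an 005 update stamp (yyyymmddhhmmss.f) to ISO 8601."""
    return datetime.strptime(stamp.split(".")[0], "%Y%m%d%H%M%S").isoformat()

def iso_from_008_entry(yymmdd: str, pivot: int = 70) -> str:
    """Expand the two-digit year of 008/00-05; the pivot value is a guess."""
    yy = int(yymmdd[:2])
    century = 1900 if yy >= pivot else 2000
    return f"{century + yy:04d}-{yymmdd[2:4]}-{yymmdd[4:6]}"

assert iso_from_005("20000102151047.0") == "2000-01-02T15:10:47"
assert iso_from_008_entry("000102") == "2000-01-02"
```

With a single chronological element in a common format, neither conversion function would be needed at all.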
Item 0 above represents the update date of a record and item 1 a create date; both are related to processing rather than the content described in the record. It is curious that the formats for these two closely related dates differ. Item 2 represents the normalized “yyyy” date based on the descriptive or transcribed date recorded in 260 ^c, with the provision for a second yyyy left blank when only one date is present. Depending on the predefined code given in 008:06, the two dates can represent various types of dates. This leads to complexity when the data definitions need augmentation or refinement. Item 3 illustrates one specialized kind of date, with “e” representing a detailed date where the second yyyy is redefined to mmdd in order to accommodate the normalized version of the first descriptive date given in Item 4. There are fourteen different “type of date/publication status” values, which modify the meaning of the dates in 008:07-14.
However, the projected date of publication occupies a separate field with prescribed formatting shown in Item 5. Item 6 illustrates a few dates occurring in personal name headings. These are governed by cataloging convention and vary widely in format. Uncoded dates may also appear parenthetically after corporate names. Items 7–9 reflect dates occurring in titles, also prescribed and formatted according to cataloging rules. These cases are not exhaustive, e.g., field 518 accommodates Date/Time and Place of an Event.

Are so many codings and formats necessary to accommodate the actual variety of chronological data? Curiously, documentation introducing a new MARC tag, the Non-MARC Information Field (887), illustrates a Y2K-compliant date with XML markup! Incidentally, the embedded XML is illegal (start and end tags do not match).

887 __ ^a20000617

XML schema implications: XML’s emphasis on the reusability of information supports an approach to cleaning up MARC’s messy temporal data. Define a chronological element once and use the same format wherever dates are needed. Follow the ISO standard for the actual date values (Kuhn 2001). Some extensions may be necessary, e.g., calendar identification. Reflect date ranges as pairs, or potentially quartets, of chronological elements. Rely on stylesheets for achieving the standard’s display formatting, e.g., 2002-01-02. Use attributes to accommodate coded or descriptive metadata relating to an actual date, such as prefixes (e.g., c = copyright), punctuation (e.g., brackets = supplied), field tags (263 = projected publication), and subfields (245 ^f = archival inclusive dates). Alternately, define an optional descriptive version to accompany the formatted version of a date. Other than in description, do not segregate some dates by omitting coding (e.g., as a parenthetic qualifier to a name or title).

Intra- and Inter-Format Redundancy
A familiar feature of MARC is the provision of multiple, dispersed areas to record the same, or same type of, data. Routinely, data is transcribed as well as encoded. Although a single language code often suffices for a work, there are three places in addition to a subfield in uniform titles (^l) to record this information as conditions merit:

008:35-37      Language code
041            Language codes (field and subfields repeatable)
546 ^a         Language note (field repeatable)
x30, etc. ^l   Language of a work (pairs may occur, e.g., Greek & Latin)

There are seven separate, repeatable subfields to specify parts of a work referenced, e.g., the language of a table of contents, or to indicate the language(s) of a work used as the basis of a translation. While languages are spelled out in uniform titles (x30, 240, 243), a code is used to indicate the language of a translated title (242 ^y). The variations reflect description, controlled headings, and encodings.
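A single reusable element could carry the code and the readable form together, repeated as needed. A sketch follows, with hypothetical element and attribute names.

```python
import xml.etree.ElementTree as ET

# One language element, repeatable, with the ISO code as an attribute
# and the display form as content -- replacing a fixed field, a coded
# field, a note, and a uniform-title subfield (hypothetical names).
work = ET.Element("work")
for code, name in (("grc", "Greek"), ("lat", "Latin")):
    ET.SubElement(work, "language", code=code).text = name

assert ET.tostring(work, encoding="unicode") == (
    '<work><language code="grc">Greek</language>'
    '<language code="lat">Latin</language></work>'
)
```

The paired Greek and Latin values show that repeatability handles the uniform-title case (^l pairs) without any additional mechanism.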
Encoding form is extreme, with seven possibilities:

006           codes (various byte positions)
007           codes (various byte positions)
008           codes (various byte positions)
245 ^h        General Material Designator
300 ^a        Specific Material Designator
650 etc. ^v   Form subdivision (repeatable)
655           Index term—genre/form; various subfields
In addition to intra-format redundancy, there are also cases of redundancy across the five MARC formats for bibliographic, holdings, authority, classification, and community information. These formats are coordinated to limit confusion. However, only convention and the limited use of some fields prevent collisions.

Tag   Format          Name
678   Authorities     Biographical or Historical Data
545   Bibliographic   Biographical or Historical Data
545   Community       Biographical or Historical Note
XML schema implications: Simplicity is elegance. It is difficult to coordinate so many different approaches to recording the same information. Merging the values represented by codes into a scheme of form/genre terms would make them more directly accessible. Ideally, a more coordinated approach to constructing such a scheme would improve consistency and thus retrieval.
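The 678/545/545 overlap illustrates the simplification available: define one element and reuse it across record types. A hypothetical sketch, with invented names and example text:

```python
import xml.etree.ElementTree as ET

# A single note element, typed by attribute, usable in authority,
# bibliographic, and community records alike (hypothetical names).
def biographical_note(text: str) -> ET.Element:
    note = ET.Element("note", type="biographical")
    note.text = text
    return note

n = biographical_note("Surgeon and medical librarian.")
assert ET.tostring(n, encoding="unicode") == (
    '<note type="biographical">Surgeon and medical librarian.</note>'
)
```

One definition means one display rule, one index, and no possibility of the three formats drifting apart.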
Mingling Data Elements and Their Attributes
Most bibliographic data consists of two parts: the data content itself and information about the data. Clearly delineating content-bearing data elements from data regarding their properties or attributes simplifies information management. MARC blurs this fundamental distinction. In particular, two groups of subfields often comprise a single field. One group comprises the data content that the field ostensibly represents, while the other carries information about the content of the subfields in the first group. Similar to the second set of subfields, two indicator values may alter the meaning of the data in the first set of subfields. This admixture of elements and their attributes underlies much of MARC’s complexity. Adding to the confusion, two different data elements are often included in the same field to accommodate pre-coordination, especially for name-title entries. Having to provide encoding for a title in a field designated for names throws a wrench into the basic arrangement: one set of subfields and one set of indicators must attempt to do double duty. In the case of field 700 (Added Entry—Personal Name), the subfields fall
into five categories, shown below. MARC documentation merely lists these in alphabetical order and indicates repeatability, a further complication in this jumble.

Subfields (Count)   Group
abcdjq (6)          Name
eu4 (3)             About the name
fghlmnoprst (11)    Title
x (1)               About the title
3568 (4)            About the combination
Dissecting a “simple” catalog entry will help illustrate how a field with three subfields involves data intermingling. This display of a typical MARC field is followed by a breakdown of its components:

700 1_^aBillings, John Shaw,^d1813-1938,^ecollector.

Heading:        Billings, John Shaw   [surname, forenames]
                1813-1938,            [dates]
Relationship:   collector.            [relator term]

Coding:   700   [field: added entry—personal name]
          1     [indicator: surname (occurs first)]
          _     [indicator: blank = no information as to analytical entry]
          ^a    [subfield: surname, forenames]
          ^d    [subfield: dates]
          ^e    [subfield: relator term]
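Parsing such a string is mechanical; a rough sketch follows, using ^ to stand in for MARC’s actual subfield delimiter character, as the book’s examples do.

```python
# Split a MARC variable field into indicators and subfields.
# "^" stands in for the real subfield delimiter character.
def parse_field(data: str) -> dict:
    indicators, *subfields = data.split("^")
    return {
        "ind1": indicators[0],
        "ind2": indicators[1],
        "subfields": [(sf[0], sf[1:]) for sf in subfields if sf],
    }

field = parse_field("1_^aBillings, John Shaw,^d1813-1938,^ecollector.")
assert field["ind1"] == "1" and field["ind2"] == "_"
assert field["subfields"] == [
    ("a", "Billings, John Shaw,"),
    ("d", "1813-1938,"),
    ("e", "collector."),
]
```

Note what even this toy parser cannot do: the letters a, d, and e mean nothing without the external documentation that defines them, field by field.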
In this example, the field tag (700) actually resides in a directory at the beginning of the record with a numeric offset pointing to the location of the related data string. The tag itself carries two attributes of the value—recording that the value is a personal name and that it is an added entry. The data string itself conventionally begins with two bytes called “indicators.” (However, the record leader [000/10] defines the indicator count, which technically may vary from 0 to 9; RLIN has used a third indicator to indicate the authority control status of a heading.) The activation of additional indicators is unlikely. Next, ^a and ^d record the heading or entry, with ^a containing the inverted name with a comma separating the surname from forenames. UKMARC has separate subfields for surname and forename; the fate of this valuable coding in their conversion to MARC21
is uncertain. The dates in ^d are recorded, when available, at the time the heading is established, but are generally not altered unless a conflict with another name occurs later.

Lastly, the defamed ^e may indicate the relationship of a name to the work. This probably correlates with cataloging’s ambivalence toward recording relations for names. “Jt. Author,” etc., were removed from ^e long ago, and more recently “ed.” was banished, perhaps due to complicating authority control of headings. In Finland, the relator term is considered part of the heading for indexing purposes, e.g., to distinguish in which cases a particular person is a composer versus a performer. Although the use of ^e has eroded, repeatable three-digit relator codes (^4) have somewhat offset this by defining a wide variety of relations, including a value “edt” for editor and, interestingly, “aut” for author. There seems to be a mixed message here.

Incidentally, the comma before ^e illustrates conditional punctuation, a problem with ISBD (International Standard Bibliographic Description) punctuation in general. Were ^e absent, the comma would need to be replaced by a period (full stop). The comma following the surname can be used in conjunction with the first indicator value “1” as the basis for the automated separation of forename from surname with a degree of accuracy.

MARC coding adds more complexity than necessary to record the required data elements and their attributes. MARC tags, indicators, and subfield delimiters may appear to parallel XML’s markup of data values, but they do not reflect the basic distinctions between XML’s concepts of elements and attributes. Instead, the interpretation of the MARC tagging relies on definitions recorded externally in the format’s cumbersome documentation. Labeled displays in online catalogs reconstruct these meanings to an extent, but often do not render them for users at all.
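Mapped onto XML’s native distinction, the same heading might separate content (elements) from properties (attributes). The names below are invented for illustration, not taken from MARC or any published schema.

```python
import xml.etree.ElementTree as ET

# Data content as element text; properties of the data as attributes.
# All element and attribute names here are hypothetical.
name = ET.Element("name", type="personal", entry="added")
ET.SubElement(name, "surname").text = "Billings"
ET.SubElement(name, "forename").text = "John Shaw"
ET.SubElement(name, "dates").text = "1813-1938"
ET.SubElement(name, "relator").text = "collector"

out = ET.tostring(name, encoding="unicode")
assert "<surname>Billings</surname>" in out
assert 'type="personal"' in out
```

Because surname and forename are separate elements, as in UKMARC, no comma convention or indicator value is needed to split them, and the ISBD punctuation becomes a stylesheet concern rather than stored data.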
Disuse and sporadic application of many of the complexities in MARC coding indicate that they are not easily understood or applied. Though not evident in our example, there exists the further complexity of defining the same data in several places, e.g., personal names in 100, 600, 700, 800—with slight variations, e.g., the key for music (^r) is available for all except 100. There are also well-known variations in handling the same data, e.g., the nonfiling indicator count varies between the first and second indicators depending on indicator availability.

XML schema implications: MARC uses too many varying ways to distinguish metadata from data content. Separate elements and attributes are an intrinsic part of XML. Each individual element can have its own attributes, as well as those applying at the container element level. In some circumstances it may be advisable to treat attributes themselves as separate elements; a container element can then discretely organize a data element and its associated metadata. The fundamental separation of elements and attributes should be maintained regardless of which method is chosen.

Relationship Dispersion and Irregularity
At first glance, the MARC bibliographic format’s linking entry fields (76x-78x) would appear to represent relationships adequately. Although these fields cover many bibliographic relationships, the format is actually riddled with other relationships, some explicit and others implicit. A sampling of fields that often include bibliographic
relationships is included in the following list. They represent the wide range of variation in MARC’s treatment of relationships. (Further aspects of relationships are treated in the sections on AACR and XOBIS later in this chapter.)

4xx   *Series
500   *General Note
505   ***Contents Note
510   *Citation/References Note
530   ***Additional Physical Form Available Note
533   Reproduction Note
534   */**Original Version Note
544   Location of Other Archival Materials Note
545   ***Biographical or Historical Note
555   ***Cumulative Index/Finding Aids Note
556   **Information about Documentation Note
581   **Publications about Described Materials
7xx   *Added Entry (author/title, title, or uniform title)
830   Series
856   ***Electronic Location and Access

*     Includes subfield for ISSN
**    Includes subfield for ISBN
***   Includes subfield for URI
Most of these fields tacitly acknowledge the presence of a relationship by providing for a standard number or link to the related work. They appear to represent three approaches to recording relationships: (1) a uniform title or series added entry with optional ISSN (International Standard Serial Number); (2) an author-title added entry with optional ISSN; and (3) a note with optional ISBN, ISSN, or URI. The implications of these practices merit further emphasis, as provided below.

1. Not treating series as relationships highlights the difficulties resulting from reliance on authority control. Many series are cataloged as serials as well, essentially allowing two hosts for the same title: a serial and an authority. This built-in conflict extends to analytic records that mostly rely on authority control of a heading in field 830. As noted in the list above, field 830 lacks a numeric link, deferring to the ISSN of a parallel 4xx field to accomplish this. The repeatability of these fields further distances the numeric link from the uniform entry. Serial-analytic (parent-child) relationships need not be so complicated. Linking entries (773) handle similar inter-record relationships for component parts (000/07 = a or b) and subunits of collections (000/07 = d). A code to indicate an analytic of a series (000/07 = p) was discontinued long ago. Other codes for Bibliographic Level (000/07) represent the type of material rather than relationships, blurring the purpose of this pivotal field.

2. Relying on headings to collocate entries in an index in lieu of explicit relationships between records is at the root of the problem, both for series/uniform titles and
for author-title added entries. Although eye-readable, the relationships cannot be utilized readily by computer programs to support linkage between records. Author-title entries introduce the further complication of pre-coordination. Generally, numeric links to the related bibliographic records go unrecorded. Alternately, introducing a third record, an author-title authority, adds redundancy and further complication.

3. The encoding excess discussed earlier applies to many “notes” shown in the list above, some of which provide extensive subfielding to reproduce a coded description of the related work, especially in fields 533 and 534. Functionally related granularity is one thing, but is it really necessary for a note to have fifteen subfields? Two discrete records with simple reciprocal links would more than suffice.

In addition to series/uniform titles referring to related works, other access points comprise a major category of relationships in MARC. These are mostly implicit in linking a name, topic, etc., to an authority record. Actual numeric links, rarely seen outside of some library systems’ authority modules, are not defined in MARC. Authority records also contain relationships, primarily the “See Also from Tracing Fields” (5xx). The use of ^w to specify the kind of relationship is discussed in the next subsection. Authorities may contain other valuable information, particularly definitions and biographical or historical data, but their role in controlling a single form of heading is too often the only one in evidence. Creating an “authority” often consists of copying a name heading and identifying the bibliographic record where it originated in field 670. This practice is curious when so many names occur only once. A work may also contain explicit relationships to a name in MARC, although authorship is often implied.
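The “two discrete records with simple reciprocal links” suggested above could reduce to a single uniform linking element. A hypothetical sketch, with invented names and identifiers:

```python
import xml.etree.ElementTree as ET

# One linking mechanism for every kind of relationship: an explicit
# record identifier plus a typed label (all names here hypothetical).
rel = ET.Element("relationship", type="inSeries", target="rec0042")
rel.text = "Example series title"

out = ET.tostring(rel, encoding="unicode")
assert rel.get("type") == "inSeries"
assert rel.get("target") == "rec0042"
```

The reciprocal record would carry a matching element pointing back; the eye-readable text is a convenience for display, while the target identifier does the computable work that headings alone cannot.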
Personal and corporate names in a bibliographic record may have a relator term (^e) or repeatable relator codes (^4), while only the code is defined for meeting names. The recent effort to define and link relators more systematically is a key development, recognizing the importance of relationships beyond those between works. Unfortunately, the emphasis focuses on codes rather than explicit, accessible relationships. The intended role for relator codes is unclear. Relator codes (^4) are not defined for authorities. Similar to relationships in notes, the publisher relationship usually lacks an entry.

Electronic Location and Access (856 and ^u for URI in various fields) represents an explicit relationship to the physical location of a digital resource or a related one, much as a call number represents a relationship to a physical location. While field 856 contains a multitude of subfields, only ^u was included when sprinkling URIs in other MARC fields. The ^u also occurs in Authorities, e.g., field 678. Hyperlinking has introduced further variations in how relationships are handled in MARC.

Cataloging conventions and MARC’s inconsistent treatment of inter-record relationships thwart the development of more rigorous, and hence more useful, bibliographic systems. Energy spent on much of MARC’s esoterica could be redirected toward content enrichment of authorities and bibliographic records, and provision of a uniform, comprehensive method of recording bibliographic and authority relationships.

XML schema implications: Relationships are of such fundamental significance that their treatment should be considered separately. MARC handles relationships in too many different ways. Developing a new schema offers the opportunity to reconcile
the many variant methods. To build a more coherent structure, bibliographic entities need to be recorded in a more discrete/atomic manner, one identity per record. This would make it easier to define a single, crisp linking mechanism for all types of relationships. This could take advantage of XML’s linking technologies or merit a special structure to allow an element to contain more complex information about the name, type, quality, duration, and descriptive details of a relationship that do not properly belong to either of a pair of related records.

Extreme Coding Complexity
MARC reaches its zenith, or nadir, depending on how you look at it, when the available values for fields and subfields are exhausted. Perhaps the most exotic solution occurs when fixed-length codes are embedded in subfields of variable-length fields. Unfortunately, this occurs in the heavily used reference fields of the authorities format, obscuring the kind of relationship—a key piece of data:

4xx, 5xx   See and See Also Tracing Fields
  ^w/1     Special relationship (e.g., broader, narrower, earlier, later)
  ^w/2     Tracing use restriction
  ^w/3     Earlier form of heading
  ^w/4     Reference display

7xx        Heading Linking Entry Fields
  ^w/1     Link display
A variation is found in field 880 (Alternate Graphic Representation), where a formatted control subfield (^6) introduces a different script: ^6<linking tag>-<occurrence number>/<script identification code>/<field orientation code>. Note in the following example of two reciprocal headings that the “(N” indicates “Cyrillic” script:

100 1_ ^6880-01^a[Heading in Latin script]
880 1_ ^6100-01/(N^a[Heading in Cyrillic script]

Defining byte positions in subfields extends to the Bibliographic format:

76x-78x    Linking Entry Fields
  ^7/0     Type of main entry heading (e.g., personal, uniform title)
  ^7/1     Form of name (based on byte 0; e.g., family name, jurisdiction)
  ^7/2     Type of record (same as 000/6 values)
  ^7/3     Bibliographic level (same as 000/7 values)
In addition to the complexity of the value of byte 1 being conditional, based on the value of byte 0, this coding duplicates information from the related record. The four codes in ^7 (c1as) in the example below recapitulate that the entry is a corporate name (c) entered under jurisdiction (1) for the related language material (a) serial (s):
Author/Title for a Serial Record:

110 1_ ^aUnited States.^bGeological Survey.
245 10 ^aWater supply papers

Related Record’s Main Series Linking Entry to Above:

760 0_ ^7c1as^aUnited States. Geological Survey.^tWater supply papers

Despite the ingenuity of this technique to re-create coding of the related work’s entry, the subfielding remains inconsistent: the corporate entry differs from that of the same corporate entry in the related record’s link (^b appears in the 110, but there is no equivalent coding for ^b in the 760). In field 760, ^b is reserved for a related record’s edition, just one of many potential conflicts that led to this mess. Often such conflicts make it impossible to code entries and their counterpart linking entries identically in MARC. What purpose does this convolution serve when a control number subfield is available for unambiguous linking?

Also striking in the foregoing example is the pre-coordinated “entry” (110/245) for the work. This field combination is supposed to be equivalent to the linking entry:

110 1_^a^b + 245 10 ^a   =   760 0_^7c1as^a^t
It is a virtual entry that must be constructed for linking (in this case), indexing, display, or other processing. Most surprising of all, the entry representing a given work is not explicitly stated in its own bibliographic record. The combination of fields and subfields that comprise it must be ascertained from documentation and/or cataloging rules. The possibilities are astonishing: (100 or 110 or 111 or 130) and/or (240 or 243 or 245), with some combinations invalid. Beginning with these possible field combinations, the corporate author-title alone (110/245) presents 16 possible indicator values and 28 subfield possibilities from which to extract the entry. (Teasing apart MARC’s admixture of data and metadata is discussed further in the AACR subsection on “Pre-Coordination” later in this chapter.)

The practice of embedding coding in subfields is widespread in MARC. The following example illustrates the repeatability of a coded subfield. The coded value “2\c” in ^8 indicates that the LC subject heading applies to the second (2) “constituent item” (\c) listed sequentially in an accompanying contents note (505). Repeated subfields indicate that this is also true for the third and fourth constituent items, each of which is separated by conventional punctuation ( -- ) in field 505.

650 _0 ^82\c^83\c^84\c^aOperas^xExcerpts.

The next example is from the Foreign MARC Information Field (886). It illustrates two additional quirks. The first ^a below is defined as not repeatable since it represents the foreign tag (019), while the second ^a is repeatable—for each ^a value occurring in the original data. Additionally, ^b consists solely of two blanks; elsewhere blank subfields would be omitted. (The ^2 identifies this as Iberian MARC.)
886 2_ ^2ibermarc^a019^b__^aVG 586-1992

Until recently, field 041 contained up to six three-digit language codes lumped in a single subfield. This embedding was rectified by making field 041 repeatable and by making six additional subfields repeatable. Although an improvement, it still reflects fixed-length codes in otherwise variable-length subfields.

How does most of this exotic encoding benefit programmers, catalogers, or users? The extreme complexity invites errors in data entry and uneven application or omission entirely. Despite all the effort to define and encode this data with such a high degree of granularity, in most cases one thing is clear: the resulting dearth of clarity.

XML schema implications: MARC’s complexity may stem from trying to define too many different things within a single scheme. Only a relatively small percentage of tags are used frequently. Many additions have resulted from requests by groups addressing special needs or kinds of resources. Using a divide-and-conquer approach by focusing a core schema on bibliographic and authority mechanisms would narrow the problem. XML namespaces could be used to incorporate specialized information to provide in-depth coverage in areas such as rare books, music, maps, etc. The current complexity makes it difficult to manipulate and display data and hinders sharing it readily. To be effective, a “communications format” needs to be in tune with the Web’s instant communication.

Transitional Character Set
Character encoding is a complex issue. Currently, most MARC data is encoded in the “MARC-8 environment,” an amalgamation of character sets: ASCII, ANSEL extensions for selected diacritics and special characters, and various non-Roman sets. A complex mechanism based on field 066 (Character Sets Present) and various conventions permits the invocation of escape sequences to switch among these sets to key over 15,000 characters. MARC-8 works in controlled system environments. However, presenting MARC data on the Web has been problematic. Diacritics and special characters continue to cause display problems. Adding new characters has been glacial, considering the length of time it took for the spacing underscore and spacing tilde, which occur in URLs, to be incorporated in MARC.

With the advent of the Universal Character Set/Unicode (discussed in chapter 1), the MARC-8 group of characters was designated the “MARC 21 repertoire.” The value “a” was defined (000/09) and field 066 was excluded to allow the use of UCS/Unicode in MARC records. However, all other valid UCS/Unicode characters are not permitted in MARC for some indeterminate transition period.

XML schema implications: Unicode, XML’s only character set, is an international standard specifically designed to address the long-standing problem of character set incompatibilities. The restrictive MARC 21 repertoire is a tepid response to this significant development. Libraries should move quickly to adopt Unicode more fully, since the common availability of additional characters supported by operating systems, word processors, etc., is bound to lead to problems. Version 1.1 of XML allows all characters
not specifically restricted for cause, as new characters will continue to be defined. More information on Unicode is available in chapter 1 and in Lam’s description of handling multi-script metadata with XML (Lam 2001).

Other Limitations
The problems with MARC that have been identified here by no means represent all of them. MARC’s relatively flat structure makes hierarchical information difficult to reflect. Field length limitations vary from system to system. Unjustified granularity exists, while functionally indicated granularity is lacking. Several major renovations of the aging and overburdened MARC format have been tumultuous, time-consuming, and expensive to implement for all concerned. In aggregate, the problems are such that MARC should be thoroughly reassessed and rebuilt from the ground up. It effectively prevents libraries from taking full advantage of XML and related technologies, and puts librarians at a disadvantage in the competitive arena of information management. Despite its long life and useful contributions, MARC now represents more of a handicap than an advantage.
AACR: Fixity versus Fluidity?

The Anglo-American Cataloguing Rules (AACR) have been undergoing continuous revision since the publication of separate British and North American texts in 1967. The second edition (1978), called AACR2, overcame many transatlantic differences. The 1988 and 2002 revisions of the second edition reflect continuing cooperation in the English-speaking world and beyond. Extensive, influential Library of Congress Rule Interpretations correlate with the rules.

Since the mid-1990s, however, the exponential growth of the World Wide Web has changed the bibliographic landscape dramatically. Unprecedented technological changes have created degrees of document fluidity unimagined during AACR’s formative years; an integrated international environment; and tantalizing new possibilities for more sophisticated information management. Not shy in tackling problems, the bibliographic community has responded with reassessment, introspection, and strategic planning. In 1998, the International Federation of Library Associations and Institutions issued the notable Functional Requirements for Bibliographic Records (IFLA 1998). In 1998–99 the Joint Steering Committee for the Revision of Anglo-American Cataloguing Rules issued “The Logical Structure of the Anglo-American Cataloguing Rules” (JSC 1998–99). This document is revealing in that it requires 806 pages to explain the 677-page AACR2. These and other efforts should be precursors to a new international web-oriented cataloging code, an appropriate endeavor at the symbolic beginning of a new century.

It is not surprising that rules developed for card catalogs are encountering serious difficulties in a rapidly changing digital environment. Incremental changes to AACR are unlikely to prevail as information management becomes increasingly web-oriented. Even within libraries, there is bibliographic apartheid. The emergence of metadata schemes with different rules for handling digital materials compounds the existing
separation of cataloging and indexing. Cataloging’s slice of the bibliographic pie will continue to diminish if we do not question fundamental precepts and devise new rules appropriate to the new environment in which we must operate. The current AACR rules overemphasize description and underemphasize relationships. The questioning of rules, conventions, and practices that we will undertake here continues our exploration of the problems and prospects of cataloging in this emergent environment. The consequence of not embracing this singular opportunity for change will likely be a period of increasing confusion, stagnation, and erosion of the cataloger’s professional position. Libraries’ commercial competitors, who already have a head start, are reverse-engineering cataloging. With our pioneering history in distributed systems, it would be uncharacteristic of us to not embark on such a critically needed journey.

Although the distinction between AACR and MARC is stated pointedly by the powers that be, in practice it is nearly impossible to discuss one clearly without the other. Despite protestations to the contrary, LC practice is inextricable as well. In 2002 a detailed analysis was released to clarify the relationships between data structures embodied in the MARC formats and the FRBR and AACR models (Delsey 2002). By contrast, the treatment we undertake here identifies broad thematic problems. The term “work” will be used here generically, reflecting all manner of bibliographic resources. Examples have been fabricated in some cases, and where necessary for clarity, some MARC tagging is included in examples.

Identity Crisis: Identifying and Delineating Bibliographic Resources
Cataloging fails to provide a clear, unambiguous identity for most works. A conditional and varying subset of elements must be extracted to arrive at a surrogate entry, which still may be deficient to cite a work appropriately. Symptoms of the problems are often visible when scanning title indexes in online catalogs. The debate on whether to include or exclude subtitles in title indexes also suggests that the title proper is inadequate even as a title entry. Some subtitles add clarity to title indexes; others clutter them with extended textual passages. Catalogs do not deliver anything near a “standard” citation for a work. Despite all the formality and precision of cataloging rules, style manuals and bibliographic citation software reign whenever works are referenced outside of the catalog. This would seem to be the province of library science. Many problems derive from the lack of a straightforward identity for works. AACR specifies the description of bibliographic resources (expressions or manifestations of a work or an item) and the creation of headings (or access points) to serve as catalog entries. There are elaborate rules regarding the uniqueness of headings. However, only in limited situations does the uniform title main entry provide a single, unique identity for a work. Most works must rely on the coordination of name-title for identity. Since edition, date, and other distinguishing details are excluded from that combination, name-uniform title headings must be created to refer specifically to works that are not unique as to name-title. Such identities are not indicated in bibliographic records and are only stipulated in separate authority records under certain conditions. These complex
techniques, which are subject to uneven application and variation by catalog, thwart the development of more sophisticated bibliographic access and presentation.

Cataloging relies on headings to represent works in alphabetical lists. This list orientation compounds the problem of identifying specific works, which is often complex. When related headings co-file in such lists, deliberately so in the case of uniform titles, often there is no formal relationship between the works represented by these filing siblings. Authority records also rely on co-filing rather than relationships. The list provides an illusory order, but in most cases fails to reflect dependencies of the implicit hierarchies. Without special programming, computers have no inkling of whether the next entry in a list is related to the previous one or not.

There is a tension in cataloging between entries, just discussed, and linking entries. In AACR, the note area describes relationships between works. (The common MARC term “linking entry” is not used in the index or glossary to AACR.) Such notes were pressed into service as “entries” to reflect the importance of serial relationships. However, they are not true headings subject to authority control, and their quasi-use as such has set up a fundamental conflict between authorities and linking entries as the means of recording bibliographic relationships. (The “Building Better Relationships” and “Pre-Coordination” subsections below deal with this systemic problem in more detail.)

The title serves as the chief name or identifier for a work and serves as a pivot point in cataloging. Unfortunately, cataloging blurs the distinction between description and access by compacting both the title and statement of responsibility into one “area.” The transcribed “title proper” subset from this admixture is expected to serve as an access point when possible.
Certain title information constitutes this title proper, while the remainder is descriptive only, as is the statement of responsibility. Thus, there are actually three different kinds of information present in one area:

Subfields of Title Statement

  Title proper:                  245 ^a^n^p (definitely); ^f^g^h^k^s (uncertain)
  Title remainder:               245 ^b
  Statement of responsibility:   245 ^c
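The three-way split just tabulated can be expressed programmatically. In the sketch below the subtitle and statement of responsibility are fabricated for illustration, and the simplified “^” notation of the printed examples stands in for real MARC delimiters; the point is how much interpretive logic a program needs just to recover the title proper from the one “area.”

```python
# Sketch: separating the three kinds of information compacted into a
# 245 field, following the subfield split in the table above.

TITLE_PROPER = set("anp")        # definite; ^f^g^h^k^s are uncertain cases
TITLE_REMAINDER = {"b"}
RESPONSIBILITY = {"c"}

def parse_subfields(body):
    return [(part[0], part[1:]) for part in body.split("^") if part]

def split_245(body):
    parts = {"proper": [], "remainder": [], "responsibility": []}
    for code, value in parse_subfields(body):
        if code in TITLE_PROPER:
            parts["proper"].append(value)
        elif code in TITLE_REMAINDER:
            parts["remainder"].append(value)
        elif code in RESPONSIBILITY:
            parts["responsibility"].append(value)
    return {k: " ".join(v) for k, v in parts.items()}

# Subtitle and responsibility below are invented for illustration.
example = "^aDeparting from deviance :^ba history /^cHenry L. Minton."
result = split_245(example)
print(result["proper"])   # the access point, extracted from the admixture
```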
The lack of clarity in specifying precisely what an entry for a work should be further complicates matters. In AACR, “entry” is an ambiguous concept; it is defined as the catalog record, but commonly refers to the heading under which a work is entered in a catalog. Do “main entry” and the “title proper” together comprise the actual entry for a work? The provision for “name-title added entries” indicates that other works may be referenced by such an entry. The dependency of generic titles, e.g., “Annual report,” on “main entry” further illustrates this interdependency or implied coordination. “Uniform title,” either alone or in conjunction with a main entry, adds further complexity. While the name portion of these couplets can stand alone, the title often cannot—creating problems in title indexes:

100 Minton, Henry L.
245 Departing from deviance :
110 American Law Institute.
245 Proceedings /

100 Freud, Anna, 1895-1982.
240 Works. English. 1967.

The unambiguous entry of works seems curiously limited to serials, where uniform titles often serve in this role as title main entries. Oddly, this type of heading (a single entry instead of a couplet) often fuses a title-name without the benefit of MARC subfielding used in most other cases. Presumably, neither the descriptive title (245) nor the added entry for the society (710) would be considered part of this work’s entry:

130 Annual report (Gambia Ornithological Society)

Interestingly, archival cataloging resolved a similar problem in favor of explicit titles, which, although constructed by catalogers, are not uniform titles:

100 Murray, John M. (John Milne), 1896-1982
245 Papers of John M. Murray, 1915-1982 (bulk 1933-1976)

Joint name-title headings are mostly found as added entries. Where multiple editions, translations, etc., are involved, added entries become more complex due to their uneasy relationship with name-title authority headings representing the same work or an umbrella for a group of works. Often the practice of creating such authorities is skipped altogether. The technique may well introduce more problems than it solves.

The edition, which is often a prominent distinction between two identical titles or identical name-titles, is a separate AACR area. In the majority of cases, editions are numeric; other numeric designations, e.g., “Part 4,” are included as part of the title proper when a work issued in sections is cataloged separately. Thus, one kind of numbering closely related to the title is recorded in the title area and another in the edition area. The edition area also has its own statement of responsibility, following the pattern of the title area in mixing data elements. The series forms another separate area.
A series actually represents the relationship between the title of a work and the title of its parent series, another “work.” Series also have statements of responsibility, which would seem to belong more properly on their own serial record. Not treating series as relationships is highly problematic. Ignoring the entry itself, contrast how the enumeration for series (830 ^v) and component-part linking entry (773 ^g) is typically entered:

Series:      v. 17
Component:   Vol. 17, no. 98 (Feb. 1948), p. 78-159

Even when the entry is the same, they do not interfile appropriately. Trying to collocate or link a serial title, its analytics, and its components is more difficult due in part to the varying treatments resulting from defining series separately from other relationships. Cataloging rules have been developed to solve particular problems. These solutions fail
in cases where data, formerly separate, is thrust together in digital environments. Considering the larger picture reveals the stress fractures in current practice. Combining information from different sources introduces more difficulties. An indexing agency might record equivalent information, typically with differing syntax, in another format that could be converted into a component part record:

Indexing:    1948 Feb; 17(98):78-159
Thus, there are three common formats that differ for the same data, and the dates are intermingled with enumeration in one case (773). The same dependency issues discussed for titles apply to series titles. Almost instinctively, rules have moved toward title entry to avoid the conundrum of joint headings for name-series title (MARC’s 800, 810, 811). The adage “Do not strain for consistency” is well advised in many situations, but not at a fundamental structural level. Clear bibliographic identities are needed upon which to base clear bibliographic relationships.

Building Better Relationships
Only recently have relationships been accorded the degree of prominence in cataloging that they deserve. Despite efforts to handle relationships as a fundamental aspect of cataloging, only a subset of them is handled systematically. Others remain dispersed about the cataloging record, causing unnecessary variation. This subsection identifies some of the places in the catalog record not covered by linking entries—the formal mechanism of recording relationships described in notes. An exploration of these inconsistencies reveals the structural origin of problems resulting from embedded relationships and the inherent conflict between uniform title authorities and linking entries.

Uniform title headings were developed because the identity of works can be problematic in cases of multiple manifestations. This technique serves to produce an orderly list of index entries. However, the actual hierarchical relationships present are only implied. The headings of the top levels of many hierarchies serve primarily as umbrella authority records, while a heading at the bottom level often applies to a single bibliographic record. Rarely are online catalogs able to capitalize on such implied hierarchical structures as are found in indexes:

▼ Bible.
  ▼ Bible. N. T.
  ▼ Bible. O. T.
      Bible. O. T. German. Reuss. Selections. 1923.

The practice of creating uniform titles has been extended to cover disambiguation in cases of title conflict, particularly for serials. Unqualified titles in such cases are not the parent for separate titles, although they are sometimes related via an intervening title. Often authority records for such uniform titles are not created, as these would be little more than shadow bibliographic records.

Journal of electricity (San Francisco, Calif. : 1895)
Journal of electricity (San Francisco, Calif. : 1917)
Series are treated as uniform titles. For numbered series, it is difficult to know whether a uniform title series authority record, a serial bibliographic record, or both are needed.

Advances in cardiology              [130 Series Authority]
Advances in cardiology ; v. 10.     [830 Analytic]
Advances in cardiology ; v. 32.     [830 Analytic]

Hierarchical relationships are actually involved. They imply the relationship between the series itself and the title of the individual volumes of the series. There are sequential relationships among the sibling titles, all of which have a subordinate relationship to a shared parent title.

▼ Advances in cardiology                             [245 Serial]
    v. 10: Body surface mapping of cardiac fields    [245 Analytic]
    v. 32: Assessment of ventricular function        [245 Analytic]
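Simple XML nesting is one way to make this implied hierarchy explicit rather than leaving it to co-filing. The sketch below is illustrative only; the element names (serial, analytic, title) and the enumeration attribute are invented for the example and are not drawn from any published schema.

```python
# Sketch: expressing the series/analytic hierarchy as nested XML, so the
# parent-child and sibling relationships are explicit rather than implied
# by co-filing. Element and attribute names are invented for illustration.

import xml.etree.ElementTree as ET

serial = ET.Element("serial")
ET.SubElement(serial, "title").text = "Advances in cardiology"

for vol, title in [("v. 10", "Body surface mapping of cardiac fields"),
                   ("v. 32", "Assessment of ventricular function")]:
    analytic = ET.SubElement(serial, "analytic", enumeration=vol)
    ET.SubElement(analytic, "title").text = title

# A program can now walk the hierarchy directly, with no special
# knowledge of filing conventions or uniform-title coordination.
for analytic in serial.findall("analytic"):
    print(analytic.get("enumeration"), ":", analytic.findtext("title"))
```

Because the structure itself records the subordination, collocation and hierarchical display fall out of ordinary tree traversal instead of depending on headings that merely sort adjacently.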
The overlap between series (an authorized heading) and serial (a bibliographic record) is indicative of a structural conflict that makes access to and display of series’ relationships to their parent title suboptimal. Both treatments represent legitimate cataloging, but they are not readily reconcilable. Comparing the wildly differing coding available for the identical data elements when the parent title above is interpreted as an authority versus as a bibliographic record dramatizes this conflict:

Data Element               Authority Coding    Bibliographic Coding
Title                      130                 245
Variant title              430                 246
Earlier title              530 ^wa^a           780 ^t
Place/publisher/date       643 ^a^b^d          260 ^a^b^c
Enumeration/chronology     640                 362
Numbering peculiarities    641                 515
German cataloging rules avoid this problem by acknowledging that the parent title (the collective title for a multivolume work) and each individual volume title all need bibliographic records. It may be ill advised to advocate international acceptance of flawed American practices when solutions developed elsewhere are meritorious. Achieving international consensus might be easier in the context of new rules, where the best of many practices could be reconciled more easily than by assuming Anglo-American precedence.

The common hierarchical patterns of inter-record relationships shown in the example below help illustrate the built-in conflict between series and serials. Numbered monographic series enjoy the distinction of being both series (under the purview of authorities) and serials (treated as bibliographic records). The titles of analytics “link” to their host monograph (1) or serial (2) using uniform titles, whereas their component
parts use linking entries. The separation of indexing (component parts) from cataloging (analytics) served to avoid this problem, which now impedes fully integrating digital libraries. Collections (3) use linking entries exclusively, whereas other relationships rely solely on the deliberate co-occurrence of uniform titles (4) in separate authority and bibliographic files.

1. ▼ Monograph (245)            ▼ Monograph (245)
     ▼ Analytic (830)               Component (773)
         Component (773)

2. ▼ Serial (245)               ▼ Serial (245)
     ▼ Analytic (830)               Component (773)
         Component (773)

3. ▼ Archival Collection (245)
     ▼ Subunit (773)
         Subunit (773)

4. Uniform Title Authority (130)
   Uniform Title Bibliographic Entry (x30)

Instead of trying to predefine individual bibliographic formats and specific relationships as part of the bibliographic structure, authority records for each sanctioned format and each defined relationship could allow these to evolve and change without disrupting the structure itself. Previously, format integration and, now, realignment to accommodate “integrating resources” are unnecessarily disruptive to bibliographic systems—especially to relationships. It is inevitable that “formats” will continue to be reinterpreted. Users care about types of resources, e.g., a database, website, or periodical; how librarians choose to organize what are really form/genre designations (655) should not affect the bibliographic structure. A primary form/genre term could represent format. Flux in what constitutes bibliographic “formats” makes defining a stable system of bibliographic relationships unnecessarily difficult.

Bibliographic records have been overloaded with all sorts of “enrichment.” Such information is valuable and is better tacked onto a related record than omitted. However, in many cases it blurs identities between works and tends to make records presumably describing one work border on being free text. While free-text searching and database records are both valuable, there are different retrieval issues for each. Linking separate records representing works more atomically could reduce problems for users, catalogers, and programmers.

Notes disguise many relationships. With upwards of ten subfields, many of these “notes” represent embedded records for works related to the work represented by a record. One example is the Original Version Note (534), which is referred to as a “citation” in MARC documentation. Note how its mostly descriptive subfields parallel regular fields:
Subfields of 534                                     Parallel Field
^a Main entry of original                            1xx
^b Edition statement of original                     250
^c Publication, distribution, etc. of original       260
^e Physical description, etc. of original            300
^f Series statement of original (repeatable)         4xx
^l Location of original                              852
^n Note about the original (repeatable)              500
^t Title statement of original                       245
Such notes may accommodate materials not owned, but the technique adds unnecessary complexity to a single record. They also fail to maintain subfielding from the associated record.

The Contents Note (505) often reflects multiple subordinately related works. Its format is determined by the relevant cataloging rules. The field has grown in complexity from basic to “enhanced.” The new subfields may be useful for display, but are inadequate for improved indexing. Coding titles (^t) without accounting for initial articles and using statements of responsibility (^r) instead of inverted, uncontrolled names (720) are of limited value:

505 ... ^tQuark models /^rJ. Rosner -- ... ^tJet phenomena /^rM. Jacob -- …
505 ... ^gvol. 1.^tThe history of Anne Arundel County …

Trying to make description serve as access is fraught with problems. Some vendors offer alternative local fields when providing contents to circumvent the inadequacy. The display of separate linked records could be based on one “contents” relationship in a parent record, which could retrieve the linked component records on demand, or automatically. The display might look identical, but a gain in flexibility would result. Such component part records could in turn link to full text.

Lumping contents on parent records degrades search precision. A search for “steel properties” would include this nondescript hit due to its contents note:

Chalmers anniversary volume / [main entry under title]

Fragment of Contents Note from This Record
… -- Glass formation, structure, and kinetics / D.R. Uhlmann -- The mechanical properties of steel at high temperatures / F. Weinberg.

instead of one like this reflecting the relevant content:

Weinberg, F.                                                [author]
The mechanical properties of steel at high temperatures     [title]
In: Chalmers anniversary volume                             [link]
Such lumping increases the frequency of “false drops” due to the intermingling of characteristics of separate works on a single record. Some libraries add subjects based on
specific contents, further blurring the identities of component works. (Compare the MARC section earlier in this chapter for associating subjects with contents notes.) Sometimes data representing two or more works described in one record may support valid unique retrievals. The value of such coincidences is not lost due to creating separate records for each work. For example, the Research Libraries Information Network’s (RLIN’s) clustering technique shows that the co-indexing of related records makes retrieval more effective. Notably, the VTLS system’s demonstration of support for the FRBR work/expression/manifestation relationships bodes well. System limitations and the lack of crisp inter-record relationships have limited wider reliance on these techniques. Most encouraging, XML’s linking techniques are ideal in their support of treating records discretely, or as a group, when needed. Assembling separate records is relatively easy, but lumped records cannot be split on the fly. When contents notes actually signify component parts, they should be treated as such; creating component part records need not be so difficult. Divide and conquer, a time-honored precept, should serve to delineate different works crisply, but not to divorce cataloging from indexing—different methods for providing access to the same type of content. Less complicated cataloging could serve to unite indexing and cataloging—seamlessly.

Cataloging practice indicates other notes to justify uniform titles, although they often describe ordinary inter-record relationships. This construction is common for translations:

245 10^aBehold man :
240 10^aSe människan.^lEnglish
500 __^aTranslation of Se människan.

There is a linking field for Original Language Entry (765) with a wide assortment of subfields, but not one for language. Is this apparent disjunction between AACR and MARC symptomatic?
MARC documentation suggests “Translation of:” for displaying:

765 1_ … ^tSe människan

Instead, following current practice, an AACR name-title entry would require a third record—an authority record forming a roundabout linking entry that could cover all English translations.

100 1_ … ^tSe människan.^lEnglish
400 1_ … ^tBehold man

This is only one example of “empty” or “redundant” authority records, ones that often repeat information from a single instance of a work, where a link is actually needed. In some cases, uniform entries serve their original purpose as an umbrella for many works; these are not very different from collective bibliographic records. Creating umbrella bibliographic records designated to control uniform entries that only occur on the linked subordinate bibliographic records would be more economical and more coherent.

Generally, entries are not made from an original to its translations. In the example below, a cataloger apparently could not resist including an internal note about an available translation of this title, although the translation has a uniform title authority.
A generic link as suggested above for contents notes would permit the retrieval of all translations of a work—if these were handled as relationships. Translations also represent a form/genre as discussed above.

245 ^aKabbalah and alchemy :
952 ^aItalian translation issued in 1999 as Cabbalà e alchimia.

Like notes, current linking entries (76x-78x) are unnecessarily complex. They accommodate all manner of descriptive information. There are subfields to describe a related work’s physical description (^h), series (^k), material specific details (^m), notes (^n), etc. Fortunately, many of these subfields are seldom used or needed. Do these provide insurance in case the relationship references a work not represented in a particular catalog? Whether the options are due to MARC or AACR, is the complexity justified? Three discrete components are needed here: an eye-readable entry for the related work, a numeric link, and information unique to the relationship.

Added entries also represent relationships—mostly to names and topics represented by authority records. Relationships are a singular phenomenon, and should be treated consistently. A “title as subject” entry forms a de facto link between two bibliographic records, but is not treated as such, again relying on a uniform title. Even a major copyright lawsuit is apparently insufficient to justify including some prominent relationships between works. In RLIN, if one library supplies an additional entry or link, all benefit:

The Wind Done Gone                  [245 ^a]
Parody of: … Gone with the Wind     [6xx ^g^t]

However, many libraries only update their local copies of records, and “fuller record notification” features lack a sophisticated synchronization mechanism to ensure that our collective efforts are maximized without overlaying purely local enhancements. A system of discrete bibliographic entities coupled with a consistent linking mechanism would be very powerful and enable more sophisticated retrieval.
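What such a consistent linking mechanism might look like in XML can be sketched briefly. The element and attribute names below (record, relationship, href) are invented for illustration; a production system would more likely use XLink or a schema-defined link type. The point is that one generic relationship element, carrying a type and a target identifier, can stand in for the assortment of notes, linking entries, and uniform-title coordination, using the translation example above.

```python
# Sketch: two discrete bibliographic records joined by one generic
# relationship element. Names (record, relationship, href) are invented
# for illustration; the target identifier could equally point to a
# record in another system.

import xml.etree.ElementTree as ET

catalog = ET.fromstring("""
<catalog>
  <record id="rec1">
    <title>Se människan</title>
  </record>
  <record id="rec2">
    <title>Behold man</title>
    <relationship type="translationOf" href="rec1"/>
  </record>
</catalog>
""")

records = {r.get("id"): r for r in catalog.findall("record")}

# Resolve every relationship to an eye-readable display line.
for rec in records.values():
    for rel in rec.findall("relationship"):
        target = records[rel.get("href")]
        print("%s -- %s: %s" % (rec.findtext("title"),
                                rel.get("type"),
                                target.findtext("title")))
```

Retrieving all translations of a work then becomes a query over relationship types rather than a reliance on co-filing uniform titles, and because the href is only an identifier, the target record need not reside in the same catalog.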
Cataloging is laden with relationships that are not covered by or that do not utilize existing linking entry fields. XML offers sophisticated linking techniques, and related records need not even reside in the same system to be directly linkable.

Literal Transcription, or Bibliographic Management?
People are inconsistent, thus works are inconsistent. The current rules give individual records precedence over the collective bibliographic whole, prescribing careful transcription of data elements verbatim, except as to punctuation and capitalization. While this is supposed to ensure cataloging consistency, most cataloging is copied from records in resource files. Transcription rules have had to soften due to the rise of digital materials and their inherent fluidity. Headings, beholden to description, only partially succeed in providing coherent and unified access to bibliographic resources. The emphasis on description in cataloging is misplaced. Catalogers cannot expect to flourish if they are primarily scribes. Recent job postings give the impression that catalogers who deal with vast complexity (AACR,
MARC, etc.) are valued less than “metadata specialists” who apply simplified coding (Dublin Core) for online resources. Cataloging is also a target of outsourcing, usually associated with non-core functions having well-defined boundaries. Cataloging is about as core as it comes in libraries, and catalogs emphasizing sophisticated contextual relationships have the fuzziest of boundaries. Instead of pretending that digital materials are different, cataloging can add value to both the traditional and the digital by focusing on integrated access to all library resources. Catalogers are not the only library personnel who should be concerned about negative perceptions. It is ironic that the exponential growth in information and dazzling system capabilities is being met with a decline in library schools. Digital works represent the ultimate transcription challenge. Providing controlled access to selected quality resources within the chaotic web environment offers catalogers their best hope of continuing relevance. XML was created to provide semantic access in this environment; it is available to librarians and to our competitors. We have the expertise and a distributed network of librarians around the world; why not leverage our unique assets? The following discussion explores some aspects of descriptive cataloging that need reassessment in view of the hyperlinked World Wide Web. Computers facilitate change, yet oddly, there is resistance to changing bibliographic and authority records. Change and control are not necessarily incompatible, however. In AACR, description and access are a mantra. Focusing too much on the piece at hand results in a kind of tunnel vision, especially when it comes to authorities. A broader perspective when formulating headings could result in more consistent groups of related headings being created more efficiently. 
Determining a suite of headings all at once for the current divisions of an organization likely to publish, or for its historical lineage, offers advantages over intermittently establishing headings as materials trickle in randomly. Such authorities would be useful for archival materials, for organizational websites, and as subjects. Larger organizations could monitor their own authorities, potentially as part of a system of distributed responsibility. Those who have tackled retrospective serials conversion know the value of looking at whole families of titles before making cataloging decisions. How entries fit into indexes deserves more attention, like the proverbial forest obscured by the trees. More perspective could prevent the vagaries of description occurring on individual resources from assuming undue importance. Web searching often readily confirms predominance or patterns, but too often the version occurring on the first publication received persists despite being unwarranted. Cataloging rules could be devised to treat related entries more consistently. Even when cross-references are made, records are often listed by main entry or authorized entry, thus scattering similar works or authorities. Relying so heavily on description results in a bias for inconsistency. Description-Based:
California Milk Quality Act
Cheese Quality Act (California)
Yogurt Quality Act of California
Consistent But Not Literal:
Milk Quality Act (California)
Cheese Quality Act (California)
Yogurt Quality Act (California)

The names of subordinate bodies are prone to vary in their wording sequence, intentionally and unintentionally. Cataloging rules in this area are complicated. Using predominant forms is a useful principle, but may not be the only factor that should be considered. Different forms of names might be designated as principal distinctive entry (to stand alone) and principal subordinate entry. The telephone directory often does a better job of arranging governmental subordinate bodies, subordering them by the operative word rather than by generic terms regardless of usage. Are users really interested in sifting through screens of “Dept.,” “Division,” etc., to find a needed agency? An alternative is illustrated below:

Education Board.
Fire Dept.
Human Resources Dept.
Library.
Police Dept.

Operative words often appear first in Latinate languages, e.g., Biblioteca Pública de Redwood City. The flexible inclusion of qualifiers, before or after generic names, could allow context-sensitive display:

Description:
Redwood City Public Library

Potential Entries:

Public Library (Redwood City, Calif.)
Redwood City (Calif.) Public Library

Problems in the variation of description are exacerbated on the Web. First, many more descriptive sources are available. Second, change is almost assured. Web page creators may change layout and wording for stylistic reasons with little regard for title integrity or consistency. Sometimes it is difficult to figure out the title, even on websites created by libraries. Serials catalogers have long faced this problem, with print titles changing almost whimsically. Recent changes in cataloging rules have addressed this to an extent. Would the following “title” with its fuzzy aggregate identity qualify as a single work? The content has remained essentially the same, and most users would likely prefer the latest content. Entry under the latest title or some sort of formal entry might be in order.

AHA Guide to the Healthcare Field
American Hospital Association Guide to the Healthcare Field
The AHA Guide
Guide to the Healthcare Field (American Hospital Association)

By contrast, when each issue contains new content or replacement content that users would not consider equivalent or substitutable, successive entry makes more sense. Users will cite and seek titles under the title of a journal at the time it was published. To avoid confusion, indexing citations (component-part linking entries) need to match that title. Title changes often reflect changes in emphasis, responsibility, etc., indicating needed differences in cataloging. If such changes are lumped on a single record, that record can become overburdened with information and be difficult to interpret. Unfortunately, rules have allowed the choice of successive or latest entry. Each is valuable, but under different circumstances. Many publishers have fused author and title in various ways, apparently attempting to create or retain a distinctive marketing identity for various works—even when the author has died. Are prepended character strings any less statements of responsibility than trailing ones? Consider the following:

Columbia encyclopedia
Hurst’s the heart
Mayo Clinic’s complete guide for family physicians and residents in training
Maxcy-Rosenau-Last public health & preventive medicine
Oxford textbook of ophthalmology
Prosser, Wade, and Schwartz’s torts
Rook/Wilkinson/Ebling textbook of dermatology

This phenomenon is more common in edited, multi-edition works entered under title. The effect is to scatter entries. Uniform titles to organize the editions occurring under bald titles and various permutations of prepended editors are the exception, making it hard to sort out editions in most online catalogs. The display of such titles would benefit both from more emphasis on the title in the context of the title index and from keeping title entries focused on title rather than on responsibility.

Potential Entry                              AACR Title Entry
The Thyroid (Green : 1987)                   The thyroid
The Thyroid (Hazard : 1964)                  The thyroid
The Thyroid (Kini : 2nd ed. : 1996)          The thyroid*
The Thyroid (Kini : 1st ed. : 1987)          The thyroid*
The Thyroid (McGavack : 1951)                The thyroid*
The Thyroid (Werner : 8th ed. : 2000)        Werner & Ingbar’s the thyroid
The Thyroid (Werner : 7th ed. : 1996)        Werner and Ingbar’s the thyroid
The Thyroid (Werner : 6th ed. : 1991)        Werner and Ingbar’s the thyroid**
The Thyroid (Werner : 5th ed. : 1986)        Werner’s the thyroid***
The Thyroid (Werner : 4th ed. : 1978)        The thyroid***
The Thyroid (Werner : 3rd ed. : 1971)        The thyroid***
The Thyroid (Werner : 2nd ed. : 1962)        The thyroid; a fundamental and …
The Thyroid (Werner : 1st ed. : 1955)        The thyroid, a fundamental and …
The Thyroid and its diseases (DeGroot …      The thyroid and its diseases*

* Main entry under author
** Added uniform title (730): Werner’s the thyroid
*** Uniform title (130): Thyroid (Werner)

The actual title edited by Werner has remained constant for half a century, although Werner retired in 1977 and died in 1994. The filing sequence shown above assumes that qualifiers are coded explicitly and that a reverse chronological order system capability exists. While descriptive titles are usually useful, catalogs are failing to deliver clarity where it is needed most—in cases of ambiguous and confusing titles. Authority records are underutilized. Unnecessary duplicative description could be eliminated by including the place of publication once in a publisher’s authority record, and not repeatedly in bibliographic records. The Oxford University Press publishes 3,000 new books each year; library catalogs record its location as Oxford, London, or New York, etc., thousands of times, rather than shedding light on the location of its various offices. The country code for O’Reilly publications varies, indicating China, Massachusetts, or California depending on which place appears first alphabetically when published, and changes as new places are added to the imprint while the ISBN prefix remains the same. What is the value of this? Eliminating the codes and allowing repeatable places of publication in an authority record would be more accurate, less work, and could further international accommodation. The place of publication is important in some cases, e.g., rare books, but its value has diminished on the Web. Alternatively, these 454 records could link to one authority:

Bibliographic Records:
LC Catalog:

Philadelphia, PA : Lippincott-Raven Publishers, 1995.   [16 records]
Philadelphia, PA : Lippincott-Raven Publishers, 1996.   [110 records]
Philadelphia, PA : Lippincott-Raven Publishers, 1997.   [175 records]
Philadelphia, PA : Lippincott-Raven Publishers, 1998.   [153 records]

Authority Record:

Lippincott-Raven Publishers. Located (1995-1998): Philadelphia, PA.

The rules regarding recording the shortest recognizable form of publishing information changed somewhat in 2002 (the last example below will not be a prospective problem). However, the emphasis remains on description, and the result varies unnecessarily from that used in name entries. Publishers are usually corporate bodies, but are seldom recorded as entries in bibliographic records. Many authority records for
publishers already exist, and it is worth considering establishing relationships between works and publisher authorities.

Description                 Authority
McGraw-Hill                 McGraw-Hill, Inc.
McGraw-Hill Book Co.        McGraw-Hill Book Company.
Meister Pub.                Meister Publishing Company.
Wiley                       John Wiley & Sons.
The Association             Research and Education Association.
Keeping up with mergers and splits may be difficult, but worthwhile. Entries for publishers and aggregators are increasingly important in managing digital material licensure, as agreements focus on the supplier. Knowing who owns whom is also valuable. Does this corporate relationship need to be different from other corporate added entries?
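As a minimal sketch of such work-to-publisher linking, a system might store the authorized heading once and resolve transcribed forms against it. The lookup table below simply reuses the examples above; the function name and fallback behavior are invented for illustration, not drawn from any cataloging system.

```python
# Sketch: resolving transcribed publisher statements to a single
# authority heading instead of repeating imprint data in every
# bibliographic record. The table is an illustrative subset.
PUBLISHER_AUTHORITY = {
    "McGraw-Hill": "McGraw-Hill, Inc.",
    "McGraw-Hill Book Co.": "McGraw-Hill Book Company.",
    "Meister Pub.": "Meister Publishing Company.",
    "Wiley": "John Wiley & Sons.",
}

def resolve_publisher(transcribed: str) -> str:
    """Return the authorized heading for a transcribed form, falling
    back to the transcription itself when no authority record exists."""
    return PUBLISHER_AUTHORITY.get(transcribed, transcribed)

print(resolve_publisher("Wiley"))          # John Wiley & Sons.
print(resolve_publisher("Unknown Press"))  # Unknown Press
```

A real system would link on record identifiers rather than strings, but the principle is the same: the variation lives in one authority record, not in thousands of bibliographic records.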
Worldwide Cataloging
An international cataloging code is overdue. Efforts toward parallel authority headings in different languages are to be congratulated. The current Anglo-American emphasis results in many mixed-language headings and impedes internationalization. In another example of overreliance on description, some “authoritative” entries result in a language differing from the vernacular used by an organization: Current Mixed-Language Entry:
Germany. Bundesamt für Naturschutz.
  x Germany. Federal Agency for Nature Conservation.

Dual-Language Entries Better?

Deutschland. Bundesamt für Naturschutz.
  x Germany. Federal Agency for Nature Conservation.

The following hybrid Japanese-English authority entry occurs without any cross-references. The note regarding its data source includes “usage: Import Promotion and Cooperation Dept. Japan External Trade Organization.” The parent entry includes more than one English variant. An improved authority mechanism would need to indicate a preferred entry for each language.

Nihon Boeki Shinkokai. Import Promotion and Cooperation Dept.

The practice extends to cross-references, as shown below in another authority that includes the note “Tokyo Chapter, American Literature Society of Japan.” Treating subordinate bodies as relationships rather than relying on alphabetical co-filing holds the potential for display subarranged by language.
Nihon Amerika Bungakkai. Tokyo Shibu.
  x Nihon Amerika Bungakkai. Tokyo Chapter.

More Useful?

Nihon Amerika Bungakkai. Tokyo Shibu.
  x American Literature Society of Japan. Tokyo Chapter.

The authority below illustrates alternating languages in a heading and mixing languages in the same subfield:

110 10 ^aCôte d’Ivoire.^tLaws, etc. (Ivoire-codes. Droit privé)

XML readily permits entries in an authority record to carry the language and script of each heading as attributes; this is discrete from the entry content. Parallel headings in different languages reflect an equivalence relationship. Mixed-language headings are counterproductive in a global environment.
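A minimal sketch of such language and script attributes follows. The element and attribute names are invented for illustration; an actual schema would define its own vocabulary.

```python
# Sketch: parallel headings in an authority record carrying language
# (and optionally script) as attributes, discrete from the content.
import xml.etree.ElementTree as ET

record = ET.fromstring("""
<authority>
  <entry lang="ja" script="Latn">Nihon Amerika Bungakkai. Tokyo Shibu.</entry>
  <entry lang="en">American Literature Society of Japan. Tokyo Chapter.</entry>
</authority>
""")

def preferred_entry(rec, lang):
    """Return the heading preferred for a given language, if any."""
    for entry in rec.findall("entry"):
        if entry.get("lang") == lang:
            return entry.text
    return None

print(preferred_entry(record, "en"))
# American Literature Society of Japan. Tokyo Chapter.
```

A display layer could then select the heading matching the user's language and fall back to another, rather than interfiling mixed-language forms.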
Pre-Coordination
The effects of name-title pre-coordination form a thread through this section on AACR and the previous one on MARC. Trying to keep selected information together, and separate it simultaneously, has far-ranging implications. This is of particular concern because the coordination is often implied and because most systems index names and titles separately. The implicit bibliographic “entry” below (110/240) is explicit when referenced in another bibliographic record (710 ^at) or controlled by an authority record (110 ^at). This structure (110 ^at) is valid in bibliographic records, but rarely used. This example also illustrates the special use of a uniform title. It represents form/genre (laws) differently than in form/genre headings (655), yet does not serve to group all the laws of France effectively. Implicit (Separate):
110 1_ ^aFrance.
240 10 ^aLaws, etc.

Explicit (Together):

710 1_ ^aFrance.^tLaws, etc.

Authority:

110 1_ ^aFrance.^tLaws, etc.
Usually, such entries must be post-coordinated to achieve the intended pre-coordination. By contrast, the qualification of various headings also represents a sort of precoordination. When qualifiers are subfielded, the resulting entries are structurally similar to the explicit case above. Interestingly, other qualifiers are not subfielded, which actually forces the combination to function as a unit. The two patterns are difficult to reconcile:
Qualifier Subfielded:
111 2_ ^aInternational Railway Congress^n(6th :^d1900 :^cParis, France)
110 1_ ^aMinnesota.^bConstitutional Convention^d(1857 :^gDemocratic)
130 0_ ^aTreaty of Medicine Creek^d(1854)

Qualifier Not Subfielded:
830 _0 ^aLove stories (Bantam Books (Firm)).^pSuper edition.
630 00 ^aRevelations (Choreographic work : Ailey)
710 2_ ^aUrban Land Program (Asian Institute of Technology)

The aim in all cases appears to be to group like records together, while distinguishing them from similar groupings. Consistently delivering reasonable sets of records is difficult with three patterns of togetherness for entries/headings: (1) loose—“coordinated” separate fields, (2) medium—separate subfields, and (3) tight—no subfielding. This ambivalent structure requires additional considerations for catalogers and for system designers. Although not governed by AACR, subject cataloging also relies heavily on pre-coordination. The desirable degree of coordination remains controversial. However, the aims seem similar to other headings discussed above—creating useful sets of similar records. With too much pre-coordination, too many “sets” consist of only one member; without pre-coordination, many sets become too large to be useful. Is there a happy medium? Subjects are often thought of as topical (650) or geographic (651), although more recently form (655) and chronological (648) ones have emerged. Personal names (600), corporate names (610), meeting names (611), and uniform titles (630) may serve as subjects—with added complexity when subheadings or titles are appended. A name-title as subject compounds the issues raised above by introducing a third aspect in one field. Subheadings, representing form (^v), topic or language (^x), chronology (^y), or geography (^z), may apply in each case. Despite the predetermined order of subheadings, as in Library of Congress Subject Headings, the number of possible combinations is astronomical. Controlled vocabularies’ accommodation of huge numbers of permutations of various factors becomes counterproductive due to increasing fragmentation. After all, the fundamental purpose of subjects is one of grouping.
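The set-building character of post-coordination can be sketched with plain set operations: each record carries separate facet values, and a query intersects them at search time. The records and facet names below are invented for illustration.

```python
# Sketch: post-coordinated subject retrieval as set intersection.
# Record 3 covers Latvian history AND Estonian economics; matching it
# for "Economics" + "Latvia" is the classic false drop that separate
# facets permit.
records = {
    1: {"topic": {"History"}, "place": {"Latvia"}},
    2: {"topic": {"Economics"}, "place": {"Estonia"}},
    3: {"topic": {"History", "Economics"}, "place": {"Latvia", "Estonia"}},
}

def post_coordinate(topic, place):
    """Return ids of records carrying both facet values."""
    return sorted(
        rid for rid, facets in records.items()
        if topic in facets["topic"] and place in facets["place"]
    )

print(post_coordinate("History", "Latvia"))    # [1, 3]
print(post_coordinate("Economics", "Latvia"))  # [3] -- a false drop
```

The same data also shows the trade-off discussed above: pre-coordinating every combination would keep record 3 precise but multiply the number of one-member sets.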
More sophisticated systems would offer users the option of different subarrangements of a subject depending on their immediate needs. The National Library of Medicine recognized the growing incompatibility of differing practices for MeSH (Medical Subject Headings) in cataloging and indexing and changed its cataloging practice to only allow topical subdivisions. Despite this, there are almost half a million possible valid combinations. The number of LCSH combinations is already in the millions. The FAST (Faceted Application of Subject Technology) Project also recognizes the value of post-coordination (OCLC 2002). While not venturing a definitive answer, we have included this topic to indicate its importance and to suggest that more effective possibilities exist. Crisply delineated
information can be manipulated more readily, leveraging libraries’ investments in coding information. We need to review where pre-coordination produces clarity, as with qualification, versus where it fragments and could be easily post-coordinated. Whether or how this differs from relationships merits consideration. Current practice is too heavily based on historical accident rather than sound principles. The problems with subject post-coordination often focus on “false drops,” or retrieval resulting from mixing elements from separate headings that were not intended to be coordinated. Should the following two subjects be assigned to a work, without pre-coordination, Latvian economics or Estonian history could be interpreted incorrectly.

Latvia — History
Estonia — Economics

How many such false drops could be eliminated if bibliographic records represented one work or one discrete part of a work? The discussion of notes in the “Building Better Relationships” subsection above illustrates how keeping names and titles separate is beneficial; the same applies to subjects. Falsely coordinating subjects may indicate the need for separate, linked records. If separate records fail to circumvent false drops, the subjects covered may be sufficiently diffuse that the degree of false drops would be tolerable.

Et Cetera, Etcetera, Etc.
ISBD (International Standard Bibliographic Description) punctuation precedes the bibliographic element that it introduces. It is recorded in the preceding subfield, not in the subfield that determines it. Whenever that data element is separated from the following one for processing or display, the trailing punctuation is no longer meaningful:

245 ^aAs time goes by /      [statement of responsibility follows]
245 ^aAs time goes by :      [subtitle follows]
245 ^aAs time goes by.       [section no./title or no other elements follow]
245 ^aAs time goes by        [general material designator follows]
245 ^aAs time goes by =      [parallel title follows]
260 ^aSebastopol, Calif. :   [publisher follows]
260 ^aBeijing ;              [another place follows]

ISBD punctuation often represents duplicative coding, but MARC and ISBD are not tightly coordinated. The slash indicates a statement of responsibility is next, which is what 245 ^c represents, or sometimes 250 ^b or 505 ^r. MARC does not code responsibility in series, and coding it in notes is optional. MARC and ISBD are only superficially parallel:

245 04 ^aThe plays of Oscar Wilde /^cAlan Bird.
250 __ ^a4th ed. /^brevised by Pam Miller and Jim Xitco.
250 __ ^aCanadian ed. =^bÉd. canadienne.
490 1_ ^aMap / Geological Survey of Alabama ;
505 … Sonata in G minor / Purcell
505 … ^tLectures in accelerator theory /^rM. Month.
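If the punctuation were instead derived at output time from the code of the subfield that follows, systems could store clean content and generate ISBD display on demand. A minimal sketch, assuming a small invented mapping for a few 245 subfields (not a full ISBD rule set):

```python
# Sketch: deriving ISBD punctuation from the subfield that FOLLOWS,
# rather than transcribing it into the preceding subfield.
ISBD_PRECEDING = {
    "b": " : ",   # other title information (subtitle)
    "c": " / ",   # statement of responsibility
}

def render_245(subfields):
    """Render (code, value) pairs, inserting punctuation before each
    element based on that element's own subfield code."""
    out = []
    for i, (code, value) in enumerate(subfields):
        if i > 0:
            out.append(ISBD_PRECEDING.get(code, " "))
        out.append(value)
    return "".join(out)

print(render_245([("a", "The plays of Oscar Wilde"), ("c", "Alan Bird.")]))
# The plays of Oscar Wilde / Alan Bird.
```

This is essentially what an XML stylesheet does: the stored content stays punctuation-free, and each output context applies its own conventions.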
ISBD’s separate representation of coding invites problems. It is not used outside of libraries and makes incorporating cataloging records into bibliographies difficult. It requires special programming for display and processing, and is often left trailing because the difficulty in managing it is not worth the trouble or expense. XML resolves the problem, neatly, by separating content from presentation. Stylesheets allow the reuse and redisplay of markup in as many ways as desired. However, punctuation can only be associated with data that is discretely identified. Capitalization conventions vary by language. In AACR, the filing word of the title is not capitalized following an initial article in English. When initial articles are removed for simple lists, the resulting string begins in lowercase; other titles begin with uppercase. Some systems compensate for this, but the problem persists when using records outside such systems. Names are often detectable due to their capitalization. Titles represent the names of works and are usually found capitalized like names. Even when completely in uppercase, titles often appear with significant words capitalized elsewhere on the item to distinguish the title from other text: The Best Loved Poems of the American People It seems that more often than not, it requires extra effort to change what appears most commonly. With digital materials, cut-and-paste and automated data-extraction techniques suggest that following what occurs predominantly would involve the least amount of manual editing. In one project, the majority of 6,000 component records created semi-automatically from full-text journal articles used upper/lower case of keywords; it was not feasible to make these conform to AACR because proper nouns, acronyms, and other conventional usages (for example, tRNA) could not be readily detected (Li, Miller, and Buttner 2002). Judging by nearly forty pages of rules in AACR2, capitalization is not a simple issue. 
However, it appears that rule changes would permit achieving consistency without causing attendant problems. Alternatively, this may be an area where rule options to accommodate common patterns would make more sense than absolute consistency. MARC’s complicated nonfiling indicator system makes matters worse, especially since the offset does not represent the beginning filing character: Offset Matches Filing Character:
245 18 ^aThe ... annual report to the Governor.
245 10 ^aBehold man :
245 12 ^aA chorus line /

Offset Assumes Punctuation Ignored:

245 10 ^a--as others see us.
245 14 ^aThe “winter mind” :
245 10 ^a[Diary]
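As a minimal sketch (not a full filing routine), applying the nonfiling indicator is just a character count, which is exactly why the second group of examples above misbehaves:

```python
# Sketch: deriving a filing form from a MARC 245 nonfiling indicator.
# Real filing routines must also skip leading punctuation; this
# deliberately does not, to show the mismatch.
def filing_form(title: str, nonfiling: int) -> str:
    """Drop the number of nonfiling characters given by the indicator."""
    return title[nonfiling:]

print(filing_form("A chorus line /", 2))
# chorus line /
print(filing_form("The \u201cwinter mind\u201d :", 4))
# the offset lands on the opening quotation mark, not on the
# filing letter "w"
```

The offset works when only an article and a space precede the filing character, and breaks whenever punctuation intervenes, which is the inconsistency the text describes.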
The initial articles of section titles are recorded, but cannot use the nonfiling character offset. A repetitious variant title compensates, capitalizing the filing word that was previously lowercase:

245 14 ^aThe lord of the rings.^pThe two towers /
246 10 ^aTwo towers

Despite AACR’s emphasis on description, initial articles for variant titles are not accommodated. They are discouraged on uniform titles, although MARC defines appropriate nonfiling indicators. Diacritics further muddy the picture. Current practice appears unnecessarily constrained by past limitations. Titles should be treated consistently. In XML, all initial articles, including those on names and section titles, could be accommodated by a single mechanism, attributes. The emphasis on abbreviations is another holdover from the paucity of card real estate. Many abbreviations are not recognized internationally. Often, they serve to obscure keyword retrieval, since they spell other words. Users are rarely aware of variations due to context. Prescribed forms of abbreviations are not routinely found in authority records. But efficient data entry need not be compromised, as brief forms can be expanded automatically. Seven of the following nine abbreviations spell other words:

Alta.    Alberta
arr.     arranger
ill.     illustration, -s
Ill.     Illinois
La.      Louisiana
ms.      manuscript
mss.     manuscripts
port.    portrait
ports.   portraits

Abbreviations need to be reviewed with consideration of their impact on retrieval. Enumerative rules tend to rigidify practice, rather than embrace the fact that change is inevitable and that computers are good at making substitutions automatically. A flexible framework distinguishing between constants and variables can accommodate change without having to be remodeled each time that new definitions, rule interpretations, etc., are required. These are independent of record content changes. The issues and examples presented here are not intended to represent an exhaustive analysis of the problems in MARC and AACR or completely considered recommendations for changing current cataloging practices. They do highlight the need to ensure that future rule revisions, and hopefully a new international code, will be based on research and practical experience. The pervasiveness and distributed nature of the Web, with nearly instant navigation of relationships, provide a unique opportunity to better
integrate library information in this de facto digital environment. Opportunities of this magnitude are rare indeed. A more extensive treatment of many of the issues raised in this section can be found in documentation for XOBIS (Lane 2002b), which is discussed next.
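Returning to the point above that brief forms can be expanded automatically: a minimal sketch might use a case-sensitive lookup table like the one below (an illustrative subset; a real routine would need context, e.g., place name versus descriptive term, as the Ill./ill. pair suggests).

```python
# Sketch: automatic expansion of brief data-entry forms. Matching is
# case-sensitive, and the trailing period keeps "ms." from eating the
# start of "mss." and "port." the start of "ports.".
import re

EXPANSIONS = {
    "Alta.": "Alberta",
    "arr.": "arranger",
    "Ill.": "Illinois",
    "ill.": "illustration",
    "La.": "Louisiana",
    "ms.": "manuscript",
    "mss.": "manuscripts",
    "port.": "portrait",
    "ports.": "portraits",
}

def expand(text: str) -> str:
    """Replace known abbreviations with their full forms."""
    pattern = re.compile("|".join(re.escape(k) for k in EXPANSIONS))
    return pattern.sub(lambda m: EXPANSIONS[m.group(0)], text)

print(expand("ports. of the arr."))
# portraits of the arranger
```

Storing the full form and abbreviating (or expanding) on output is the flexible-framework approach: the substitution table can change without touching record content.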
XOBIS: Simplicity without Sacrifice?

Stanford University’s Lane Medical Library released the Alpha version of XOBIS (XML Organic Bibliographic Information Schema) in September 2002. This experimental model addresses many of the issues leveled against MARC and AACR and considers broader aspects of information management in a digital environment. XOBIS offers a unique, concrete alternative for reviewing current cataloging practices and for extending traditional bibliographic control into new areas.

The Changing Environment
The World Wide Web and XML offer new perspectives from which to reconsider cataloging—perspectives that prompt the questioning of conventional assumptions and bring new meaning to the concept of sharing. Is there too much emphasis on description when digital documents can provide 100 percent of content with just a click? What is the role of cataloging in this open environment? Is it possible to coordinate far-flung efforts sufficiently to produce a coherent bibliographic whole that is virtually more than the sum of its parts? Traditionally, cataloging has been very rule-intensive—more of an art than a science. It often requires specialized knowledge and can involve consulting a wide range of tools beyond AACR2. The result of this cooperative effort does attain a reasonable degree of consistency resulting from the broadly shared application of the corpus of rules and practices. The vast scope and impressive degree of cooperation result in more broadly applied consistency than many other professions can claim. Bibliographic utilities aggregate these efforts, but with restricted access and in growing contention with emerging digital libraries. Due to the Web’s almost complete lack of consistency in regard to content, libraries are in a position to parlay a unique contribution—making the result of our collective efforts available directly on the Web. Of course, a ragtag band of librarians cannot expect to catalog the entire Web. There is, however, the potential to address the educational, scholarly, cultural, historically significant, and other selected areas in which we already exercise a degree of hegemony by virtue of our role in collection development and preservation. Libraries, museums, and related organizations essentially determine what is preserved for posterity and influence current trends by voting with their materials budgets. Melding together digital content that has been deliberately segregated for marketing purposes adds a new dimension to our task. 
The online catalog forms yet another information “silo” that begs integration with other resources, lest we totally bewilder the user. In an increasingly digital arena, what we select and catalog, or otherwise produce metadata for, is a continuation of our traditional role—updated as quality filtering or
virtual collection development and digital preservation. Time is a more indiscriminate filter. Historically, much unshepherded information has been lost arbitrarily. As all bytes are essentially equal, time already shows signs of being even more indiscriminate in regard to digital documents. Maintaining links for vanishing documents seems to be a growth industry. Many folks with their fingers in this digital gumbo do not share librarians’ values. Now that information is sexy, major competitors have appeared from various venues. They are not as constrained by tradition as are librarians. In a sense, nonlibrary organizations are attempting to reinvent cataloging, as they discover the intricacies of information management that catalogers have long known. To stem inroads into our profession, we need to reassess our approach. By combining our unique strengths with the best new technologies, we can become a formidable force for the subset of this environment that really matters—and defend our rightful position. By now XML is the obvious vehicle for libraries to embark on this transition. We can amplify our efforts by combining widely embraced technology and a carefully chosen mix of rigor and flexibility in our cataloging and information management policies. This will necessarily involve change, but change in which we participate. There are many among us with considerable wisdom and a penchant for technology. Those with this inclination are the likely ones to build on our collective experience and the incredible depth of analysis that has gone into addressing many cataloging problems over the years. The process of combining tested best practices in building bibliographic resources with new technology affords an unusual opportunity to incorporate changes to correct what has not worked well within current constraints. Incremental change cannot bridge the current disjunction between the way libraries are treating digital and traditional resources. 
Against this backdrop, the Lane Medical Library chose to investigate whether the development of a model schema could address the problems identified with MARC and AACR, as well as tackle the problem of integrating traditional cataloging data with a rapidly diverging body of digital library data. A summary of the resulting schema called XOBIS follows. This necessarily abbreviated explanation is supplemented by documentation on the XOBIS website, particularly the authors’ extensive introductory treatise (Miller and Clarke 2002).

Structural Overview of XOBIS
XOBIS endeavors to restructure bibliographic and authority data in a consistent and unified manner using XML. Unlike literal mappings of MARC into XML, this entirely new schema reassesses the MARC/AACR approach and attempts to balance their valuable precepts with the realities of an open web environment. Rather than solve one library’s need to integrate cataloging data with burgeoning local digital resources, our effort focused on a generalized solution that could offer an alternative and more expansive model of how this might be done on a larger scale. As in current practice, the level of detail recorded can vary according to external guidelines or policies. The homogeneous XOBIS framework broadly encompasses information managed by archives, libraries, museums, etc. The key features of XOBIS are the following:
1. XOBIS provides a structural framework for delineating a record’s control data, content, and relationships.
2. It assigns bibliographic and authority information to ten fundamental categories called Principal Elements.
3. It promotes separate records for each discrete entity, splitting rather than lumping to achieve more manageable data.
4. It represents each instance of a Principal Element with a primary Entry and optional equivalent entries called Varia and Substitutes.
5. It disambiguates instances of Entry and Varia by allowing Qualifiers to be constructed from the same ten Principal Elements.
6. It defines elements generically where possible for reuse.
7. It distinguishes metadata from content via elements’ attributes.
8. It supports the validation of allowable values recursively within the data, insulating the structure from change (the database contains the values, rather than an external list).
9. It provides a single, homogeneous Relationships mechanism to link any Principal Element instance to any other one, thereby unifying access and traditional relationships.
10. It represents Relationships as source-target pairs bearing navigational characteristics and information unique to the Relationship (type, quality, duration, etc.), rather than belonging to either the source or target record.
11. It controls Relationships recursively as Concept authorities.

The integrated design of XOBIS attempts to combine simplicity with comprehensiveness in order to resolve problems with the current bibliographic apparatus and to support functionalities that are not readily attainable using the current model. Due to inconsistencies in the degree of application of the voluminous MARC tagging, XOBIS concentrates on a core structural framework and provides a general mechanism by which to incorporate descriptive or secondary information that could be defined separately, if justified.
Further review of XOBIS will determine the need for additional specific substructures or another layer of superstructure. These could be added and revised after initial experience without affecting the crisp, tightly integrated core. To test the approach, the Lane Library anticipates mapping its quarter million records into XOBIS during 2003, expecting to learn valuable lessons. We seek input and collaboration in hopes of arriving at a viable version of the schema for practical application following this trial process.

Currently, the root element of XOBIS is RecordList, which in turn contains the fundamental Record element. A Record element consists of three required parts, each with further substructure:

• a ControlData element containing metadata about the Record
• any one of ten Principal Elements covering broad content categories
• a Relationships element containing one or more Relationship elements

XOBIS divides content into ten fundamental categories, amalgamating bibliographic and authority information into a single, integrated construct. These categories, designated “Principal Elements,” are mutually exclusive; however, their definitions and
the precise cusps between them remain under review. Each Principal Element serves as the nucleus of an individual XOBIS Record. All may represent the idea or notion of a given entity. “Notional” ones encompass those entities solely representing intangibles. “Substantive” ones encompass physical entities and closely allied intangibles, e.g., a fictional Being or a digital Work. These distinctions are pragmatic, not philosophical. Each of the Principal Elements may relate to any other of the ten (including another of the same category). “Relationships” exactly parallel the ten Principal Elements to provide structural coherency and to indicate the general nature of the relationship between any given pair of linked records. This parallelism is maintained by a required class attribute on the Relationship element involved. In the example below, an attribute with the value “conceptual” indicates that the target record is a Concept. This could be a “narrower” Concept than the Concept represented by a source record (see related reference; MARC Authorities 550), or a Concept record that is the “topic” of a Work record represented by a different source record (topical subject added entry; MARC Bibliographic 650). The distinction is made as part of the Relationship.
Since each Principal Element is equal, information pertaining to relationships among them is ideally restricted to a specific Relationship element. XOBIS segregates information about relationships because it is inherent in the Relationship itself, rather than belonging to either the source or target Principal Element. Thus, “narrower” or “topic” identifies the name or kind of relationship involved. Discrete details about a Relationship can specify its type (e.g., subordinate), chronological duration, degree (e.g., primary or secondary), and so on. This arrangement of Principal Elements with correlating Relationships provides more granularity than the traditional overlapping bibliographic categories of name, title, series, and subject. It provides a singular mechanism for consistently organizing all types of inter-record relationships, rather than just bibliographic ones. Figure 3-2 sketches out the working definitions of each Principal Element and indicates the parallel Relationship attribute for each. Nine of the ten Principal Elements may be regarded as derivatives of the core one, Concept. Four of them, Place, Being, Object, and Work, represent “substantive instantiations” of selected tangibles, i.e., those that may be collected, owned, licensed, etc. An “authority” attribute accommodates instances of these four that are not held, yet are needed for referencing. This allows Records for substantive elements to represent an instance in a particular collection and/or serve as an authority. For example, one museum holds the Hope Diamond (an Object), but other organizations need to reference the diamond as a subject. The same mechanism allows a Record for a Work, without holdings (akin to uniform title authorities), to serve as a virtual Work in organizing related titles. One record may serve either role, or both. 
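To make the division of labor concrete, the following sketch builds a hypothetical XOBIS Record and reads back its Relationship. The element and attribute names (Record, ControlData, Relationships, Relationship, the class attribute) come from the discussion above, but the exact serialization, the ID values, and the Name and Target children are our invention for illustration, not the published schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical XOBIS Record for the Hope Diamond (an Object).
# Element names follow the text; IDs and nesting details are invented.
record_xml = """
<Record>
  <ControlData><ID>rec-001</ID></ControlData>
  <Object>
    <Entry>Hope Diamond</Entry>
  </Object>
  <Relationships>
    <Relationship class="organizational">
      <Name>held by</Name>
      <Target>rec-077</Target>
    </Relationship>
  </Relationships>
</Record>
"""

record = ET.fromstring(record_xml)
rel = record.find("./Relationships/Relationship")
# The class attribute parallels the Principal Element of the target
# record: here "organizational" signals that the target is an Organization.
print(rel.get("class"), rel.findtext("Name"))
```

Note that the relationship data (its name, its target) lives entirely inside the Relationship element, not in the Object or the target record, just as the text describes.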
A Version element may exist for any of the four substantive Principal Elements to allow a single Record to represent closely related resources (Works), such as digital, print, reprint, and microform renditions where the content remains essentially the same. To avoid redundancy of data and staff effort, Version IDs permit recording Version-specific
Relationships that do not apply to the aggregate Record as a whole. XOBIS does not require this single-record approach, nor does it dictate policy in general, but instead provides structural support for an alternative approach to addressing this troublesome area. Currently, Version exists for Work and Object, but could be extended to the Being and Place Principal Elements as well. These cases are intriguing to consider because XOBIS’s potential for further bibliographic unification has not been exhausted. Personal name changes, pseudonyms, etc., could follow this pattern as Versions on a Record for a Being (one record per person) and yet retain separate identities for linking and retrieval as discussed. Similarly, a place that has changed names, e.g., Christiania to Oslo, could fit the same pattern. The emphasis in XOBIS on linked records versus
Principal Element (Relationship Attribute): Scope (Working Definition)

Concept (conceptual): Topical and/or categorical constructs (tangible or intangible) not otherwise instantiated
String (lexical): Individual or deliberately clustered words or phrases, including numbers, letters, etc.
Language (linguistic): Specific spoken, written, or signed communication systems
Organization (organizational): Organized groups, including jurisdictional subdivisions
Event (episodic): Named macro-events, naturally occurring or conducted by individuals or organizations
Time (chronological): Individual chronological values or ranges of values (periods)
Place (geographic): Structures, geographic locations, and jurisdictions, including extraterrestrial ones
Being (vital): Specific identities of tangible or intangible beings (living or dead) and/or personifications
Object (material): Manufactured, crafted, or naturally occurring things, excluding Place, Being, and Work carriers
Work (compositional): Artistic or intellectual creations, excluding those considered Place or Object
Figure 3-2 Working Definitions of XOBIS Principal Elements and Parallel Relationship Attributes. © Lane Medical Library, Stanford University
this option for clustered Versions needs further scrutiny. While both can coexist, the indication for one or the other is probably more of a policy matter. The remaining Principal Elements are solely “notional” in that they represent the idea of something rather than a specific instance, and thus cannot be held. They function as authorities only. This arrangement provides consistency and flexibility simultaneously. Figure 3-3 illustrates these basic organizational details, showing how nine of the Principal Elements derive from “collective” Concepts. The value “collective” is from Concept’s type attribute, which provides broad, mutually exclusive groupings for each Principal Element. Automobiles or Ford (Automobile Make) would represent collective Concepts, whereas a particular car would be an Object. Two other attribute values for Concept are “abstract,” Beauty, for example; and “specific,” for example, Heimlich Maneuver, a specific named intangible not covered by one of the other derived Principal Elements. The current attribute values are preliminary.
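The three preliminary type values can be pictured with a small sketch. The values (“collective,” “abstract,” “specific”) and the example Entries are taken from the text; the XML serialization itself is an assumption, not the published XOBIS schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical Concept records illustrating the preliminary values of
# Concept's type attribute, using the examples named in the text.
records = """
<RecordList>
  <Concept type="collective"><Entry>Automobiles</Entry></Concept>
  <Concept type="abstract"><Entry>Beauty</Entry></Concept>
  <Concept type="specific"><Entry>Heimlich Maneuver</Entry></Concept>
</RecordList>
"""

root = ET.fromstring(records)
types = [c.get("type") for c in root.findall("Concept")]
print(types)  # ['collective', 'abstract', 'specific']
```

A particular car, by contrast, would be an Object record, not a Concept of any type.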
Figure 3-3 Basic Organizational Detail of XOBIS © Lane Medical Library, Stanford University
Substantive Principal Elements illustrate another attribute, role, which indicates whether the record represents an “authority,” an “instance” held in a collection, or both. This permits consistent treatment of these elements, while keeping authorities that cannot have holdings discrete. Substantive Principal Elements may link to Items directly, or indirectly via Holdings, both of which are projected as separate extra-XOBIS schemas. The XOBIS schema is envisioned as the core of a larger suite of correlated schemas for the various types of information discussed earlier. Records for each instance or case of any of the Principal Elements carry a relatively unique identity expressed as an Entry element with a unique ID. This Entry may be accompanied by some descriptive elements, although most such information more properly represents various Relationships that a Record instance has to other instances of Records belonging to various Principal Elements. Equivalence relationships (i.e., synonymy, quasi-synonymy, and subsumption) are handled separately within a container element Varia, where each Variant functions as an alternate identity for the Record. There are also four elements called Substitutes: Abbrev, Citation, Code, and Singular. Values of these may be specifically indicated via a substitute attribute as replacements for an Entry in certain places where the full Entry is not desired. The Relationships container element handles most other relationships. Each discrete Relationship represents a source-target Record pair or a “blind” relationship when no target Record exists. The related Record need not reside in the same system. Each Record may relate to one or more Records representing any of the ten Principal Elements (including its own category), as shown in figure 3-4. Individual Versions may carry discrete relationships as well.
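The identity apparatus described above (a primary Entry, equivalent Variants inside Varia, and a Substitute) can be sketched as follows. The element names come from the text; the nesting, the ID syntax, and the substitute attribute value are our assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical identity block for an Organization record: one primary
# Entry with a unique ID, an equivalent Variant, and an Abbrev Substitute.
# The published schema may arrange these differently.
identity_xml = """
<Organization>
  <Entry id="org-17">Lane Medical Library</Entry>
  <Varia>
    <Variant>Stanford University. Lane Medical Library</Variant>
  </Varia>
  <Abbrev substitute="true">Lane</Abbrev>
</Organization>
"""

org = ET.fromstring(identity_xml)
variants = [v.text for v in org.findall("./Varia/Variant")]
# The Entry's ID is what other Records' Relationships would point to.
print(org.find("Entry").get("id"), variants)
```

Each Variant serves as an alternate identity for retrieval, while the Abbrev could replace the full Entry wherever a short form is wanted.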
The Name of each Relationship can be controlled by relationships to other Records within the same system, just one of the many manifestations of recursion found in XOBIS. Various types of relationships are defined as collective Concepts, and instances of these represent valid values for the Names of relationships—in other words, relationship authorities. There are 100 possible source-target pairs or categories of inter-record relationships in XOBIS. They are identified by parallel Relationship attributes. The first example below shows a Concept related to a Time, a conceptual-chronological relationship. The second indicates a Concept related to another Concept, a conceptual-conceptual relationship. In the third example, a Work is related to a Being, a compositional-vital relationship. The last case illustrates a compositional-geographic relationship.

Principal Element     Relationship     Principal Element
Phlogiston            flourished       18th Century
Phlogiston            supplanted by    Oxidation-Reduction
The Fountainhead      author           Rand, Ayn
The Fatal Shore       subject          Australia
This simple mechanism unites all inter-record relationships into a single structure. One-to-many relationships could be recorded as a single virtual or computed relationship,
e.g., looking up the 100 chapters (each a Work) in a book (another Work) or finding all the Works on a given topic (a Concept). XOBIS also supports “navigational” relationships, which are categorized to indicate the relative direction of a related record to the source one. The values are: preordinate, postordinate, superordinate, subordinate, associative, and dissociative. These are handled as the type attribute of a Relationship and are intended to support on-screen organization and navigation of this critical information. Other general categories of relationships, e.g., genealogical, may be defined as needed. An excellent overview of relationships was published recently (Bean and Green 2001). Elements are defined once and are reused as needed. For example, the Entry for a Record may have a Qualifier element to distinguish it from the Entry element of similar records; the Qualifier in turn is composed of one or more of the Principal Elements. Because a Qualifier uses the identical Entry element, or a designated Entry Substitute, a change to an Entry can be propagated throughout a database, including wherever the Entry is used as a Qualifier. This functionality would likely be included as part of a custom editor. This is not possible with MARC. The XOBIS structure is also insulated from change by carrying a minimum of data explicitly within the structure. Instead, such information is represented as Record content, making it relatively easy to add or change values and their definition or scope
Figure 3-4 XOBIS Source-Target Relationships © Lane Medical Library, Stanford University
without affecting the schema structure. For example, Serial (a Concept) may be established as having a categorical Relationship to Resource Type (another Concept). Changing existing values or adding new ones is a matter of database maintenance, rather than schema alteration. XOBIS uses the RELAX NG schema to control the values of fundamental attributes unlikely to change after initial testing. Other values, more akin to authority control, reside within the data and will depend on enforcement by a customized editor, similar to the control of Qualifiers discussed above. Recursive referencing of the ten Principal Elements in XOBIS eliminates structural redundancy and the potential for variation in the treatment of the “same” data element in different parts of the schema. For example, the Time element identifying a particular chronological value is the same one used to record the date of a record’s creation, the date of a work’s publication, birth and death dates in a personal name, a date qualifier in a uniform title, the duration of a relationship, etc. These may reference a defined Time Principal Element (authority) or simply imply that the undefined value is temporal (an uncontrolled term). This combination of rigor and flexibility makes for elegant simplicity, but requires awareness of the positional context to distinguish between the different usages. Though verbose, the schema’s granularity and consistency will make data easier to manipulate. The high degree of recursion can be awkward for humans; a planned XOBIS-specific editor will endeavor to make much of the similarity transparent for data entry. Recursion will also make developing the editor easier. In addition, more consistent and flexible indexing is anticipated. One of the highlights of developing XOBIS was the realization that the evolving structure of its schema reflected another case of recursion and tangled hierarchies so eloquently articulated by Douglas Hofstadter in Gödel, Escher, Bach (Hofstadter 1979).
This permitted wrapping up “loose ends” with further accommodation of recursion to solve design problems and to achieve a stable, coherent structure. While topic and category (form/genre) were initially separate, these were reinterpreted as kinds of Relationships. For example, a Work can belong to the class Periodicals or be about Periodicals. The value (Periodicals) and the relationships to it (Topic or Category) are all examples of the Concept element. By designating a particular Concept as an exemplar of another Concept, authority control can be self-referencing. A design criterion for the XOBIS editor will be to verify valid values for any given Concept by looking in the database to identify cases assigned to that Concept. By including definitions in each Concept record, the scheme also becomes self-documenting. The Alpha version of XOBIS treats Relationships as a special category of Concept. The design implies that Records have categorical Relationships to one of the ten fundamental categories (Principal Elements) that will be needed to “seed” a system. XOBIS represents an experimental model, one needing further analysis and testing. However, it is a concrete example of what a web-oriented schema for replacing MARC might look like. It should not be interpreted as minimizing the problems such an undertaking would entail, but hopefully it can help focus such problems in view of the realities of an increasingly distributed digital environment.
XML Tools: What Do You Want to Do Today?
The digital materials librarian at Lane Medical Library, on occasion, emphatically states, “XML does not do anything.” Smiling, we all nod in agreement; we know she is right. What then should we do with XML? There are many things that a library might do with data that has been marked up in XML. This chapter attempts to provide a general overview of some of the tools that a library interested in doing something with XML might want to use. There are many types of tools available to the librarian interested in working with XML. Some are well established, or are as well established as tools for such a new technology can be; others are a snapshot of the cutting edge of XML research. Some are owned and promoted by large multinational corporations; others are developed by individuals for purely pragmatic reasons. There are, of course, many others that fall between these two extremes. XML applications might be classified by what they accomplish: some edit XML documents, some transform XML, some display marked-up information in a user-friendly way, some store and index XML, and some do a number of these things. The reader should be aware that the applications discussed in this book are really just the tip of the iceberg. For a comprehensive list of free XML-related applications, consult Lars Marius Garshol’s list of “Free XML Tools and Software” (Garshol 2003); for a site that includes these tools and commercial alternatives to them, visit XMLSoftware.com (Tauber and van den Brink 2003).
OPEN-SOURCE SOLUTIONS

Readers interested in XML will probably also have heard of open-source software. For those who have not, open-source software or “free software” (which is actually a subset of open-source software) is software whose source code, the human-readable
instructions used to create the program, is made available for download with the program itself. Free software (“think free as in freedom, not free as in beer” [FSF 2002]) goes a step beyond open-source software by requiring that any changes made to the program be redistributed with the program under the same conditions. By requiring that changes made to the code be incorporated into the common code base, authors of free software and, to a lesser extent, open-source software guarantee that users of these applications continue to benefit from the free exchange of ideas fostered by such cooperation. It is not surprising that the free and open exchange of ideas appeals to many librarians. Not many have entered the profession to make their millions. Recently, open-source software has become as hot a topic as XML for many of the same reasons; the primary one being, in the authors’ opinion, that openness encourages discovery, and the discovery of new knowledge and scientific methods results in improvements that benefit our society as a whole; this is something in which many librarians are interested. Readers looking for a detailed exploration of the relationship between open-source software and the library community should read “Open Source Software and the Library Community” (Clarke 2000). For more information on current open-source projects in the library community, visit OSS4LIB (2003), a website dedicated to promoting open-source development in the library community. Open-source software is mentioned here because all of this chapter’s applications, with the exception of the first, a simple text editor, are open source. This is, in part, because we at Lane Library prefer to use open-source software; these are the tools with which we are familiar. However, it is also because open-source applications are often the ones changing the way XML is processed and managed.
There are many fine commercial alternatives, but often the functionality they provide does not surpass that of open-source applications. Open-source software is also highlighted here because libraries are occasionally faced with difficult choices as a result of fiscal limitations. While “free software” does not mean it comes without cost (e.g., initial setup and ongoing maintenance costs), the total cost of owning open-source software is frequently much less than the commercial alternatives.
XML EDITORS: MARK IT UP!

XML editors are probably the most common type of XML application available today; there are a wide variety of choices. They range from simple text editors which just enforce that a document be well-formed to editors that are a part of an XML, or programming, IDE (integrated development environment). Which editor a person chooses depends, in part, on what other tools are already being used. For instance, if a systems librarian is currently using the NetBeans Java IDE, the XML editor that comes packaged with that product will probably be the best choice because of its familiarity and convenience. This section may prove useful for librarians who are new to XML because it discusses some of the options to consider when picking an XML text editor. Even if a librarian already has an editor that does a good job, this section might prove interesting. The world of XML is rapidly changing; new products that do things better, and more efficiently, are appearing regularly. For instance, the XML eXchaNGeR (XNGR) discussed in this chapter is one example of how a modular editor and XML “desktop” might change the way librarians work with XML. Picking which editor will work best in a given library depends on a number of factors. Some of these considerations will be discussed in more detail under each of the following examples.
Simple Text Editors

What is the bare minimum one needs from an XML text editor? Since XML documents are just text documents, there is no need to use a specific XML editor. Any word processor, such as AbiWord, OpenOffice, or Microsoft Word, will do. Even the most basic text editor, for instance, the Notepad program that comes with Microsoft Windows, will work. XML, when viewed in a program like Notepad, looks just like any other text document. Probably the most important thing to consider when selecting a simple word-processing program to edit XML documents is to make sure that the program supports the character set of the document in hand. It is especially advisable to check for Unicode support. Many people use UTF-8 as the character encoding for their XML documents, but not every plain text editor will handle it. Notepad is a good choice for a simple XML editor because it has Unicode support. Once a text-processing program’s UTF-8 support has been verified, any type of XML document can be edited using the editor. This is one of XML’s strengths; since it does not rely on proprietary encodings to represent structure, XML ensures there is a low entry level for learning about and using marked-up text. This does not mean that a simple text editor is necessarily the best choice for an XML editor. When using an editor that is ignorant of XML, all the basic rules of XML well-formedness need to be enforced by the person working with the XML document. This means that proper XML syntax is the responsibility of the document’s creator; Notepad will not provide a warning if an XML document is missing a closing root element, for example. Even if a document’s author or maintainer does not mind enforcing XML’s criteria for well-formedness, there are other reasons for selecting an editor that is more XML-aware. Suppose, for instance, a document needs to be verified against a RELAX NG schema.
If a plain text editor like Notepad is used, verification must take place in a second step and will have to involve another program that processes the schema and XML document. Also suppose that the RELAX NG schema against which our XML document needs to be verified is very complex; if we are using a simple text editor, we must remember which elements follow other elements and which attributes are acceptable for any given element. Another reason one might not want to use a simple text editor to edit XML documents is that there is no way to display how the XML might look to a patron. Without using an external web browser like Mozilla, the document’s author cannot see what a stylesheet will do to the XML data. Also consider that since an XSL stylesheet—a set of instructions that tell an XML-aware browser how to display XML data—is represented
in XML, the document’s author might want an XML editor that can intuitively assist with the editing of XSL stylesheets, in addition to documents with data marked up in XML. Of course, external tools can be combined with a simple text editor to satisfy these other requirements, but why not select an editor that can do it all? Despite these shortcomings, probably everyone who has ever worked with XML has edited it in a plain text editor at one time or another. While this might not be the best alternative for a full-time editor, the convenience of being able to read XML in any text-based program is certainly one of XML’s strengths. Imagine, for instance, that an XML document has been sent to a machine on which there is no XML-aware editor. No problem! Just open the document in a basic text editor, edit it, and save it. If the document was in the Word format, it would have to be sent back to a machine with an editor capable of reading that proprietary format. A plain text editor might prove useful to someone just starting to explore XML; using a plain text editor, like Notepad, is a simple way to work with XML without having to download any extra software.
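The “second step” of checking well-formedness outside a plain text editor can itself be just a few lines of code. The sketch below is a generic example, not a tool mentioned in this chapter; it checks only well-formedness, not validity against a DTD or RELAX NG schema, which would require a separate validator.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    # Attempt a full parse; any syntax violation (mismatched tags,
    # missing close, bad nesting) raises ParseError.
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<record><title>Ok</title></record>"))        # True
print(is_well_formed("<record><title>Missing close</record>"))     # False
```

This is exactly the kind of error that Notepad would let pass silently and that an XML-aware editor would flag as you type.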
JEdit The next “level” in XML editing is to use a text processor that has been designed specifically for markup and programming languages. One such program is the versatile JEdit editor. A program like JEdit offers many features that a basic word-processing program will not. For instance, even though the gray-scale images in this book will not show it, JEdit color-codes the parts of an XML record in order to make it easier to read. Element values may be black, element names green, and XML tags red. These color schemes are configurable, allowing users to change JEdit’s default settings. In figure 4-1, the changes in color between elements, attributes, and text values can be seen in the shades of gray used to illustrate each. There are also other advantages to using an editor specifically designed to handle special types of text; for instance, XML documents can be indented and formatted. They can also be easily validated against an external DTD. Elements can be automatically completed and the next element can be intelligently supplied. In addition, the structure of an XML document may be shown in its hierarchical form in a separate window, as seen in figure 4-1. This external view can be used to navigate through the textual representation of the document. Since JEdit is a general-purpose markup and programming editor, these additional features may be added by downloading additional plug-ins from the JEdit website (Pestov 2003). Once these modules are downloaded, installing them is easy. JEdit also provides a built-in plug-in manager that will download additional modules from the JEdit website at the request of the user. JEdit plug-ins, like the program itself, are freely available and cover a wide range of extensions to the JEdit program. One particular strength of the JEdit program is its active developer community. It is easy to find just about any kind of plug-in for JEdit because the developer community is so large. 
JEdit is a good example of an open-source program that thrives because it meets the needs of the community that uses it. People develop for it because it provides a good foundation on which additional functionality can be built. Getting
support for the program is also easy, since many people using the program also participate in the discussion on the JEdit mailing list. While JEdit might make a fine editor for someone who also programs, or who works with raw HTML, its many features can be a little confusing to someone who only needs a simple XML editor. The program gets its well-earned reputation because of the wide variety of programming languages that it supports. This flexibility, however, means that there are many configurable features in the program that a user will need to adjust to suit a particular style or type of editing. For someone who likes editing raw XML markup, as many do, JEdit is the ideal choice. It makes reading and working with XML markup easy, and provides a number of enhancements that plain text editors do not provide. We find it easy to recommend JEdit, because we use it ourselves. It may not be the ideal choice, though, for a user looking for a WYSIWYG (“what you see is what you get”) interface to an XML document, or for someone who just wants a simple form-based editor. For a more abstract interface, users should consider one of the next two editors.
Figure 4-1 Screenshot of JEdit’s XML Tree Functionality Copyright © [2003] Slava Pestov (http://www.JEdit.org)
XMLOperator

XMLOperator is an editor that attempts to make the process of editing an XML document easier by abstracting the structure of XML into an easy-to-use interface. Instead of displaying raw markup, like Notepad or JEdit do, XMLOperator reproduces the hierarchical structure of XML in a folderlike display that uses text boxes, in a different frame, for the input of element, attribute, and note values. This does not mean that a document’s author is limited to this interface; the program also has the option to edit the raw XML markup as well. For this reason, XMLOperator makes a great choice for someone who has an existing DTD or RELAX NG schema against which a new or existing document needs to be validated. It also makes a good choice for someone who wants to create a new RELAX NG schema from scratch. XMLOperator is the first editor we have examined that separates the editing of a document’s structure from the editing of its content. While, in the upper frame, elements are shown in pseudo-markup, the structure of the document is represented by a hierarchy of folders that can be opened or closed to hide or reveal specific parts of the XML document. (See figure 4-2.) When an element, attribute, or any other part of an XML document needs to be edited, clicking on the component in the top frame reveals a simple form-based editor for that component in the lower frame. Changes made to content in the lower frame are automatically reflected in the document in the upper frame. This separation of the document’s structure and content allows for changes to be made to a particular part of the document in the lower frame while viewing other related parts in the upper frame. This cannot be easily done in an editor, like Notepad, that displays the whole document in a noncollapsible, text-based form. While XMLOperator is still relatively new compared with products like JEdit and Notepad, it shows great promise.
Since some of the program’s menus lack an intuitive organization, there are many “hidden” features that might prove useful to a librarian editing lots of XML documents. For instance, the program logs all changes made to a document. This log is recorded in XML, so it may be useful for creating a graphic representation of a document’s changes (using SVG, the XML-based Scalable Vector Graphics format) or for applying the same changes to a number of XML documents. There is also a way to associate two elements in different documents, and to merge two documents that have been associated with each other. Like JEdit, XMLOperator assists with the creation of a document based on the schema used to validate that document. If a user chooses to add a child element, the editor will provide a list of elements that are acceptable children according to the document’s DTD or RELAX NG schema. Individual elements may be copied and pasted using XMLOperator and, more importantly, large fragments from a document may be handled in the same way. These time-saving features can be invoked from the program’s menus or, quickly, through special keystrokes. Such shortcuts are common in text-based editors, but are not so common for form-based editors like XMLOperator. XMLOperator also associates an XSLT stylesheet with an XML document. When additions or changes have been made, the document can be processed using the associated stylesheet. The program does not at this time have a way to display the resulting
XML Tools: What Do You Want to Do Today?
151
HTML output (if that is what the transformation creates), nor does it have a way to associate a CSS stylesheet with the XML for viewing in the editor. For the time being, an external program, like Mozilla or Internet Explorer 6, must be used to view the edited document in its patron-friendly form. Figure 4-3 shows an HTML page created using the XMLOperator’s built-in “RELAX NG to HTML” XSLT stylesheet. The XML for this illustration is the same as that used in figure 4-2. Since one of the strengths of XMLOperator is its guided creation of documents or schemas that conform to a DTD or RELAX NG schema, the program also provides a way to register new DTDs and XML-based schemas against which new documents can be created. This is done through the “Document” menu (compare figure 4-2). Opening this menu, one sees the option to view or register a DTD or RELAX NG schema. To register a new one, assign a name, a full file system path, and an optional XSLT
Figure 4-2 Screenshot of XMLOperator Editing Data-Centric XML Copyright © [2003] Didier Demany (http://www.xmloperator.net)
152
XML Tools: What Do You Want to Do Today?
stylesheet; the program is then ready to create new documents based on it. XMLOperator comes packaged with schemas for SVG (Scalable Vector Graphics), XSL, RELAX NG, HTML (using the XHTML DTDs), and generic XML. XMLOperator is a good XML data editor for experienced and novice XML users alike. Though still in the early stages of its development, it is already a useful tool that is full-featured enough for daily use. While it does have some room for improvement, we expect to watch it continue to develop, especially as RELAX NG continues to grow in popularity. XMLOperator succeeds at allowing an author to edit XML data in a textbased and form-based manner, but it does not, nor does it plan to, as far as we know, support editing in a WYSIWYG manner. We encourage users interested in a simple, full-
Figure 4-3 An HTML Representation of XOBIS, an XML Schema Developed at Lane Library Copyright © Lane Medical Library, Stanford University
XML Tools: What Do You Want to Do Today?
153
featured XML editor to download XMLOperator (Demany 2003) and try it out; for those interested in an easy-to-use WYSIWYG editor, we recommend the BitFlux Editor.
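Because XMLOperator leans so heavily on RELAX NG, it may help to see what a schema of the kind it can create and validate against looks like. The following is a minimal sketch; the record, title, and subject element names are our own invention, not taken from any published schema:

```xml
<?xml version="1.0"?>
<!-- Minimal RELAX NG schema: a record with a required title
     and any number of subject elements -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="record">
      <element name="title">
        <text/>
      </element>
      <zeroOrMore>
        <element name="subject">
          <text/>
        </element>
      </zeroOrMore>
    </element>
  </start>
</grammar>
```

Registered with XMLOperator, a schema like this would let the editor offer title and subject as the acceptable children when an author adds elements to a record.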
BitFlux Editor

Though the development of the BitFlux Editor (BitFlux 2003) has lagged in the past, its developers have recently renewed their commitment to its improvement. This is fortunate, because the BitFlux Editor offers a unique set of features not found in any other WYSIWYG XML editor. For instance, the BitFlux Editor does not require any client software other than a web browser. This makes the BitFlux Editor, at the very least, a convenient choice. Be aware that BitFlux’s strength is editing XML documents that already exist. Since BitFlux uses a web browser to edit documents, the file to edit must already reside on the server. It would not be that difficult to write a small script to create new files, but this is an extra step not needed by editors that operate on local files. The reason the BitFlux Editor does not require any additional client software is that it is written entirely in JavaScript, the main scripting language of the World Wide Web. There is only one caveat: the web server that serves the BitFlux Editor pages must be configured to use PHP, a popular server-side scripting language. This dependency exists because the XML files must be saved back to the server. Websites on PHP (Apache 2003b) and the Apache web server (Apache 2003) provide information on how to build the server with PHP support. The BitFlux Editor works by loading an initial web page that then loads the editor’s scripts and XML configuration file; the configuration file tells the browser where the XML data, the XSL or CSS stylesheet, and the XML Schema are located. The editor can also accept other information passed in as HTTP parameters. One important limitation of the BitFlux Editor, which librarians considering it should be aware of, results from its use of JavaScript: a data file cannot be retrieved from a server other than the one from which the initial file and configuration file are retrieved.
The only way around this would be to set up a proxy so that the data file seems to be coming from the same server as the one from which the initial page was downloaded. As figure 4-4 illustrates, the BitFlux Editor fits seamlessly into the web browser’s window. The editing options are displayed at the top of the page, allowing convenient access to the editor’s available functionality. There is also, at the top of the page, the option to view the raw XML, or to refresh the page using the currently selected stylesheet. One thing many readers will probably notice about the BitFlux Editor screenshot in figure 4-4 is that viewing XML content in the browser, with the assistance of a stylesheet, allows the librarian to view the page just like the patron will see it. This simple approach to editing is especially useful for librarians who do not want, or do not have the time, to learn a markup language. Though the particulars of the markup language are abstracted by the WYSIWYG interface of the BitFlux Editor, knowing where one is in the predefined structure is important for all but the most trivial of XML documents. The editor accomplishes this task by a modest XPath expression that is displayed at the bottom of the browser window in the status bar. Clicking on different parts of the displayed page identifies the
location in the XML structure at which the cursor is positioned. This simple path-level interface allows the BitFlux Editor to provide a relatively uncomplicated interface to what might be an extremely complex document structure. When considering the BitFlux Editor, it is important that a library make sure that all its browsers are up-to-date. Several important features of the program will not work, or will not work as well, with older browsers. For instance, a browser needs to be able to turn on and off the cursor that displays on-screen. While one can edit without having the cursor displayed on the screen, it is much more difficult. For this reason, the developers of the BitFlux Editor suggest using, at the very least, Mozilla 1.0. Future support for Internet Explorer is planned. Figure 4-4 shows the cursor turned on in Mozilla 1.1; in the illustration, the cursor is positioned just after the line, “This is a list. . . a second list.”

Figure 4-4 WYSIWYG Editing with the BitFlux Editor Copyright © [2003] BitFlux Editor, BitFlux (http://www.BitFluxEditor.org)

The BitFlux Editor will appeal to librarians who do not want to become too closely acquainted with XML (or XHTML). It provides many conveniences designed to insulate the document’s author from having to know too much about the XML specification. For instance, do librarians really want to remember the XHTML/XML entity references for every diacritic they might have to use during the course of the day? The BitFlux Editor provides a simple pop-up window that displays all the possible diacritics associated with a particular character set. Inserting one into the current document is just a matter of clicking on it. (See figure 4-5.)
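The references hiding behind those clicks look like the following; this is a hedged XHTML fragment of our own devising, not taken from the BitFlux documentation:

```xml
<!-- "série" written first with numeric character references,
     then with a named entity defined by the XHTML DTDs -->
<p>s&#xE9;rie and s&eacute;rie render identically</p>
```

Note that named entities such as &eacute; are defined by the XHTML DTDs; a plain XML document knows only the five built-in entities, so numeric references are the safer choice outside XHTML.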
Figure 4-5 The BitFlux Editor’s Character Selection Tool Copyright © [2003] BitFlux Editor, BitFlux (http://www.BitFluxEditor.org)
Some librarians will prefer JEdit’s intelligent text-based approach. Others will prefer XMLOperator’s hands-on, data-oriented approach to editing. Still others will prefer to edit at the content level, all but ignoring the structure of the XML markup. For those more interested in content than markup syntax and structure, the BitFlux Editor is a good choice. Despite the fact that it has been sporadically developed in the past, the current version is mature enough (and fast enough) to be used to create and manipulate a wide variety of text-centric information. It is also actively undergoing an interface overhaul. It is possible that the BitFlux Editor could be used to process XML that is more data-centric, but it is probably best to use an editor like XMLOperator for this. If the reader likes a purely text-based approach, JEdit may be used regardless of the type of document (e.g., text- and data-centric documents are treated the same). On the other hand, if one is adventurous and willing to rethink what an editor should be, the XML eXchaNGeR might be a good option.
XML eXchaNGeR (XNGR)

The XML eXchaNGeR (Cladonia 2003) is the product of an open-source project that intends to reinvent the concept of what an XML editor should be. Rather than a monolithic program that does one thing, XNGR conceives of the XML editor/editing environment as a variety of services that can be incorporated into a single interface. An XML service, in this respect, is a group of processes, or procedures, that one might want to perform on a particular type of XML document or data fragment. A service is not handled by the program that provides the user interface, but by a variety of modules that can be plugged into the XNGR hub. For example, an XHTML document on a remote server might need to be edited in an environment that displays the page both as an editable hierarchy of XML tags and in the same way a browser would render it. A Scalable Vector Graphics document might need to be viewed using an SVG graphics program. A SOAP (Simple Object Access Protocol) XML fragment might need to be processed locally and returned to a remote SOAP server. Each type of XML document has different needs, so why should they all use the same generic XML editor? XNGR answers: “They shouldn’t. Each should be processed according to its particular type and unique needs.” To accomplish this, XNGR creates an XML “desktop” that serves as a hub for XML types. Each type of XML document needs to be registered with the XML desktop. The most basic example is an XHTML document. To see how the XHTML document will be displayed in a patron’s browser, the XHTML service must be registered. The same applies to more complex XML documents like XML SVG images, for instance. Once an SVG image’s type is registered with XNGR, the XML-based image format can be viewed using the Batik toolkit, an open-source SVG toolkit released by the Apache Software Foundation. Figure 4-6 shows an XNGR desktop that has both XML SVG and XHTML services registered.
Behind the XNGR desktop in figure 4-6 is an XML Explorer. XNGR’s Explorer acts like a file system for XML documents. Whether XML documents are stored locally
or remotely, they can be accessed from XNGR’s Explorer as if they were stored on a single local file system. XML files accessed by XNGR’s Explorer may also be subarranged by categories, effectively grouping different sets of files, projects, or XML services. Once files in the editor’s Explorer are associated with types registered with the XNGR desktop, a small icon will display beneath the file name; this indicates that the file may be processed by its related service. What happens to a file processed by its service depends on the service with which it is associated. For instance, figure 4-7 shows an SVG file that is displayed with the help of the Batik toolkit. Though XNGR is a relatively new product, its possibilities are exciting. Currently, the program includes services for SVG images, XHTML documents, SOAP fragments, and a few others. One possible downside of XNGR is that some programming code must be written to associate a new service with a document type. If all that is desired is a viewer that displays a formatted version of an XML document, then XMLOperator or JEdit might be a better choice. If, on the other hand, one is willing to develop a more sophisticated service, one that perhaps allows a document to be edited using an editor written specifically for that document type, then XNGR’s potential is attractive. Suppose that the library community agreed on an XML standard to replace MARC. The next generation of catalog editors could be services integrated into what Art Rhyno, a systems librarian at the University of Windsor, has called a “Library Application Framework,” rather than tightly integrated parts of an integrated library system. In the future, “best of breed” editors could be written and swapped in and out of our library systems without the massive upheaval currently caused by a change in ILS vendors. This, however, is still a dream. Editors like XNGR only hint at the future of XML development. Still, for those librarians who enjoy riding the wave of future possibilities, XNGR may hold interest.

Figure 4-6 XML eXchaNGeR’s Explorer Copyright © [2002–2003] Cladonia Ltd. (http://cladonia.com/)
Figure 4-7 Active SVG Module Running in XML eXchaNGeR Copyright © [2002–2003] Cladonia Ltd. (http://cladonia.com/)
XML TRANSFORMERS: THE CHANGING FACE OF XML

After editors, perhaps the second most popular type of XML tool is the XML transformer. These tools, which utilize XML’s powerful transformation languages (XSLT and XQuery), can be used to query and format XML documents in order to extract relevant information and prepare the documents for display or further processing. The output from an XML transformation might be XHTML, a CSV file, a PDF, plain text, an XML-based image file, or any number of other possible formats. Transformers have also been used to query large sets of individual XML documents. Since XML transformation languages like XSLT are themselves expressed in XML, XSLT stylesheets can be edited using a standard XML editor. This syntactical generality is perceived by many to be one of XML’s greatest strengths. As a result, an XML developer does not need to develop a new tool for every project. Even so, there are XML editors that are specifically designed to edit XSLT stylesheets. These make the process of working with XSLT stylesheets easier because changes can be made and results viewed in the same window. Unfortunately, for those interested in these types of editors, there are not, at this time, any XQuery-specific editors. If the W3C specification for XQuery becomes more popular, we might see some in the future. Though there are not as many transformation engines as there are XML editors, this is probably only because the ones that do exist work so well that there is little incentive to create new ones. For the purposes of this chapter, transformers will be divided into those that implement XSLT and those that implement XQuery. Of the two languages, XSLT will probably appeal to those readers familiar with XML’s other related technologies; XQuery often appeals to those who have previous experience with SQL, an ANSI-standard language used to query and extract information from relational databases.
Of the two XML transformation languages, XSLT is by far the most widely used. XSLT is also one of the oldest XML-related technologies.
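As a reminder of what an XSLT stylesheet looks like, here is a minimal sketch; the record and title element names in the source document are hypothetical:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Turn every record's title into an HTML list item -->
  <xsl:template match="/">
    <html>
      <body>
        <ul>
          <xsl:for-each select="//record">
            <li><xsl:value-of select="title"/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Any of the transformers discussed below can apply a stylesheet of this shape to an XML source.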
Saxon

The grandfather of all existing XSLT transformation engines is probably Saxon, developed by Michael Kay; Kay also actively participates in the ongoing development of the XSLT standard. Saxon is included here for many reasons, the primary one being that it is probably the easiest XSLT transformation engine to use. Simply supply the XML source file, its stylesheet, and the desired output file to the Saxon executable and the output is created. For an illustration of Saxon in use, see figure 4-8. In the illustration, no output file is given; when this happens, the HTML output is returned to the screen. In figure 4-8, the first word, “java,” calls the Java Virtual Machine and tells it to run the next word, “com.icl.saxon.StyleSheet.” This is the name of the Saxon class, or program, that takes the initial arguments we have supplied and starts the transformation engine. After this, options like “-a,” which indicates that the stylesheet is specified in the XML document itself, may also be included. In our example, though, we just provide the name of the XML file and its stylesheet. If we want to send the output to a file that can be placed on a web server, we would need to add the “-o” option followed by the name of the resulting file. This very simple example illustrates the easiest way to use Saxon to transform XML into HTML. Suppose, instead, that some method of dynamically rendering XML using an XSLT stylesheet is required. A command line will not work. For cases like this, Saxon provides a Java servlet, a web-based Java program, that takes an XML document and its stylesheet as input and returns HTML, or XHTML, to the web browser. A servlet container, like Apache’s Tomcat, is needed to use this functionality. Saxon can also be integrated with other applications if needed. However, most librarians unfamiliar with the Saxon program will probably first want to download the Java jar file, or Windows executable file, and experiment with the command line version, since it is the easiest to use. Saxon is a good choice for XML transformations because it is a very mature product. Saxon is one of the fastest transformation engines, and the Saxon website (Kay 2002) has a lot of documentation describing its many features and extensions. For instance, stylesheets using Saxon’s extensions can call and interact with Java classes as well as perform procedures that are not supported by the XSLT specification. Since many other XSLT engines have also started adding such extensions, the EXSLT group was formed to try to standardize their expression. Saxon supports the standardized extensions as well as many others. Kay recommends that the EXSLT versions be used in place of Saxon-specific extensions when possible.

Figure 4-8 XML Transformed Using Saxon and an XSLT Stylesheet Copyright © Lane Medical Library, Stanford University
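To give a flavor of what such an extension looks like, the sketch below uses the EXSLT common module’s exsl:node-set() function, which converts a result tree fragment into a node-set that XPath functions can operate on; the stylesheet itself is our own minimal example:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    extension-element-prefixes="exsl">
  <!-- A variable holding a result tree fragment -->
  <xsl:variable name="tree">
    <item>one</item>
    <item>two</item>
  </xsl:variable>
  <xsl:template match="/">
    <!-- Without exsl:node-set(), XSLT 1.0 cannot apply XPath
         functions like count() to the $tree fragment -->
    <count><xsl:value-of select="count(exsl:node-set($tree)/item)"/></count>
  </xsl:template>
</xsl:stylesheet>
```

Run under a processor that implements EXSLT, such as Saxon or Xalan, this should produce <count>2</count>.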
Cocoon and Xalan

Another stylesheet engine that is notable because of its inclusion in many XML-related projects is the Xalan XSLT processor. Since the Apache Software Foundation develops Xalan, it can often be found in larger Apache Software Foundation projects, like Cocoon. Cocoon is a presentation engine and web server framework that can be used to render XML documents, or XML data streams, in a patron-friendly format that is viewable from any web browser. With the assistance of Cocoon, a library’s website can be viewed in a web browser, on an Internet-capable cell phone, with a text-only browser, or through many other types of Internet-ready appliances. Xalan and Cocoon, along with the Apache project’s Tomcat server, enable a library to build its entire website from a well-organized collection of XML files. Cocoon, using Xalan, can also display data that comes from a database or external web service capable of emitting XML. Lane Library, for instance, uses Cocoon to run the library website. All the pages available from the library’s website are either marked up in XHTML or constructed from XML fragments that are then transformed by Xalan and the Cocoon infrastructure. Though instructions for setting up and using Xalan through Cocoon are beyond the scope of this chapter, a look at the Cocoon framework’s many features may help other libraries decide whether to investigate Cocoon further. Lane Library staff have been very pleased with the results of using the Cocoon software. Cocoon was started in an attempt to simplify the process of documenting all the software projects gathered under the Apache Software Foundation’s auspices (Apache 2002). Before Cocoon, most options for the management of a website involved trying to hide the complications of HTML by burying the structure of the site’s web pages under a WYSIWYG interface.
The Apache project recognized that rather than building on the existing, but limited, HTML architecture, web designers needed a new way of processing XML, a format without all the limitations of HTML. The Cocoon project calls this new way a “Separation of Concerns” (SoC) design. The four website concerns are Management, Logic, Content, and Style. The SoC approach improved the process of creating and maintaining websites by separating the concerns of website development into relationships between groups of people, each specializing in a different type of development. Management is related to each of the other concerns, but Style and Logic are each only related to Management and Content. Using the Cocoon framework, those working with the logic of a website
do not need to be burdened with areas that are not specifically related to their area of responsibility. For instance, using Cocoon, a library might assign the task of creating a website’s style, based on its content, to the library’s information services librarian, and assign the programming logic, which is also dependent on content, to a systems librarian. This separation of concerns ensures that each librarian only needs to be concerned with a particular responsibility, not with the website as a whole. As a result, the systems librarian does not need to be concerned with how the content will be displayed, but only with how it needs to be logically processed; the information services librarian does not need to know how the data is internally processed and manipulated, just that the results should be displayed to the patrons in a particular way. Cocoon provides five basic mechanisms that process XML documents according to the Separation of Concerns principle.

1. The first mechanism is dispatching based on matchers. A matcher tries to match a URI with a specified pattern. Matchers specify a particular XML processing pipeline that a matching fragment or document should enter. In Cocoon, matchers may be basic wildcard matchers or regular expression matchers. Regular expressions are a complex pattern-matching language used by computer programmers. Other user-defined Cocoon matchers may also be incorporated into the process.

2. The second mechanism used by Cocoon is the generator, which generates compiled programs based on an XML source. The output of a program that has been generated using a Cocoon generator is an XML document. Compiled programs may be cached, for better run-time efficiency, and processed by different processing pipelines. The Cocoon package includes a file generator, a directory generator, an XSP (Extensible Server Page) generator, a JSP (Java Server Page) generator, and a Request generator. As with matchers, user-created generators may also be added.

3. The third mechanism used by Cocoon is the transformer. A transformer may be an XSLT transformer that uses Xalan to transform XML, or it may be a log or SQL transformer. These transformers all succeed because of XML’s uniform syntax; without it, standardized processing would not be possible or would, at the very least, be much more complicated. In general, transformers are used to map one XML structure into another. As with matchers and generators, transformers may be created by the user and added to the process.

4. The fourth Cocoon mechanism is the serializer. A Cocoon serializer renders an input XML structure into some other format. The serializers included with Cocoon include an FOP (Formatting Objects Processor) serializer, an HTML serializer, a text serializer, and an XML serializer. The FOP serializer produces PDF (Portable Document Format) files, the HTML serializer renders pages viewable in any browser, and the text serializer produces various formats of text. Adding new serializers to Cocoon is relatively easy.

5. The last Cocoon mechanism is the sitemap. A Cocoon sitemap contains configuration information for the Cocoon engine. Sitemaps typically contain a list of generators, a list of transformers, a list of serializers, and a list of processing pipelines, among other things. The sitemap, like most other things in Cocoon, is an XML file that has a corresponding DTD. Sitemaps are generated by compiling an XML source into an object tree.
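These mechanisms come together in the sitemap. The fragment below is modeled on Cocoon’s sitemap conventions; the match pattern and file paths are hypothetical, and a real sitemap also declares the components it uses:

```xml
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <map:pipelines>
    <map:pipeline>
      <!-- matcher: any request ending in .html -->
      <map:match pattern="*.html">
        <!-- generator: read the corresponding XML file -->
        <map:generate src="content/{1}.xml"/>
        <!-- transformer: apply an XSLT stylesheet via Xalan -->
        <map:transform src="stylesheets/page2html.xsl"/>
        <!-- serializer: emit the result as HTML -->
        <map:serialize type="html"/>
      </map:match>
    </map:pipeline>
  </map:pipelines>
</map:sitemap>
```

Under a configuration like this, a request for catalog.html would be answered by generating content/catalog.xml, transforming it, and serializing the result as HTML.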
As is probably apparent, the Cocoon framework is considerably more complex than just using Saxon to transform XML documents. This is because Cocoon is a publishing framework that just makes use of the Xalan transformation engine, which is the Apache counterpart to Saxon. Keep in mind that it is possible to maintain a website of static XHTML pages that have been converted from XML using Saxon, but if a library wants to dynamically add, remove, and transform a variety of XML sources, a publishing framework like Cocoon should be considered. For more information on the Cocoon framework, visit its web page (Apache 2002). If, on the other hand, a librarian is responsible for maintaining a small or mostly static website, Saxon might be the better solution.
Kawa and Qexo

The last XML transformation engine this section covers is Kawa (Bothner 2003), a product affiliated with the GNU (GNU’s Not Unix) Project. Kawa is actually an evaluation engine that can process many different languages (including many that are not even XML-related) and create Java bytecode, a form of computer program run by a Java Virtual Machine. For the purposes of this section, we are interested in Kawa’s XQuery implementation. Readers might also be interested to learn that Kawa has a new XSLT implementation; it is not discussed in this chapter because it is still too unreliable for use in a production system. However, once Kawa’s XSLT implementation is more stable, the ability to compile XSLT stylesheets might yield improvements in the time required to transform XML into something a little more patron-friendly. The component of Kawa that is used to process an XQuery expression is called Qexo. Qexo is an XQuery-to-Java compiler: it can either evaluate an XQuery expression at run-time or save a query as a Java program that can be run at a later time. This means that Qexo can be executed from the command line, like the Saxon example shown in figure 4-8, or can be included in another program, as the Xalan XSLT processor has been incorporated into Cocoon. Unlike XSLT, XQuery queries are not expressed in XML, though they may contain XML components. There is an XML version of XQuery, called XQueryX, but it lags behind the development of XQuery. The fact that XQuery is not expressed in XML may be due, in part, to XQuery having been created as a query language for databases by people more familiar with SQL than with XML’s SGML document roots. While many of the existing native XML databases use XPath as their query language, usually with an extension for full-text searching, many in the W3C and database communities believe that XPath lacks the ability to serve as a robust XML query language. Whether XQuery, or XQueryX, will succeed as XSLT has remains to be seen.
Two examples of an XQuery query are included below. The first example contains XML components; the second does not.

let $i := 4 return let $r := "Value" return
<p>{$r} of 15*{$i} is {15*$i}.</p>

let $x := 2 let $y := 3 return 20*($x+$y)
In the first example, the two-line XQuery expression evaluates to: “<p>Value of 15*4 is 60.</p>”. The XQuery expression in the second example evaluates to 100. XQuery expressions support the FLWOR structure, which plays a role much like SQL’s SELECT statement; FLWOR stands for “for,” “let,” “where,” “order by,” and “return.” Variables can be assigned by using “let,” which in this context means: “let a variable have a supplied value.” An example of this is the first XQuery expression above, where the “i” variable is assigned the value 4. Loops can be accomplished by using the “for” statement; an expression is evaluated once for each item in a sequence. Conditions can be tested using the “where” statement. The “order by” clause allows for the reordering of query results based on a particular statement. XQuery’s “return” clause specifies the value that the expression constructs and hands back. In the second example above, the “return” marks the end of the expression; the processor evaluates what follows it and returns the result. Kawa’s implementation of XQuery, like most of the XSLT processors, is reasonably flexible. There is a command line interface; alternatively, an XQuery expression can be evaluated at run-time from within another, user-written, application using a function that evaluates XQuery statements dynamically. In addition, Kawa can compile an XQuery expression into a Java program that takes an XML file as its argument. Perhaps the most interesting way to use the Qexo program, though, is to compile a Java servlet from the XQuery expression. Doing this allows a transformation to be performed from within a servlet container like the Apache Software Foundation’s Tomcat. The XML to be transformed can be read from a file or passed into the servlet as a parameter. This approach, compiling an XQuery statement into Java bytecode, is more appealing than evaluating an XSLT stylesheet and XML document “on the fly,” as many XSLT processors do. While XQuery’s authors created it as a database query language, it may also be used to transform XML.
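The examples above exercise only “let” and “return.” A small sketch using the remaining FLWOR keywords follows; the books.xml file and its element names are hypothetical:

```
for $b in doc("books.xml")//book
where $b/price > 20
order by $b/title
return $b/title
```

This iterates over every book element, keeps those priced above 20, sorts the survivors by title, and returns their titles.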
For example, Per Bothner, the author of Qexo, has demonstrated on his web page (Bothner 2002) that an XQuery script can be used to create an online photo album from XML sources. This type of application is something for which XSLT has traditionally been used; Bothner does it quite easily, however, using a simple XQuery script. If it seems to the reader that XQuery and XSLT might be used to accomplish the same tasks, there is good reason for this. With the advent of XQuery, many in the XSLT community have sought to better define the differences between XQuery and XSLT. On the XML-DEV electronic discussion list, a list where the best minds in the area of XML development often post, Evan Lenz, a software engineer at XYZFind, asserted that XQuery does not warrant a new specification; he believes that XSLT could be extended to handle the things that XQuery accomplishes (2001). Lenz also notes that the authors of the XSLT 2.0 and XPath 2.0 specifications have already planned to implement many of the features that make XQuery useful. Why, Lenz asks, should we have a different standard when the current one could be extended? On the other hand, those who argue that the next versions of the XSLT and XPath specifications will accomplish what XQuery already does need to consider whether people using XSLT will expand their mind-sets and skill sets to include these new extensions to the current XSLT standard. Perhaps a new standard, to clearly delineate the uses for each technology, would be beneficial. For the time being, the cautious will continue to rely on XSLT, the time-tested solution, and the pioneers will prefer to get their toes wet in the new W3C query language, XQuery.
XML BROWSERS: NOT JUST ANOTHER PRETTY FACE

XML was created for the Web. For this reason, the most obvious browsers for XML are the existing World Wide Web (WWW) browsers. There is currently a wide variety of available WWW browsers, with various levels of support for XML. The most recent version of any popular browser should have no trouble displaying a simple marked-up view of an XML document. Most will also display a stylized view of the document if there is an associated CSS stylesheet. Support for XML is incomplete in older browsers, however. As a result, many people producing XML for the Web choose to convert it into XHTML, a format any browser can read. The purpose of this section, though, is to discuss browsers whose XML support is more complete. The most widely used browsers are, of course, Internet Explorer, Netscape, and Mozilla. There is also a new browser from the Mozilla project named Phoenix. Gecko, the rendering engine used by Netscape, Mozilla, and Phoenix, provides the same level of support for all three browsers; the most recent version of Internet Explorer also has good support for XML-related technologies. Another browser worth mentioning is Amaya. Though Amaya is not used by a large number of people, it has been developed by the W3C to preview new technologies. For this reason, its support of newer XML-related technologies, like XML SVG, is often better than that of the other, more popular, web browsers. Librarians interested in experimenting with new technologies may like Amaya; it also contains a powerful WYSIWYG editor that supports the editing and viewing of XML documents.
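Associating a CSS stylesheet with an XML document takes a single processing instruction. The following is a complete, if tiny, example; the file and element names are hypothetical:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="record.css"?>
<record>
  <title>Putting XML to Work in the Library</title>
</record>
```

If record.css contains a rule such as title { display: block; font-weight: bold }, a recent browser will render the title as a bold block rather than showing raw markup.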
Mozilla
The Gecko rendering engine that is used by Netscape, Mozilla, and Phoenix, the new lean web browser from the Mozilla project, is the best choice for an XML-aware browser. It is cross-platform, customizable (using XUL, an XML-based language that allows users to change the skin, or look, of their browsers), and fast. Perhaps more importantly, it is standards-compliant. Fortunately, it is also popular with library patrons. This means that promoting the use of the newest version should not be a difficult sell. Of course, a library needs to be prepared to support patrons using a variety of browsers, but we, as librarians, are also in a good position to recommend wise choices. Browsers based on Mozilla’s Gecko rendering engine are the “best of the breed” when it comes to XML-aware browsers.
XML Tools: What Do You Want to Do Today?
Some may be wondering, “Didn’t Internet Explorer win the browser war?” Though Netscape suffered a temporary setback around the time that AOL bought the company, after a year of rewriting and renewing their focus, the Mozilla project, the project from which the core of the Netscape-branded browser is taken, is strong and is once again releasing quality products. The Mozilla browser has caught up to and in many cases even surpasses Internet Explorer’s support for XML and XML-related technologies. An interesting resource that illustrates this is Neil Deakin’s “101 Things That the Mozilla Browser Can Do That IE Cannot” web page (Deakin 2003). He makes a compelling case for preferring Mozilla. So which browser should a library choose? Though many libraries may prefer the speed of the Phoenix browser, not many patrons, at this point, will recognize its name, making it more difficult for patrons to find their way onto the World Wide Web from the desktop. Some library system administrators may also have concerns about putting a browser’s beta version on the desktop. Other options include Mozilla and Netscape. Mozilla’s name is gaining in popularity, but the most familiar version of the three Gecko-based browsers is, of course, the Netscape-branded one. On the other hand, Netscape is often remembered as the browser that lost the World Wide Web browser war; name recognition is not always a good thing. One might suggest that we, as librarians, should be providing the best resources, regardless of name recognition, and teaching patrons how to use them. Those who subscribe to this opinion might want to install a stable, but still beta, version of the Mozilla project’s Phoenix browser and create an alias for it named “Web Browser.” Doing this provides patrons with a fast, simple browser without all the bells and whistles (e-mail client, HTML composer, etc.) that come packaged with Netscape, Mozilla, and Internet Explorer.
Others, who prefer a more cautious approach, should look into Mozilla as a solution to providing access to XML and XHTML. So what features make Mozilla, and its siblings, the best choice for an XML browser? Standards compliance is why we recommend the Mozilla browser, or any browser based on Gecko. In the past, this was not always true. There was a time, before Netscape’s purchase by AOL, when Netscape and Internet Explorer were extending the HTML standard at an alarming rate. These proprietary extensions were designed to provide the latest, coolest features before the other browser did so. Unfortunately, the result of this was that different browsers implemented different features. Web designers would write web pages for one or the other browser, and users were often stuck with trying to read pages that required features not supported by their choice of web browser. Netscape’s financial difficulties brought an end to this competition. Many proclaimed that Internet Explorer had “won the browser war.” It had, at least, added more proprietary features than Netscape could. Times, however, have changed from the “there is no tomorrow” style of web development that was prevalent when the Web first became popular. Now people want to be able to view any web page in whichever browser they choose to use. They want the ability to choose what they feel is the better product without worrying about whether a web designer somewhere is writing pages for a browser he or she thinks is superior. They want standards and browsers that support them. The Mozilla project has, from the start, made a strong commitment to the
standards promoted by the W3C. Internet Explorer, on the other hand, is still interested in trying to maintain its market share by locking patrons into proprietary “features.” With the release of version 6 of Internet Explorer, standards compliance has improved. There still remains, however, a reluctance to develop cooperatively. Microsoft, as evidenced by its official stand on public standards, still values proprietary “features” more than community-derived standards. It says, in effect, “Standards are fine, but only if you adopt what we do first as the standard.” In its own words: “Microsoft believes very strongly in Internet standards and the standards process, and is committed to implementing appropriate standards when driven by customer demand. However, standards compliance is part of a larger effort that includes many constituencies. By innovating, and driving customer requirements into Internet Explorer and then into the standards groups, we’ll make the Internet a richer platform for all users” (St. Laurent 2000). By supporting and promoting a browser that is devoted to successfully implementing standards defined by community processes, librarians ensure that viewing web pages never again becomes problematic because of the proliferation of proprietary extensions to the community standard. Librarians have the choice to improve the quality of the Web by promoting a level of consistency, and adherence to community standards, that supports innovation and the wholesale improvement of our patrons’ web-browsing experiences. Fortunately, this choice does not require us to make any sacrifices; there is no bitter medicine. Choosing standards means choosing a fast, easy-to-use, reliable browser based on the Gecko rendering engine.
Amaya
The Amaya browser (W3C 2003) is not the web browser we at the Lane Library would choose to provide to our patrons, but librarians developing XML applications might like it. Its user interface is plain, but as a product of the W3C, it supports a number of XML features that other modern browsers do not (or at least not without the assistance of external, third-party plug-ins). Amaya also includes a powerful WYSIWYG editor. The most notable thing about the Amaya browser is its internal support for a number of new and experimental technologies. Like most XML browsers, Amaya supports XML, CSS, and HTML. Like some other XML-capable browsers, there is support for newer technologies like SVG, the Scalable Vector Graphics format. Unlike most other browsers, Amaya has built-in support for standards like MathML and Annotations. We should note that newer Mozilla versions support MathML; there is also a plug-in for Mozilla that gives it functionality similar to Amaya; Mozilla’s support for SVG, though, seems behind that of Amaya. At any rate, Amaya packages these technologies as a part of the browser; with Mozilla, separate bundles must be downloaded. MathML, which is very popular in the scientific community, is a markup language for expressing complex mathematical equations. This allows them to appear on web pages without having to be encapsulated in an image. Annotations is an XML/RDF standard that allows any person to annotate any web page, even a page that is not owned by the annotator. These annotations are saved between sessions of viewing
a web page. The idea is to enable people to create an individual or community bulletin board that can be used to annotate or discuss a website’s content. In figure 4-9, we see an example of the types of mathematical notation that are supported by the Amaya browser. Since MathML is represented in the same way that XHTML or XML SVG is, equations can be created, and even dynamically changed, by using the Document Object Model (DOM) mentioned in chapter 2. To assist in creating mathematical equations, and to prevent the average user from having to work with the DOM directly, Amaya also supports WYSIWYG editing of MathML. A special drop-down menu in the Types menu, seen in the menu bar of figure 4-9, provides access to Amaya’s special MathML features. Amaya makes a nice, easy-to-use editor for the creation of these pages. Changing or transforming a MathML equation is also easy to do using the Amaya browser. From the browser’s Edit menu, select the option to Change or Transform. These options work with both MathML and XHTML structures. Another feature that works with both
Figure 4-9 MathML Displayed in the Amaya Editor/Browser Copyright © [2003] World Wide Web Consortium (Massachusetts Institute of Technology, European Research Consortium for Informatics and Mathematics, Keio University). All Rights Reserved. http://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231
XHTML and MathML is the ability to use an XLink (discussed in chapter 2) to link between document components. This allows for a paragraph discussing a particular piece of a mathematical equation to point to that piece in the equation; particular pieces of equations can also be linked to their explanations, marked up in XHTML. In addition to MathML, Amaya has support for Annotations, the ability to annotate a web page and save that annotation for later review or publication. This is accomplished by using RDF and XML. RDF (Resource Description Framework) is a complex semantic layer implemented on top of XML; its purpose is to handle metadata in a uniform way. RDF, in theory, exists apart from XML and should be considered separately; XML is just the method through which RDF’s semantics are most commonly conveyed. Figure 4-10 illustrates some of the annotation options available using Amaya’s WYSIWYG editing interface. Annotations appear on the page as if they were a part of it, despite being stored in a separate file or annotation server.
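To make the MathML described earlier concrete, here is a small hand-written presentation-markup fragment (our own sketch, not taken from figure 4-9) encoding the Pythagorean equation; a browser such as Amaya renders it as a typeset formula rather than as an image:

```xml
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <!-- msup raises its second child as an exponent -->
    <msup><mi>a</mi><mn>2</mn></msup>
    <mo>+</mo>
    <msup><mi>b</mi><mn>2</mn></msup>
    <mo>=</mo>
    <msup><mi>c</mi><mn>2</mn></msup>
  </mrow>
</math>
```

Because the equation is ordinary XML, it can be searched, styled, and manipulated through the DOM like any other markup.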
Figure 4-10 Annotations Displayed in the Amaya Editor/Browser Copyright © [2003] World Wide Web Consortium (Massachusetts Institute of Technology, European Research Consortium for Informatics and Mathematics, Keio University). All Rights Reserved. http://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231
As mentioned previously, Amaya’s ability to record annotations is not limited to ones that are locally created and saved. There is the option to use an annotation server so that groups of people may annotate the same resources and share their annotations. This can be accomplished by using the Annotations configuration window in Amaya. In it, a remote annotation server can be configured. Once Amaya has been configured to use a remote annotation server, all annotations will be saved to the server rather than to the local machine. Just because Amaya can handle Annotations does not mean that an annotated page must display with all its annotations viewable. The default, in fact, is for a page’s annotations to not load automatically. In order to view a page’s annotations, the “Load Annotations” command must be called. If Amaya should always display annotations associated with a page, it is also possible to configure the browser to automatically load all annotations. Better yet, if some annotations should be displayed while others remain hidden, Amaya allows for annotation filters to be set, restricting the types of annotations that are displayed when viewing a page. Overall, the Annotations feature of the Amaya browser is extremely flexible. It works so well that some might want to use Amaya for this feature alone, relying on another XML browser to handle all other day-to-day viewing. Incorporating technologies like SVG, MathML, XLinks, and CSS into a reliable browser is an accomplishment. Amaya does it well and, as a result, is the W3C’s browser for implementing upcoming technologies. If more attention were given to improving the user interface, we wouldn’t be surprised to see Amaya contending for a position as a mainstream tool. Currently, Amaya makes a better editor than browser because of its basic interface. 
Other than its plain interface, our only real complaint is that links must be double-clicked; this is unlike any other mainstream browser and would probably be an inconvenience to patrons familiar with the single-click style. The real strength of Amaya, in the authors’ opinion, is not in its use as a generic web browser, but in its use as a development tool and technology preview. Being able to create XML expressions that render as mathematical equations or to annotate a page and later view, or share, these annotations is what makes Amaya worth a look. For a dedicated, full-time web or XML browser, though, Mozilla is probably a better choice.
CONCLUSION
Working with open-source software is not without its problems. Sometimes a new release has bugs that make it unbearable to use. Other times, a project that shows great potential will die out because the people involved could not spend time on it and were unsuccessful in attracting new developers, documenters, and managers. Many times, though, the community effort involved with such projects does create a product that is superior to others that cost thousands of dollars more. Open-source solutions are proof that “only the strong will survive”; if a project creates a product that others find useful, a small percentage of them will contribute, in a variety of ways, to that product,
strengthening the project and eventually attracting others. Projects that lack inspiration or that fail to produce a product that solves a real-world problem eventually die out. How does a library know if a product is worth local use? First, consider whether the product solves a real-world need in the library. If it does, but there are other open-source or commercial products which also satisfy that need, take a look at the developer community; viewing the electronic discussion list is usually the best way to do this. If the community has a lot of discussion and activity, the vitality of the product is strong and probably worth investigating further. If not, the product may still be useful to a library, but be aware that support for it may disappear at any time. Lastly, look at the organization under whose auspices the project is being developed. Projects developed by the Apache Software Foundation, the Free Software Foundation, and other large open-source or free software organizations are often well supported and fairly robust. These are unlikely to disappear without warning and, probably, much hue and cry. There are some, however, who believe that open-source software is too risky. These people may avoid using products like Apache, Linux, or MySQL and instead prefer commercial alternatives: Internet Information Services, Solaris, and Oracle. For people concerned about the longevity of open-source projects, there are also many commercial XML editors, transformers, and browsers that might suit their needs. At the Lane Library we have used some of these for short periods of time, but are more familiar with open-source alternatives. For more detail on each of these, visit the XMLSoftware.com website (Tauber and van den Brink 2003). Perhaps the foremost commercial editor and integrated development environment for XML is XMLSpy. We used it for a year or two before its inability to process large documents efficiently forced us to switch to other editors.
These problems have most likely been fixed in the current version, since support for the product is good. XMLSpy’s element-complete function was beneficial in saving many keystrokes. Support for the guided creation of XML documents in XMLSpy is also very good. One downside of using XMLSpy is that it only supports the Windows operating system. Unix, Linux, or Macintosh machines are not supported. In the realm of proprietary XSLT processors, Oracle’s are perhaps the most widely used. Probably because there are so many other fine, and free, XSLT engines, Oracle offers its XML tools for free. The tools, of course, are not Oracle’s primary business, so the company does not lose anything in the process. Of the commercially available tools, Oracle’s are very popular and often appear on XML benchmark sites, an indication that people are using them and want to know how well they perform. We at the Lane Library have also used Oracle’s tools for a short period of time. They are, like the open-source alternatives, easy to use, but unlike the open-source alternatives, they include special hooks into the Oracle application server. These extra features might be beneficial to libraries that are developing for the Oracle database exclusively. Using proprietary features, though, does result in vendor lock-in, something we avoid by using open-source tools that support Oracle, SQL Server, DB2, MySQL, and PostgreSQL databases. The tools covered in this chapter are not the only solutions, nor are the categories used to group them the only categories of XML software from which to choose. There
are XML editors that are specifically designed to edit and view XSLT documents; there are transformation applications that are designed to specifically transform the structure of an XML schema into a graphic representation; there are specific tools for almost every purpose. The categories used in this chapter are only the most basic ones; they are intended to guide the XML beginner to what we believe is the most appropriate set of XML-related tools. In addition, there are many XML applications, even some in this chapter, which blur the lines between the categories; they often act as integrated XML development environments. With these tools, a document can be edited, transformed, converted into a PDF, and posted on a web page with relatively few steps. Hopefully, this chapter has laid the foundation for further exploration of the many XML-related tools that are available. To get started with XML, the three most basic tools required are a text editor, an XSLT transformation engine (to turn XML into something patron-friendly), and a web browser to view the result of the transformation. Once these basic tools have been mastered, a librarian may move on to some of the more exotic XML tools, the kind used to work with the cutting edge, and future trends, of XML development.
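To make that starter toolkit concrete, a minimal stylesheet of the sort an XSLT engine would apply might look like the following sketch (the titles vocabulary is invented for illustration); it turns a simple XML list of titles into an XHTML bulleted list a browser can display:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- match the (hypothetical) titles root element -->
  <xsl:template match="/titles">
    <html>
      <body>
        <ul>
          <!-- one list item per title child -->
          <xsl:for-each select="title">
            <li><xsl:value-of select="."/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
```

Any of the XSLT engines discussed in this chapter will accept a stylesheet like this together with a source document and emit the XHTML result.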
The Future Is Now: Trends and Possibilities
XML offers libraries the strategic opportunity to implement the “library of tomorrow” today. By pragmatically using XML in our day-to-day activities, we take small steps toward improving the services that libraries provide. It is also important, in the authors’ opinion, to keep abreast of the latest developments in the XML world. Many of these advances were initiated by others who, out of necessity, solved problems that we too might face. Balancing the need to satisfy our daily responsibilities with the desire to improve the overall service that libraries provide is an important activity, one which requires patience and practical experience. This chapter attempts to integrate these two approaches to digital librarianship by looking at some existing, library-specific XML projects and some potential XML applications for the libraries of the future. It is not intended as a comprehensive overview of what libraries are currently doing with XML. For this, we recommend Roy Tennant’s exceptional book, XML in Libraries, published in 2002. Instead, this chapter presents a few examples of what we at the Lane Medical Library are doing with XML. In addition, several new XML-related technologies that may be important in the future are covered. By learning about them today, we believe, librarians can make themselves indispensable to the information seekers of tomorrow.
TRENDS AND FUTURE STANDARDS
New XML applications continue to emerge, allowing more sophistication and flexibility in the management of library content. Examples include XInclude, XForms, SVG, DocBook, and VoiceXML. Most of these are currently W3C recommendations, but those that are not are de facto standards because of their widespread use. Whether any of these will become as popular as XSLT remains to be seen, but several are poised to make significant changes in the way we process data. This section attempts to briefly highlight these new XML technologies.
XInclude XInclude provides a processing model and syntax for building large XML documents from smaller ones. This is similar, in concept, to Server Side Includes. The “inclusions” (i.e., the smaller constituents) may be complete XML documents, wellformed fragments of XML, or even non-XML text documents, such as the source code for a Java program. Each constituent or external document is identified by an “include” element from the XInclude namespace. The include element contains an “href” attribute, also identified by the XInclude namespace. This attribute identifies the URL associated with the included file: This big document is broken into pieces . . .
XInclude is recursive; each included file may include other files as long as loops are avoided. Using XPointers, it is also possible to select parts of one document for inclusion in another. This is unlike Server Side Includes, which incorporate complete documents. While XInclude functions, in a limited way, like XLink, XInclude has enjoyed more popularity to date. There are several implementations of the XInclude standard, but few complete XLink implementations. This is probably in part because the XLink standard is much more complex than the XInclude one. Even if the current XLink standard does not get widely implemented, XInclude allows some of XLink’s functionality to be implemented today; in the future, we expect this simple approach to internal linking to continue to grow in popularity.
XForms
In November 2002 the W3C released XForms version 1.0, enabling forms on the Web to be represented in XML. The main benefit of this release is that it separates the purpose, presentation, and results of a form on a web page. This makes working with, and managing, user-submitted data easier. Though there are many implementations of the XForms standard, not many web designers are currently using them on web pages. This may be in part because of the complexity of the standard. The authors expect that some sort of XML-based web form (a set of fill-in-the-box static or dynamic prompts that, when saved, creates an XML document) will eventually be widespread; it remains to be seen whether the current XForms standard is the one. Using XML to encapsulate form information is not a new idea. Before XForms became a recommendation, Nanoworks released an XML-based forms package that generates and validates interactive web forms. Unlike most current XForms implementations, the Nanoworks package works with existing JavaScript-capable browsers.
The XForms package is available on the Web (Smith 2000) under the conditions of the Free Software Foundation’s Lesser General Public License. Smith’s web page also has an interactive demonstration of the form library in action. Like the W3C’s XForms, Nanoworks’s XForm package uses XML to separate the management of form data from its presentation on the Web. Unlike XForms, Nanoworks’s XML forms are displayed in HTML, making them accessible to any browser. Future support for styling a form using CSS and XSL is in the works. An example of an XForm form is given below. (The form, titled “An Example Form,” indicates required fields with an asterisk, asks that fields marked in red be completed or corrected, and offers a Choose One selection of Staff, Faculty, or Student.)
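Because the original Nanoworks markup could not be reproduced here, the following is a hand-written sketch in the W3C XForms 1.0 vocabulary instead (element names follow the XForms recommendation; the instance fields and submission URL are invented for illustration):

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xforms="http://www.w3.org/2002/xforms">
  <head>
    <title>An Example Form</title>
    <xforms:model>
      <!-- the instance is the XML document the form fills in -->
      <xforms:instance>
        <data xmlns="">
          <name/>
          <status/>
        </data>
      </xforms:instance>
      <!-- required fields are indicated with an asterisk* -->
      <xforms:bind nodeset="name" required="true()"/>
      <xforms:submission id="register" method="post"
                         action="http://example.org/register"/>
    </xforms:model>
  </head>
  <body>
    <xforms:input ref="name">
      <xforms:label>Name*</xforms:label>
    </xforms:input>
    <xforms:select1 ref="status">
      <xforms:label>Choose One</xforms:label>
      <xforms:item>
        <xforms:label>Staff</xforms:label>
        <xforms:value>staff</xforms:value>
      </xforms:item>
      <xforms:item>
        <xforms:label>Faculty</xforms:label>
        <xforms:value>faculty</xforms:value>
      </xforms:item>
      <xforms:item>
        <xforms:label>Student</xforms:label>
        <xforms:value>student</xforms:value>
      </xforms:item>
    </xforms:select1>
    <xforms:submit submission="register">
      <xforms:label>Submit</xforms:label>
    </xforms:submit>
  </body>
</html>
```

Note how the data collected (the instance), the constraints on it (the bind), and the visible controls are kept in separate pieces of markup.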
The development of various XML-based form standards will continue. Implementations for the W3C’s XForms exist in Java, Flash, and a variety of other languages. Once a final standard is reached, all web browsers will probably support XML-based web forms. If this happens, XML will simplify the task of processing patron data submitted from a web interface.
Scalable Vector Graphics (SVG)
XML’s flexibility also extends to graphics. Scalable Vector Graphics 1.1 became a W3C recommendation in January 2003; version 1.0 had been approved in September 2001. SVG provides a language for describing a two-dimensional vector, or mixed vector/raster, graphic. This makes it possible to describe graphics using XML elements and attributes. The resulting images are scalable in that they may be increased or decreased in size without distortion. Instead of defining every pixel, vectors describe geometric objects such as lines and curves. In addition, TIFF, GIF, and JPEG images may be included in the display by linking a raster, or pixel-based, graphic into the SVG image. When an SVG image is displayed in a web browser, either natively or with the assistance of a plug-in, it and any linked raster images are displayed pixel by pixel, just as they would be if the image were a regular raster graphic. The XML markup that follows completely describes the image in figure 5-1, although color is not reproduced. Note that the text included in the image is searchable like any other XML markup. The comments in this markup, created by Charles Yates, indicate the different components of the markup.
Figure 5-1 Black-and-White Version of SVG Image Copyright © Lane Medical Library, Stanford University
[The SVG markup is not reproduced here; its searchable text content reads “LANE MEDICAL LIBRARY,” “School of Medicine,” “Medical Center,” and “Stanford University.”]
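To give a sense of what such markup looks like, here is a hand-written sketch (not the actual Lane Library file; the coordinates, sizes, and styling are invented) that draws a simple wordmark with the same text content:

```xml
<?xml version="1.0"?>
<svg xmlns="http://www.w3.org/2000/svg"
     width="400" height="120" viewBox="0 0 400 120">
  <!-- background -->
  <rect x="0" y="0" width="400" height="120" fill="white"/>
  <!-- a vector rule beneath the wordmark -->
  <line x1="20" y1="70" x2="380" y2="70"
        stroke="black" stroke-width="2"/>
  <!-- text remains searchable, unlike text in a raster image -->
  <text x="30" y="55" font-family="serif" font-size="28">
    LANE MEDICAL LIBRARY</text>
  <text x="30" y="95" font-size="14">
    School of Medicine, Medical Center, Stanford University</text>
</svg>
```

Because the lines and text are geometric descriptions rather than pixels, the image can be scaled to any size without distortion.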
Given this verbosity, why use SVG? Unlike raster images, such as GIF and JPEG, Scalable Vector Graphics images are scalable, text-based, zoomable (portions can be zoomed into), and can be searched with an ordinary search engine. SVG images are also scriptable. This means they can change dynamically based on user input or events that happen as a patron interacts with a web page. SVG also has the advantage, unlike similar formats such as Macromedia’s Flash, of being a W3C standard. SVG images hold promise for XML on the Web because most of today’s web graphics can be represented in the format. This opens the door for better image retrieval, because search engines would not be dependent on file names or related pages to describe the content of an SVG image. Perhaps, in the future, maps of the library or the organization of the stacks will be displayed in SVG. For this to happen, most web browsers would need to implement the standard either through the use of a plug-in or native programming code.
DocBook
There are many document-centric XML standards. Each serves a different purpose: EAD (Encoded Archival Description) describes archival materials; TEI (Text Encoding Initiative) marks up digital text, and is used by many major universities for their digital library initiatives; OEB (Open eBook) is a standard for the online publishing industry. Another document-centric XML standard that has benefited from wide use is DocBook. DocBook (OASIS 2003b) is an XML standard for marking up, and validating, technical manuals. This is important for libraries because much of our documentation is stored in a variety of formats. Some of it may be on the Web in HTML/XHTML, some of it may be in proprietary Microsoft Word formats, some may be in plain text files, and some may be on small sheets of paper taped to our monitors. Having a standardized documentation format means there is less variety to maintain. Using the DocBook format could save a library time and money. Some readers may be saying, “We have our documentation in a variety of formats for a reason.” Some of it needs to be on the Web, some of it needs to exist in our local file systems, and some needs to be taped to our monitors. This is fine. DocBook does not preclude this. Since DocBook is an XML format, it is easy to transform a DocBook document into XHTML using an XSLT stylesheet; if a PDF is needed to print a handout for patrons, DocBook documents can also be converted into PDFs. Maintaining only one source format simplifies the maintenance process. Transforming the source is easy because there are already a number of stylesheets written to convert DocBook documents into other formats (OASIS 2003b). DocBook was originally created as an SGML application, but as XML grew in popularity, so did the need for a standardized documentation format. An XML version of DocBook was created from the SGML version for this purpose.
In addition, a simplified DocBook DTD was created to ease the learning curve associated with the de facto standard. To ensure compatibility, simplified DocBook documents may also be validated using the full DocBook DTD. DocBook is probably the most widely used XML documentation format today. Since the DocBook format might be too difficult for some library staff to learn, libraries interested in the XML version of DocBook might want to look at the XMLMind Editor, a WYSIWYG editor that allows an XML document to be edited much as one would edit a Microsoft Word document. This is possible because one of the XMLMind Editor’s default XML formats is the DocBook format. The XMLMind Editor simply uses a built-in stylesheet to display the document in a user-friendly form. Those interested in investigating this editor further should download the free version from the XMLMind website; there is also a commercial version with additional features available (XMLMind 2003).
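As an illustration, a minimal DocBook article for a piece of library documentation might look like this sketch (the content is invented; the element names are from the DocBook vocabulary):

```xml
<?xml version="1.0"?>
<article>
  <title>Printing from the Public Workstations</title>
  <section>
    <title>Sending a Job</title>
    <para>Choose Print from the File menu and select the
      lobby printer at the circulation desk.</para>
  </section>
</article>
```

The same source file can then be transformed, with existing stylesheets, into XHTML for the website or into a PDF for a printed handout.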
VoiceXML
Having a public access catalog read aloud the results of a search may seem like the stuff of science fiction. However, one does not need expensive software to realize this
vision. One way it may be accomplished is through the use of VoiceXML, a markup language for the spoken word. A VoiceXML document may be read aloud by one of many freely available VoiceXML processors. Many of these may be found on the W3C’s VoiceXML page (W3C 2003b). VoiceXML could potentially enable patrons to access the library website and catalog/circulation system from any phone, and enable libraries to better support those with visual impairments and those who need access to the computer while their hands and eyes are busy. Like many of the technologies in this chapter, VoiceXML is not widely used. However, there are software applications for it. Development continues, and we believe that in the future VoiceXML may enable librarians to provide a higher level of service than they currently do. Since this technology is still in its infancy, it is likely that the work the W3C does will blaze the trail for future research and practical applications. It is also worth noting that the W3C has met some resistance among the developers of the VoiceXML standard. Since the committee to create VoiceXML was formed under an older agreement that did not sign patents over to the W3C, VoiceXML has been delayed by individual corporations that want to cash in on what they believe are their intellectual property rights. Whether this will stop the release and standardization of VoiceXML remains to be seen. For now, librarians should keep an eye out and watch the development of VoiceXML, and its related standards, on the W3C website (W3C 2003b).
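To illustrate, a minimal VoiceXML document of the kind such a processor reads aloud might look like this sketch (the prompt text is invented; the markup follows the VoiceXML 2.0 vocabulary):

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <!-- the processor speaks the prompt to the caller -->
      <prompt>Your search returned three titles.</prompt>
    </block>
  </form>
</vxml>
```

A catalog could generate documents like this on the fly, letting a patron hear search results over an ordinary telephone.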
OpenOffice, AbiWord, and Microsoft Word
One of the most interesting trends in the XML world is the shift from proprietary text-processing formats to XML-based ones. The latest announcement from Microsoft is that Microsoft Word will switch to XML in its next version (Microsoft 2002). While moving to XML is not that innovative (other word processors like OpenOffice and AbiWord have been using XML as a storage format for a while), it does mean there will soon be millions of new XML users. Both AbiWord and OpenOffice have had their own XML formats for some time. Both of these products, and the Microsoft Word product, use different schemas to mark up what word-processing programs do to documents. Add this to the already popular DocBook format and there are a wide variety of schemas available. Though XML is more flexible than other proprietary formats, this still means stylesheets must be written to convert from one format to another. An alternative to this would be to do what web browsers have done: start working toward a word-processing standard. If we had a standard, which word processor a person selected could be based on which is better rather than which outputs a particular format that one needs. Though OpenOffice, AbiWord, and Microsoft Word all support similar formats, a shared XML format would improve communication between people who use different word processors; a shared format would also prevent vendor “lock-in” for libraries. Toward this goal, OASIS has started work on a standard format for word processors. While this is certainly a worthwhile ambition, in order for it to work there must
The Future Is Now: Trends and Possibilities
be cooperation between all the major players. Getting Microsoft to work on a standard format might be difficult. At this point, the company has declined to participate in the standards process, stating instead that since Office 11 will support XML Schema, anything that the OASIS committee comes up with will also be supported by the new Microsoft product. Unfortunately, it seems that the new capability of Word to handle user-defined schemas will be limited to its pricey enterprise edition. Most Word users will not be able to use the new flexible word-processing features.
XML POSSIBILITIES

The Lane Medical Library uses XML for a variety of things. Our library website is created using Cocoon. We generate an online serials list from our catalog using XML; we update our PubMed “LinkOut” titles using XML; and we maintain our catalog with MARCUTL, an XML-based MARC update and transformation language. We are also working to convert our MARC records into XML using XOBIS. While there is not enough space to describe all the XML strategies we employ at Lane, we hope the ones described here will illustrate what can be done with XML.
Transitional E-Journals List

Like many scientific and medical libraries, Lane Medical Library maintains a list of its serials (currently over 14,500 titles). From 1985 to 1990, an annual printed list was produced from MARC bibliographic records, with considerable custom programming effort. This was abandoned due to various factors, but primarily because the online catalog was widely available and more up-to-date. With the rise of digital content, the library undertook to produce a list of digital periodical titles using HTML. As this grew to over 1,400 titles, it became increasingly difficult to produce; maintaining the list also duplicated technical services’ effort and was a challenge to synchronize with the online catalog. In 2002, as part of an overall strategy to modularize the management of our web resources, the Lane Library website was converted to XML using the Cocoon framework. To address the problem of currency, the library’s new e-journals list is generated into XHTML at approximately fifteen-minute intervals, as traffic dictates. Until bibliographic, authority, and holdings data can be converted into XML, we extract MARC records, convert them to XML, and transform them as roughly outlined below. The sequence of steps is controlled by another XML document.

• An SQL query to the Oracle database retrieves bibliographic and associated holdings records.
• MARC4J (a Java, event-based interface to MARC records) parses the records, creating Java objects.
• These Java objects are converted to an XML document and held in the server’s memory.
• A click on the “E-Journals” web page link passes a selected letter of the alphabet as a parameter (the default is “a”) and triggers the update sequence above to “refresh” data in memory at a maximum rate of every fifteen minutes.
• An XSLT stylesheet uses the parameters passed to it to transform the in-memory document into an intermediate document.
• XInclude adds boilerplate, and an XSLT stylesheet merges in common content for the website (using namespaces); this produces an XHTML document for display.
• A linked CSS stylesheet controls the appearance of the transient document.

The Lane Library adopted a single-record policy for serials after a few months of creating separate records for each digital version; the redundancy was confusing for both staff and users. In addition to print holdings, each digital version has a separate holdings record. Data unique to a specific version, such as aggregator, variant title, and form/genre terms, is maintained in the holdings record for that version. As a temporary measure, selected fields are automatically mapped to the bibliographic record using the XML-based MARC update and transformation language, MARCUTL. This ensures that information is available for indexing. For the example below, information is derived from the following MARC fields and subfields. The title is extracted from the bibliographic record’s 245 field; we use subfields “a,” “n,” and “p.” Holdings information is extracted from the holdings record; we use subfields “v,” “y,” and “z” of the 866 field. Finally, version information is extracted from the holdings’ 844 field’s “a” subfield. The following example shows how one title with three versions is displayed.

Science (v. 271, 1996-) (HighWire)
Science (v. 1, 1895-) (JSTOR)
Science (v. 259, 1993-) (Ovid)

For a full example, visit the serials list (Lane 2003). An excerpt of the page markup underlying the above example is reproduced below.
For more details, visit the website and choose “View Source” from within the web browser.
Science (v. 271, 1996-) (HighWire)
Science (v. 1, 1895-) (JSTOR)
Science (v. 259, 1993-) (Ovid)

The following XML fragment illustrates the ad hoc elements used in one of the earlier processing steps. Eventually, these elements will be replaced by XOBIS elements; e.g., JSTOR would be designated a version and would not need to be mapped using temporary elements. When this happens, punctuation will not need to be stripped
before the content is displayed. An example of the transitional XML format we are using is as follows:

Science [print/digital]. v. 271- = 1996- HighWire http://www.sciencemag.org/
Science [print/digital]. v. 1- = 1895- JSTOR http://www.jstor.org/journals/00368075.html
Science [print/digital]. v. 259- = 1993- Ovid http://lml.stanford.edu/cgi-bin/ovid?100007529
Lane’s transitional procedure works well, but will be simplified when the serial content originates as XML from a XOBIS database. Once converted to XML, the data can be readily “repurposed,” in this case by using namespaces to integrate it with other XML data from the web page, where it is rendered using stylesheets. This flexibility is also based in part on the library’s data policies, e.g., naming versions. Though Lane’s serials list is created using a transitional strategy, we have been very pleased with the results.
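The flavor of the transformation step can be sketched in a few lines of Python. The element names below (journal, title, holdings, version, url) are hypothetical stand-ins for Lane's ad hoc transitional elements, and the punctuation-stripping mirrors the cleanup described above:

```python
import xml.etree.ElementTree as ET

# Transitional serials data. The element names are hypothetical
# stand-ins for Lane's ad hoc elements, not the actual schema.
SOURCE = """<journals>
  <journal>
    <title>Science [print/digital].</title>
    <holdings>v. 271- = 1996-</holdings>
    <version>HighWire</version>
    <url>http://www.sciencemag.org/</url>
  </journal>
  <journal>
    <title>Science [print/digital].</title>
    <holdings>v. 1- = 1895-</holdings>
    <version>JSTOR</version>
    <url>http://www.jstor.org/journals/00368075.html</url>
  </journal>
</journals>"""

def display_entries(xml_text):
    """Strip transitional punctuation and build display strings,
    e.g. 'Science (v. 271, 1996-) (HighWire)'."""
    entries = []
    for j in ET.fromstring(xml_text).findall("journal"):
        title = j.findtext("title").split(" [")[0]   # drop medium designator
        vol, years = j.findtext("holdings").split(" = ")
        entries.append(f"{title} ({vol.rstrip('-')}, {years}) ({j.findtext('version')})")
    return entries
```

In the production pipeline this role is played by an XSLT stylesheet, but the point is the same: once the data is XML, reshaping it for display is a small, mechanical step.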
Updating MARC with MARCUTL

As mentioned in the preceding subsection, the Lane Library processes all its MARC records to assist with the retrieval of information that sometimes gets buried in, or omitted from, the MARC format. We do this with the assistance of the XMLMARC program and MARCUTL, an update and transformation language. Both the program and associated XML-based mapping language were created at Lane and are available for download for free from the Medlane website (Lane 2002). This subsection does not attempt to explain the program or mapping language in depth, but merely highlights how Lane is using both to simplify its cataloging procedures. For a more detailed treatment, see Clarke (2002).
The XMLMARC program, a program that converts MARC into XML and enhances MARC records based on instructions in a MARCUTL file, has been used at Lane for several years. During this time it has undergone a number of changes. The program started as a student project to enable the library to convert its MARC records into XML. After this, the program was enhanced so that it could update MARC records based on a simple set of instructions described by an XML file; these instructions were the first iteration of the MARCUTL language. However, this early version of MARCUTL only permitted MARC-to-MARC updates. Later versions once again added the MARC-to-XML capabilities that were present in the original XMLMARC program. Prior to these later versions, files conforming to the MARCUTL “standard” were validated using a MARCUTL DTD. Since the early versions that supported the first iteration of the MARCUTL language, XMLMARC has been rewritten from its original XML Document Object Model design to a model that binds XML elements and attributes to Java MARC/MARCUTL objects. The new version of the program, and the enhanced MARCUTL language, now validated using a RELAX NG schema, are scheduled to be released in 2003. Like the original, the XMLMARC program and its new MARCUTL RELAX NG schema will be available for free from the Medlane website (Lane 2002). Once they are released, the website will contain links to documentation more current than that published by Clarke, though the latter may still be useful to those interested in learning about the process involved with updating MARC records using MARCUTL and XMLMARC. The Lane Library makes a number of automatic enhancements to its catalog records. Often, these enhancements are made based on the occurrence of certain conditions within the MARC record; other times, they are things we add to all newly imported records.
Though catalogers could make these changes, and do in other libraries, the types of changes that the program makes do not really require human intervention. For these routine updates, the XMLMARC program works fine, saving the time of the catalogers and freeing them to solve problems that do require human intervention. Another advantage to updating MARC with MARCUTL is that ongoing changes to the cataloging process do not require a programmer to write new code or modify existing programs and procedures. By using a plain-text XML file, catalogers can make changes themselves without having to schedule the time of a systems librarian. Some of the changes that we make to our records are not supported by the current Library of Congress MARC standard. We do this to improve access to information that is often hidden in the fixed fields and inaccessible to patrons through keyword searching. To make our MARC records conform to the national standards, MARC records are changed as we export them from our system. Some types of changes we make include moving information from the holdings record, where it is maintained, into the bibliographic record, where it is searchable; adding geographic relationships to records based on the occurrence of geographic subfields in subject headings; and removing MARC fields generated by previous processing (we run new record processing every fifteen minutes and update processing once a day). Reproduced below is one example of the type of updates we make to a record. This is a very simple example that adds a new 035 subfield 9, a local field indicating that the
record has been processed by the program. The content of the 035 subfield 9 in Lane records is the Lane control number. If the record was imported into our current integrated library system from the old system, it will already contain an 035 subfield 9 with a Lane control number. If the record has been created in the new system, the XMLMARC program assigns a unique Lane control number based on the 001 field and the type of record. Lane does this so that we have a unique control number that does not depend on our current vendor’s 001 field. We link to these numbers and, as a result, are not “locked in” to a particular vendor because of a dependence on its numbering scheme.
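The logic of that update can be illustrated in ordinary code. This is a sketch of the behavior only, not MARCUTL syntax; the dictionary record structure and the "L" prefix are hypothetical stand-ins, not Lane's actual numbering scheme:

```python
def assign_lane_number(record, prefix="L"):
    """Ensure a record carries a vendor-independent Lane control number
    in an 035 subfield 9. `record` is a simplified dict stand-in for a
    MARC record, keyed by tag; `prefix` is purely illustrative."""
    for field in record.setdefault("035", []):
        if "9" in field:          # imported from the old system:
            return record         # it already has a Lane number
    # new record: mint a unique number from the 001 field
    record["035"].append({"9": prefix + record["001"]})
    return record
```

The decoupling is the point of the exercise: links target the minted number in 035 $9, so swapping integrated library systems never invalidates a link, whatever the new vendor does with its 001 field.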
The MARCUTL file from which this excerpt comes has a filter above this section that limits this change to bibliographic records that are new (i.e., that don’t already have an 035 field). In this match, a MARC record is tested to see whether it matches the pattern. Since all new records will have a value in byte 7 of the leader (we know this is the leader
because of a field element that appears above the excerpt in the file) and a 001 field, all records will receive the processing specified by this part of the map. In the example above, a new MARC field and subfield are created by the “burst” element. This element tells the program to compare each case value against the “pattern” element represented by the XPath-like expression that is an attribute of the burst element. If the program finds a match, a new field with the characteristics described in the particular case is created. Contents for the new field can either be supplied as text in the update file or be taken from the MARC record. If they come from the MARC record, they are extracted using the XPath-like expressions given as attributes of the “source” element. The Lane Library is actively developing XMLMARC, and we encourage other libraries to download it and see if it suits their needs. It is currently released under a free software (open-source) license that permits modifications to be made. We would also encourage people to report problems with the program to us. For questions on how to use the program, or for more details about what it does, feel free to contact the XMLMARC mailing list (Lane 2003b).
Maintaining PubMed LinkOuts

Like many medical libraries, the Lane Library uses PubMed to locate journal articles. Of particular significance to our patrons is an Entrez feature that enables them to find out whether an article they want is held in a journal to which the Lane Library provides access, either digitally or through a physical copy. This service is called LinkOut. The National Library of Medicine (NLM), the entity that manages PubMed, has been on the leading edge of XML innovation for some time. Recognizing the importance of the service, Lane was one of its initial beta testers; we have been using the service since its inception and are very pleased with its results. LinkOut works by storing a list of journals and metadata about the journals in an XML file. The NLM uses this file to create links within the Entrez system. Using this feature, a patron can perform a search, find a relevant journal article, and link right to it through the PubMed interface. While LinkOut was initially used for online journals only, it has been expanded to include print journal holdings too. LinkOut is mentioned here because it uses XML as its data transmission format; it is one of Lane’s most popular services. The Lane Library started talking with the NLM about LinkOut in April 1999. Our first demonstration LinkOut page was created by Pam Murnane in September of that year. At the end of that month, she demonstrated our LinkOut page to the NLM Board of Regents. Lane, like most libraries that participate in the LinkOut service, maintains its LinkOut files by hand. We are currently investigating generating LinkOut files automatically using information found in our catalog and in a list of journals and providers that is available from the National Library of Medicine (NLM 2003). Since LinkOut records are currently maintained by hand, we use an online validator that informs us whether our LinkOut file is well-formed and valid.
This is maintained by the National Library of Medicine; we never have to worry that the DTD we
are using is no longer current. Since the DTD for LinkOut records is available online (NLM 2002), it is also possible to edit LinkOut files in an XML editor like the ones discussed in chapter 4 of this book. We have been satisfied with the NLM-provided online validators; when we switch to the automatically generated LinkOut file, we expect to guarantee programmatically that our XML is valid and well-formed. In the meantime, we will continue to edit our LinkOut files by hand. Reproduced below is an excerpt from our current file. The first thing one might notice is that there is an entity reference that represents a GIF image. This image is used as the Stanford icon in the PubMed display. When a journal article can be reached by Lane patrons, the icon is displayed as a clickable image. Clicking on the icon, through the magic of LinkOut, connects the patron to the article requested. This is negotiated using the information provided in the LinkOut file; the following is an excerpt from that file.

AAPS 3140 &icon; PubMed AAPS “AAPS PharmSci”[jour] /entrez/utils/pmliblink.cgi?id=&lo.id;&lib=stanford

AcadPres 3140 &icon; PubMed AcadPres “Genomics”[jour] “J Mol Biol”[jour] “Neuroimage”[jour]
/entrez/utils/pmliblink.cgi?id=&lo.id;&lib=stanford
LinkOut is a good example of how using XML as a transmission format can simplify the work required of an individual library. By establishing an XML schema, in this case a DTD, the National Library of Medicine has created a simple yet powerful language for describing how materials should be linked. Since the transmission data is marked up in XML, a generic XML validator is all that is needed; we edit our files in Notepad and then check them online in the validator provided by the NLM. Once we automate this process, something that would be far more difficult without a simple format like XML, we will be providing this service at very little maintenance cost to the library. We expect there will still need to be some human interaction in the process, but much of the routine work can be delegated to a computer program. By letting humans do what machines cannot do and machines do what humans do not need to do, XML makes library work more efficient.
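The well-formedness half of that check is easy to reproduce with any XML parser; a sketch using Python's standard library follows. (Validity against the LinkOut DTD is a separate step requiring a validating parser, such as the NLM's online validator.)

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML.
    This checks well-formedness only; it does not validate
    the document against the LinkOut DTD."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Feeding the function an excerpt with a mismatched or unclosed tag returns False, which is exactly the class of editing mistake hand-maintained files tend to accumulate.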
MARC to XOBIS in 2003 and Beyond

The Lane Medical Library has devoted considerable effort to creating an XML schema for bibliographic and authority data. The process involved a detailed study of other library-related XML schemas and a thorough look at the MARC format. In 2003, we expect to convert all our MARC records into XML for further experimentation. This section details those plans. The switch from MARC to XOBIS will be made using the XMLMARC program. We have, in the past, converted all of our catalog records into XML, but we did this using a DTD that was very close in its composition to MARC. Efforts like this are not unusual in the library community. The Library of Congress is currently promoting an XML “standard” for MARC records called MARCXML (LC Network 2003). This standard, like the original Medlane DTD, is a literal representation of the MARC format. Converting our records into XML will require that we create a MARCUTL map file to describe the types of transformations that should take place. Since converting MARC into XOBIS is not a one-to-one conversion, we are unsure whether changes will need to be made to the mapping language. In its current form, the map should be able to handle much of the transformation process, but some adjustments may be required. An excerpt from a MARCUTL map that is intended to convert MARC into MODS, a bibliographic format discussed in chapter 3, is reproduced below.
It is easy to see how basic transformations take place. If a MARC record has a 245 field with any of the subfields that appear in the pattern, it is considered a match. In the first match, two elements are created: the first, an element named titleInfo, is a child of a previously created mods element; the second, an element named title, is created as a child of the titleInfo element. It is the responsibility of the map’s creator to make sure the hierarchy created by the elements is logical. In the future, an XSLT stylesheet might be used to enforce the logical structure of the resulting XML document. MARCUTL’s MARC-to-XML mapping uses the same pattern referencing as the MARC-updating part of the language. By referencing the pattern through XPath-like strings, MARCUTL tells the XMLMARC program to access the part of the record represented by the pattern reference. Parts of a MARC record may be concatenated with other parts or with text supplied in the mapping file. Once we have completed the mapping from MARC into XOBIS, we plan on loading our new records into an eXist database (Meier 2003). We chose eXist for several reasons. The first is that it integrates well with Cocoon, the web application framework we have chosen to host our library website. The second is that it implements the XML:DB API (XML Database Application Programming Interface). The third is that it is a native XML database; this makes it easier to start experimenting, since data does not need to be normalized before it can be stored in the database. The staff at Lane also plan to implement XOBIS using a relational database. Since this will take more development time, the implementation of a native XML data store is our first priority. Storing XOBIS records in a relational database should be much easier than storing MARC data. In practice, MARC data is often duplicated in a relational database because of the enormous complexity involved with normalizing it.
Since XOBIS is, in essence, already “normalized,” mapping XOBIS to a relational database should be easier. After the database is built, the Lane Library will then work to improve patron access to our data and demonstrate why we believe XML offers librarians a strategic opportunity.
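The titleInfo/title mapping described earlier can be sketched conceptually in a few lines of code. This illustrates the kind of transformation the map specifies, not MARCUTL syntax; the simplified record structure (a dict of fields and subfields) is a hypothetical stand-in:

```python
import xml.etree.ElementTree as ET

def marc_title_to_mods(record):
    """Emit a <mods><titleInfo><title> hierarchy when a record's 245
    field matches; return None when the pattern does not match.
    `record` is a simplified stand-in, keyed by MARC tag."""
    f245 = record.get("245")
    if f245 is None:
        return None                                  # pattern did not match
    mods = ET.Element("mods")
    title_info = ET.SubElement(mods, "titleInfo")    # child of mods
    title = ET.SubElement(title_info, "title")       # child of titleInfo
    # concatenate title subfields, as the map's source references would
    title.text = " ".join(f245[c] for c in ("a", "n", "p") if c in f245)
    return mods
```

As in a MARCUTL map, the code itself is responsible for nesting titleInfo under mods and title under titleInfo; nothing enforces that hierarchy automatically.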
CONCLUSION

XML provides a generalized structure for approaching web-oriented information management and a means for solving complex problems using its relatively simple building blocks. This simplicity also makes XML accessible to beginners. To get started, one only needs to know the rules of well-formedness and have access to a plain-text editor. Working with XML can get much more complicated, but one does not need to know everything about XML and its related technologies in order to use it effectively. If time is limited, consider learning just one other XML standard, XSLT or CSS. Either of these, partnered with well-formed XML, constitutes a toolkit capable of reshaping the way we work with library information. XML affords librarians the opportunity to evaluate what works well in their local library environment, to participate in developing better standards for digital information, and to learn valuable, transferable skills for the future. Creating an XML schema for reference handouts, or for an online serials list, for instance, helps frame common problems in a way that causes us to reconsider the line between content and presentation, between product and service. If we think there is no room for improvement in the services we provide, it only means we are unaware that we could be doing better. XML’s presence in the library will continue to grow. Librarians will find innovative yet pragmatic uses for XML. In the end, remember that XML does not do anything; it is just a tool. What we choose to do with it will depend on the challenges we face and our ability to envision future systems. Roy Tennant (2002b) has called XML the “digital library hammer”; the authors suggest that XML is a tool for traditional libraries as well. Hopefully, this book has made you think about innovative ways to use XML in your library.
If you find some nails well suited to XML, please consider joining the XML4Lib electronic discussion list (XML4Lib 2001– ) and share your solutions with the community.
References
Apache (Apache Software Foundation). 2002. “Introducing Cocoon.” At http://xml.apache.org/cocoon/introduction.html (accessed 2 February 2003).
Apache (Apache Software Foundation). 2003. “Apache HTTP Server Project.” At http://httpd.apache.org/ (accessed 2 February 2003).
Apache (Apache Software Foundation). 2003b. PHP [website]. At http://www.php.net/ (accessed 28 January 2003).
Bean, Carol A., and Rebecca Green, eds. 2001. Relationships in the Organization of Knowledge. New York: Kluwer Academic Publishers.
BitFlux. 2003. “BitFlux Editor: A WYSIWYG XML Editor for Any Operating System.” At http://www.bitfluxeditor.org (accessed 2 February 2003).
Bothner, Per. 2002. “Per and Nathan’s Photo Gallery.” At http://pics.bothner.com/ (accessed 2 February 2003).
Bothner, Per. 2003. “Kawa, the Java-Based Scheme System.” At http://www.gnu.org/software/kawa/ (accessed 2 February 2003).
CIDOC (International Council of Museums. International Committee for Documentation). 2003. “CIDOC Conceptual Reference Model.” At http://cidoc.ics.forth.gr (accessed 17 April 2003).
Cladonia Ltd. 2003. “XML eXchaNGeR.” At http://www.xngr.org (accessed 2 February 2003).
Clarke, Kevin S. 2000. “Open Source Software and the Library Community.” At http://ils.unc.edu/MSpapers/2576.pdf (accessed 28 January 2003).
Clarke, Kevin S. 2002. “Updating MARC Records with XMLMARC.” In XML in Libraries, edited by Roy Tennant. New York: Neil-Schuman.
Deakin, Neil. 2003. “101 Things That the Mozilla Browser Can Do That IE Cannot.” At http://www.xulplanet.com/ndeakin/arts/reasons.html (accessed 2 February 2003).
Delsey, Tom. 2002. “Functional Analysis of the MARC 21 Bibliographic and Holdings Formats.” At http://www.loc.gov/marc/marc-functional-analysis/home.html (accessed 28 January 2003). Includes clarification of the relationships between data structures embodied in the MARC formats and the FRBR and AACR models.
Demany, Didier. 2003. “XMLOperator: An XML Editor.” At http://www.xmloperator.net (accessed 2 February 2003).
EAC (Encoded Archival Context). 2003. At http://www.library.yale.edu/eac/ (accessed 17 April 2003).
EAD (Encoded Archival Description). 2002. At http://www.loc.gov/ead/ (accessed 17 April 2003).
EDItEUR. 2002. EDItEUR [website]. At http://www.editeur.org/ (accessed 28 January 2003). Includes the ONIX schemas for the book and serials industries.
Forum for Metadata Schema Implementers. 2002. “Schemas.” At http://www.schemas-forum.org/metadata-watch/d29/d29.htm (accessed 30 January 2003).
FSF (Free Software Foundation). 2002. “Free Software Definition.” At http://www.fsf.org/philosophy/free-sw.html (accessed 2 February 2003).
Garshol, Lars Marius. 2003. “Free XML Tools and Software.” At http://www.garshol.priv.no/download/xmltools/ (accessed 2 February 2003).
Guenther, Rebecca, and Sally McCallum. 2003. “New Metadata Standards for Digital Resources: MODS and METS.” Bulletin of the American Society for Information Science 29, no. 2:12–15.
Hofstadter, Douglas R. 1979. Gödel, Escher, Bach: An Eternal Golden Braid. New York: Vintage Books.
IANA (Internet Assigned Numbers Authority). 2002. “IANA Home Page.” At http://www.iana.org/ (accessed 28 January 2003).
IFLA (International Federation of Library Associations and Institutions). 1998. Functional Requirements for Bibliographic Records. At http://www.ifla.org/VII/s13/frbr/frbr.htm (accessed 28 January 2003).
JSC (Joint Steering Committee for the Revision of Anglo-American Cataloguing Rules). 1998–99. “The Logical Structure of the Anglo-American Cataloguing Rules.” At http://www.nlc-bnc.ca/jsc/docs.html (accessed 28 January 2003).
Kay, Michael. 2002. “SAXON: The XSLT Processor.” At http://saxon.sourceforge.net/ (accessed 28 January 2003).
Kuhn, Markus. 2001. “A Summary of the International Standard Date and Time Notation.” At http://www.cl.cam.ac.uk/~mgk25/iso-time.html (accessed 1 January 2003).
Kuhn, Thomas S. 1996. The Structure of Scientific Revolutions. 3d ed. Chicago: University of Chicago Press.
Lam, K. T. 2001. “Moving from MARC to XML.” At http://ihome.ust.hk/~lblkt/xml/marc2xml.html (accessed 30 January 2003). Created 21 July 1998. Part 2 covers the handling of multi-script metadata.
Lane (Lane Medical Library, Stanford University). 2002. “The Medlane Project: Overview.” At http://medlane.stanford.edu (accessed 28 January 2003).
Lane (Lane Medical Library, Stanford University). 2002b. “XOBIS: The XML Organic Bibliographic Information Schema.” At http://xobis.stanford.edu (accessed 28 January 2003).
Lane (Lane Medical Library, Stanford University). 2003. “E-Journals: Lane Medical Library, Stanford University Medical Center.” At http://lane.stanford.edu/online/ej.html (accessed 28 January 2003).
Lane (Lane Medical Library, Stanford University). 2003b. “MedlaneXMLMARC Info Page.” At http://lane.stanford.edu/online/ej.html (accessed 28 January 2003).
LC (Library of Congress). 2000. “MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media. Character Sets. Part 2, UCS/Unicode Environment.” At http://lcweb.loc.gov/marc/specifications/speccharucs.html (accessed 28 December 2002).
LC Network (Library of Congress, Network Development and MARC Standards Office). 2002. “MARC Code List for Relators, Sources, Description Conventions.” At http://www.loc.gov/marc/relators/relahome.html (accessed 28 January 2003). Note especially part 1: Relator codes, and part 4: Term, name, title sources.
LC Network (Library of Congress, Network Development and MARC Standards Office). 2002b. “MARC Standards.” At http://lcweb.loc.gov/marc/ (accessed 28 January 2003).
LC Network (Library of Congress, Network Development and MARC Standards Office). 2002c. “MODS: Metadata Object Description Schema.” At http://www.loc.gov/standards/mods/ (accessed 28 January 2003).
LC Network (Library of Congress, Network Development and MARC Standards Office). 2003. “MARCXML: MARC 21 XML Schema.” At http://www.loc.gov/standards/marcxml/ (accessed 1 February 2003).
Lenz, Evan. 2001. “XQuery: Reinventing the Wheel?” At http://www.xmlportfolio.com/xquery.html (accessed 2 January 2003).
Li, Ying; Dick R. Miller; and Mary Buttner. 2002. “Bibliographic Data Mining: Automatically Building Component Part Records for e-Journal Articles on the Internet.” Journal of Internet Cataloging 5, no. 1:29–41.
Meier, Wolfgang M. 2003. “eXist: Open Source XML Database.” At http://www.exist-db.org/ (accessed 2 February 2003).
Microsoft Corporation. 2002. “Microsoft Releases First Beta of ‘Office 11.’ ” 22 October 2002. At http://www.microsoft.com/presspass/press/2002/Oct02/10-22Office11Beta1PR.asp (accessed 30 January 2003).
Miller, Dick R. 2000. “XML: Libraries’ Strategic Opportunity.” Library Journal 125 (“NetConnect” supplement): 18–20, 22. Also available at http://xmlmarc.stanford.edu/LJ/ (accessed 1 February 2003).
Miller, Dick R., and Kevin S. Clarke. 2002. “XOBIS: The XML Organic Bibliographic Information Schema.” At http://elane.stanford.edu/laneauth/XOBIS.pdf (accessed 28 January 2003).
NLM (National Library of Medicine). 2002. “LinkOut DTD Version 1.1.” At http://www.ncbi.nlm.nih.gov/entrez/linkout/doc/LinkOut.dtd (accessed 2 February 2003).
NLM (National Library of Medicine). 2003. [Untitled text file of journal titles and providers]. At http://www.ncbi.nlm.nih.gov/entrez/journals/prov_jour.txt (accessed 2 February 2003).
NLM (National Library of Medicine). 2003b. “Archiving and Interchange DTD.” At http://dtd.nlm.nih.gov/ (accessed 25 June 2003).
OASIS (Organization for the Advancement of Structured Information Standards). 2002. “OASIS Technical Committee: RELAX NG.” At http://www.oasis-open.org/committees/relax-ng/ (accessed 28 January 2003).
OASIS (Organization for the Advancement of Structured Information Standards). 2003. OASIS [website]. At http://www.oasis-open.org (accessed 28 January 2003).
OASIS (Organization for the Advancement of Structured Information Standards). 2003b. “OASIS Technical Committee: DocBook.” At http://www.oasis-open.org/committees/docbook/ (accessed 28 January 2003).
OCLC (OCLC Online Computer Library Center, Inc., Office of Research). 2002. “FAST: Faceted Application of Subject Terminology.” At http://www.oclc.org/research/projects/fast/ (accessed 30 January 2003).
OSS4LIB. 2003. At http://www.oss4lib.org (accessed 28 January 2003).
Pestov, Slava. 2003. “JEdit: Open Source Programmer’s Text Editor.” At http://www.jedit.org (accessed 28 January 2003).
Smith, Douglas. 2000. “XForm: Self-Validating Web Forms.” At http://xform.nanoworks.org/ (accessed 28 January 2003).
St.Laurent, Simon. 2000. “Microsoft DHTML Dude Disses Standards.” In XMLHack. At http://www.xmlhack.com/read.php?item=278 (accessed 27 January 2003).
Suber, Peter. 2003. “Removing the Barriers to Research: An Introduction to Open Access for Librarians.” College and Research Libraries News 64:92–94, 113. Unabridged version available at http://www.earlham.edu/~peters/writing/acrl.htm (accessed 23 January 2003).
Suber, Peter, ed. 2003b. “FOS Newsletter.” At http://www.earlham.edu/~peters/fos/fosblog.html (accessed 30 January 2003). This site also includes a guide to terminology, references, timeline of events, etc., in the Free Online Scholarship movement.
Tanenbaum, Andrew S. 2002. Computer Networks. 4th ed. Upper Saddle River, N.J.: Prentice-Hall.
Tauber, James, and Linda van den Brink. 2003. XMLSoftware.com [website]. At http://www.xmlsoftware.com (accessed 2 February 2003).
TEI Consortium. 2001. “TEI P4: Guidelines for Electronic Text Encoding and Interchange. XML-Compatible Ed.” At http://www.tei-c.org/P4X/ (accessed 28 December 2002).
Tennant, Roy. 2002. “XML: The Digital Library Hammer.” Library Journal Online. At http://libraryjournal.reviewsnews.com/index.asp?layout=articleArchive&articleId=CA156526&display=searchResults&stt=001&publication=libraryjournal (accessed 2 February 2003).
Tennant, Roy, ed. 2002b. XML in Libraries. New York: Neil-Schuman.
Unicode, Inc. 2002. “Unicode Home Page.” At http://www.unicode.org/ (accessed 28 December 2002).
VRA (Visual Resources Association, Data Standards Committee). 2002. “VRA Core Categories, Version 3.0.” At http://www.vraweb.org/vracore3.htm (accessed 28 January 2003).
W3C (World Wide Web Consortium). 1999. “Character Entity References in HTML 4.” At http://www.w3.org/TR/REC-html40/sgml/entities.html (accessed 15 January 2003).
W3C (World Wide Web Consortium). 2002. “Extensible Markup Language (XML) 1.1: W3C Candidate Recommendation 15 October 2002.” At http://www.w3.org/TR/xml11/ (accessed 28 January 2003).
W3C (World Wide Web Consortium). 2002b. “XHTML 1.0: The Extensible Hypertext Markup Language.” 2d ed. W3C Recommendation 26 January 2000, revised 1 August 2002. At http://www.w3.org/TR/xhtml1/ (accessed 18 January 2003).
W3C (World Wide Web Consortium). 2003. “Amaya Home Page.” At http://www.w3.org/Amaya/ (accessed 2 February 2003).
W3C (World Wide Web Consortium). 2003b. “W3C Voice Browser Activity.” At http://www.w3.org/Voice/ (accessed 2 February 2003).
W3C (World Wide Web Consortium). 2003c. “W3C: World Wide Web Consortium.” At http://www.w3c.org (accessed 28 January 2003).
XML4Lib Electronic Discussion. 2001– . At http://sunsite.berkeley.edu/XML4Lib/ (accessed 28 January 2003).
XMLMind. 2003. “XMLMind XML Editor: Product.” At http://www.xmlmind.com/xmleditor/ (accessed 2 February 2003).
Index
& (ampersand) in entity references, 18–20 < > (angle brackets). See Angle brackets (< >) ’ (apostrophe), 19 * (asterisk). See Asterisk (*) : (colon). See Colon (:) , (comma) as Boolean operator in DTDs, 53 = (equal sign) and attributes, 14 ! (exclamation). See Exclamation (!) - (hyphen). See Hyphen (-) () (parentheses) in DTDs, 53 . (period or dot). See Period (dot) + (plus sign) in DTDs, 52 # (pound sign) in XPointer, 73 ? (question mark). See Question mark (?) “” or ‘ ’ (quotes, double or single). See Quotes, double (“”) or single (‘’) ; (semicolon) in entity references, 18–20 / (slash). See Slash (/) ( ) (spaces). See Blank (keyboard space) character; Spaces ( ); White space [ ] (square brackets). See Square brackets ([ ]) _ (underscore) in element names, 9 | (vertical bar) as Boolean operator in DTDs, 53 A AACR (Anglo-American Cataloguing Rules), 116–32, 135 abbreviations in AACR, 135 AbiWord, 180–81 access, ease of, 99 access points. See Headings “actuate” attribute in XLink, 72 added entries and relationships, 125 ALA character set, 24 alphabetical lists. See also Headings in cataloging, 118 vs. relationships, 130–31 Amaya browser, 165, 167–70 American National Standard Extended Latin (ANSEL) code, 24, 25, 115
AML (Astronomical Markup Language), 39 ampersand (&) in entity references, 18–20 analytics in cataloging, 121–22 “ancestor” operators, 75 anchors and advanced linking techniques, 41 disadvantages, 71 angle brackets (< >) in tags, 5 in text, 19 Anglo-American Cataloguing Rules (AACR), 116–32, 135 “annotation” namespace in RELAX NG, 60 “Annotations” standard, 167, 169–70 ANSEL code (American National Standard Extended Latin), 24, 25, 115 Apache Software Foundation, 171 apostrophe (’), 19 applications forms design, 174–76 graphics, 176–78 sound, 179–80 technical manuals, 179 text-centric standards, 179 XInclude, 174 archival cataloging collections in, 122 EAD, 102, 179 titles in, 119 Archiving and Interchange DTD of NLM, 101 ASCII code, disadvantages of, 24 asterisk (*) in DTDs, 52 in validation attributes, 56 Astronomical Markup Language (AML), 39 attributes, 14–16. See also specific attributes, e.g., “Authority” attribute added to elements, 9 for date information, 107 definition, 6
attributes (cont’d) in MARC data elements, 108–10 naming of, 14–15 and validation in XML Schema, 56 vs. child elements, 16 and well-formedness, 27 in XSL FO, 85 “authority” attribute in XOBIS, 139 authority control FRANAR, 102 and MARC coding, 104, 113–15 of name-title entries, 119 relationships in, 112 and universal character set, 41–42 and XML, 101–2 authority records for publishers, 129–30 redundancy in, 124 relationships among, 118, 122 for suites of records, 126 author-title entries. See Name-title and authortitle entries B bibliographic control and transcription of titles, 125–26 and XML metadata, 98–99, 101–2 bibliographic formats in MARC, 103–4 BiomedCentral, 97 Biosequence Markup Language (BSML), 39 BitFlux Editor, 153–56 blank (keyboard space) character, 21. See also White space blocks of data and CSS, 88–89 book as model of information management, 37–38 Boolean operators in DTDs, 53 borders in stylesheets, 89 boundaries in schema design, 93, 94 boxes in stylesheets, 89 BSML (Biosequence Markup Language), 39 Budapest Open Access Initiative, 97 C capitalization conventions, 134 carriage return characters, 21 Cascading Style Sheets (CSS), 77, 85–90 case in element names, 10, 26 catalog records, enhancement of, 184–86 cataloging codes AACR. See AACR (Anglo-American Cataloguing Rules)
FRBR. See FRBR (Functional Requirements for Bibliographic Records) international, 130 cataloging practice, 117–20 CDATA (character data) definition, 6 use of, 20–21 character sets coding of, 22–23 and “encoding attribute,” 24 MARC, 115–16 supported by XML text editors, 147 and XML, 115 characters, invisible, and white space, 21 characters, nonstandard, 19–20. See also Entity references Chemical Markup Language (CML), 39 child elements definition, 12 vs. attributes, 16 child list in DTD, 51–52 “choice” conditions, 53 choice mechanism in RELAX NG, 62–66 circulation systems, 38 “class” attribute in XOBIS, 139 client-side display, 76 clustering technique (RLIN), 124 CML (Chemical Markup Language), 39 Coalition for Networked Information, 97 Cocoon stylesheet engine, 161–63 collections, archival. See Archival cataloging colon (:) in element names, 9 namespaces, 32 in XPath, 69 comma (,) as Boolean operator in DTDs, 53 comments definition, 6 use of, 20 component works, cataloging of, 123–24 Conceptual Reference Model, 102 conditional requirements in RELAX NG, 66 container elements definition, 8–9 identification of, 95 use of, 11–14 in XOBIS, 142–43 content advantages of XML formats, 98 availability of, 3–4 and semantic markup, 5 separate from display, 12, 40–41
Index structure of, 93 vs. metadata in MARC, 108 content models. See Document structures contents notes in AACR, 123–24 context in XPath, 68 context node in XPath, 69–70 control fields in MARC, 103–5 controlled vocabularies, 132–33 corporate bodies, subordinate, 127, 130–31 CSS (Cascading Style Sheets), 77, 85–90 D “dark data,” 4 data entry editors, 95 data independence and library systems, 40–41 data longevity and data persistence, 39–40 data models. See Document structures data structures and schema creation, 48 data types. See Datatypes database fields as markup language, 4 XLinks to, 72 database records and XML markup, 3–4 data-centric documents choice of text editor, 156 definition, 6 marked-up example, 28–30 mixed content in, 17–18 root elements in, 11 “datatypeLibrary” attribute in RELAX NG, 60 datatypes in RELAX NG, 66–67 and schemas, 31 in XML Schema, 54 dates, coding of in MARC, 105–7 in XML, 107 dates, display of in CSS and XSL FO, 88, 89 in XSLT, 81–82 delimiters in MARC, 110 “descendant” instruction, 74–75 descriptive cataloging cataloging practice, 117–20 and World Wide Web, 126–30 design of schemas, 94–96 diacritics and BitFlux Editor, 155 in MARC, 115 in UTF-8, 25
Digital Library Federation, 97 digital resources, description of, 116–17, 126 discussion lists library applications, 96 and selection of software, 171 display. See also Stylesheets language-specific presentations, 41 separation from content, 40–41 “show” attribute in XLink, 71–72 of white space, 22 and XML text editors, 147–48 and XML transformers, 159 display markup in XHTML, 33–35 in XML, 33 DocBook, 173, 179 DOCTYPE (document type declaration). See Document type declaration (DOCTYPE) document delivery using LinkOut, 186–88 document models. See Document structures Document Object Model (DOM), 44, 168 Document Schema Definition Language (DSDL), 32 document structures, 30–36 consistency in, 12–13 coordination of among libraries, 38 Document Type Definition. See Document Type Definitions (DTD) namespaces, 32 schemas, 31–32 stylesheets, 33 XHTML, 33–35 in XML, 6–8 document type declaration (DOCTYPE) definition, 6 and Document Type Definitions, 31, 49 entity references, 19 Document Type Definitions (DTD) definition, 31 entity references, 19 schema tool, 49–54 and validity, 30 document-centric XML standards, 179. See also Text-centric documents DOM (Document Object Model), 44, 168 dot (period). See Period (dot) DSDL (Document Schema Definition Language), 32 DTD (Document Type Definitions). See Document Type Definitions (DTD) Dublin Core, 102
E EAC (Encoded Archival Context), 102 EAD (Encoded Archival Description), 102, 179 EDI (Electronic Data Interchange), 39 EdItEUR as standard, 39 “edition” in AACR, 119 e-journals lists, 180–83 Electronic Data Interchange (EDI), 39 elements definitions, 5, 8–11, 55 identification of in schema design, 94 naming of, 9–11, 26 of XML documents, 6–8 empty elements definition, 9 use of, 16–17 and well-formedness, 26 Encoded Archival Context (EAC), 102 Encoded Archival Description (EAD), 102, 179 “encoding” attribute, 24 end of line characters, 21 entity references, 18–20 definition, 6 external and “standalone” attribute, 24 for non-keyable languages, 41 in PCDATA, 50–51 vs. CDATA sections, 20–21 and well-formedness, 27 “entry” in AACR, 118–19 equal sign (=) and attributes, 14 equivalence relationships in XOBIS, 142 exclamation (!) in CDATA, 21 in comments, 20 eXist, 189 EXSLT group, 161 extensibility of XML, 41 Extensible Markup Language (XML). See XML (Extensible Markup Language) extensible schema languages, definition, 54 Extensible Stylesheet Language Formatting Objects (XSL FO), 77, 82–85 “extension” element in XML Schema, 58 external DTDs and DOCTYPE, 49 F Faceted Application of Subject Terminology (FAST), 132 false drops and post-coordination, 133 FAST (Faceted Application of Subject Terminology), 132
“fer-ber.” See FRBR (Functional Requirements for Bibliographic Records) fixed-field codes in MARC, 113–15 flexibility in schema design, 95 of standards, 41 in XML, 38 “for-each” statements in XSL FO, 84 in XSLT, 79, 82 form letters, mixed content in, 18 formats, variation in, 122 formatting. See Display; Repurposing of content form/genre designations MARC coding for, 104, 108 vs. format coding, 122 in XML, 108 in XOBIS, 144 forms design, 174–76 FRANAR (Functional Requirements and Numbering of Authority Records), 102 FRBR (Functional Requirements for Bibliographic Records) history, 116 as standard, 102 VTLS support for, 124 Free Online Scholarship website, 97 free software, definition, 146 Free Software Foundation, 171 free-text searching, limitations of, 6 Functional Requirements and Numbering of Authority Records (FRANAR), 102 Functional Requirements for Bibliographic Records (FRBR). See FRBR (Functional Requirements for Bibliographic Records) functional requirements of schema, 94, 95 future-proofing, 39–40 G Gecko rendering engine, 165 Generalized Markup Language (GML), 2 Geography Markup Language (GML), 39 GIF images, 176 GML (Generalized Markup Language), 2 GML (Geography Markup Language), 39 “grammar” element in RELAX NG, 59–60 granularity choice of, 12–13 and choice of elements, 11 in displays, 79 in MARC, 115 in XSLT stylesheets, 81
graphics in XML, 176–78 “group” element in RELAX NG, 63 H headings cataloging practice, 117–20 used to collocate entries in index, 111–12 hierarchy in MARC documents, 115 in uniform title headings, 120–21 in XML documents, 7–8 HighWire Press, 97 HTML (Hypertext Markup Language) disadvantages, xii history, 2–3 loss of access to content, 3–4 stylesheets, 77 vs. XHTML, 33–35 vs. XLink, 70–71 vs. XML, 1–2, 26 and XPointer, 74 in XSLT stylesheets, 81 hyphen (-) double hyphens in comments, 20 in element names, 9 I “ID” attribute in XPointer, 73 ILS (integrated library systems). See Integrated library systems (ILS) “image” element in XLink, 72 implementation stage in schema development, 95–96 inclusion of other documents, 174 indexing vs. analytics, 122, 124 indicators in MARC, 110 “information silo” model, 98 infrastructure design, need for, 92 initial articles in MARC and AACR, 134–35 instance, definition, 30 intangible entities in XOBIS, 139 integrated library systems (ILS) and coordination of library information, 99–101 and open content initiatives, 97–98 integrating resources, cataloging of, 103, 122 “interleave” element in RELAX NG, 64 international bibliographic schema, 41, 102 International Standard Bibliographic Description (ISBD), 133
International Standard Book Number (ISBN) and International Standard Serial Number (ISSN), 111 Internet Explorer and browser wars, 165, 166–67 XML support, 35–36 interoperability. See Platform neutrality ISBD (International Standard Bibliographic Description), 133 ISBN (International Standard Book Number) and ISSN (International Standard Serial Number), 111 J JEdit, 148 JPEG images, 176 K Kawa, 163–65 keyboard space (blank) character, 21. See also White space keyword, target, and processing instructions, 26 L Lane Medical Library XML applications, 180–83 XOBIS, 136–44 language codes in MARC, 107–8 languages language-specific presentations, 41 marking of multilingual documents, 16 mixed-language headings, 130 LC (Library of Congress). See Library of Congress (LC) leaf element in XPath, 68 learning, ease of, 44 “level” attribute in RELAX NG, 66 libraries and browser development, 167 contribution to schema development, 91–92 and difficulties with information access, 98 and future of Web, 136–37 information management, 96–97 and open-source software, 146 shared technical infrastructure, 42 trends, 96–101 XML applications for, 36–39 Library of Congress (LC) and AACR, 116–17 MARCXML format, 101, 188 markup languages, 101 line feed and line separator characters, 21
link histories in XLink, 71 linking methods. See also Inclusion of other documents AACR2, 118 MARC, 110, 111, 125 XLink, 70–73 XML, 41 LinkOut, 186–88 location of digital resources in MARC, 112 looping. See “For-each” statements M MARC, 102–16 character set, 115–16 complexity, 113–15 control fields and encoding, 103–5 conversion to XOBIS, 188–90 date coding, 105–7 elements and attributes, 108–10 and MODS coding, 30 nonfiling indicator system, 134 redundancy, 107–8 relationships, 110–13 updating with MARCUTL, 183–84 XML coding by Library of Congress, 101 MARC 21 repertoire, 25, 115 MARC-8 environment. See MARC 21 repertoire MARCUTL transformation language, 182, 183–86 MARCXML format, 101, 188 markup, definition, xii markup languages. See Document structures MathML (Mathematics Markup Language), 39, 167 “maxOccurs” attribute, 56 Medical Subject Headings (MeSH), 132 MEDLINE Document Type Definitions, 101. See also National Library of Medicine (NLM) MeSH (Medical Subject Headings), 132 Metadata Encoding and Transmission Standard (METS), 101 metadata in XML formats, 98–99 Metadata Object Description Schema (MODS) conversion from MARC, 188–89 example, 28–30 form/genre codes, 105 history, 101 METS (Metadata Encoding and Transmission Standard), 101 Microsoft Corporation, 54, 167
Microsoft Word, 180, 181 “minOccurs” attribute, 56 mixed content in elements, 17–18 and parsed character data, 51 and unwanted white space, 22 MML (Music Markup Language), 39 MODS (Metadata Object Description Schema). See Metadata Object Description Schema (MODS) Mozilla browser, 36, 165–67 multilingual documents. See Languages Music Markup Language (MML), 39 N name changes in XOBIS, 140 named character entities. See Characters, nonstandard namespace prefixes, 45, 47 namespaces advantages, 45–48 in DTDs, 53–54 and reuse of data, 13 use of, 32 under XML Schema, 54–55 name-title and author-title entries current cataloging practice, 117–19 effects of, 131–33 as pre-coordination, 112 as subjects, 132 naming of elements guidelines, 9–11 and well-formedness, 26 National Library of Medicine (NLM) markup languages, 101 MeSH, 132 PubMed Central, 97 PubMed LinkOut service, 186–88 navigational methods in XPath, 68–70 in XPointer, 74–75 navigational relationships in XOBIS, 143 nesting of start and end tags, 12, 27 Netscape 7 browser, 35, 165–67 new line character, 22 NLM (National Library of Medicine). See National Library of Medicine (NLM) nodes definition, 7 in XPath, 68–69 nonfiling characters, 15, 134 Notepad as XML text editor, 148
notes in cataloging relationships, 122–25 AACR, 118 MARC, 112 Notional elements in XOBIS, 139, 141 numerals in element names, 10 O OASIS (Organization for the Advancement of Structured Information Standards) standards, 44 on word processing, 180–81 object-oriented principles in XML Schema, 57–58 OeB (Open eBook), 179 “oneOrMore” element in RELAX NG, 62 ONIX for Serials as standard, 39 Open Archives Initiative, 97 open content and open access initiatives, 96–97 Open eBook (OeB), 179 open systems movement, 39 OpenOffice, 180–81 open-source software, 145–46, 171 Oracle XSLT processor, 171 order of elements, validation of, 63–64 Organization for the Advancement of Structured Information Standards (OASIS). See OASIS (Organization for the Advancement of Structured Information Standards) outline structure of XML, 6–8 outsourcing of cataloging, 126 P padding in stylesheets, 89 paragraph separator character, 22 parent elements, 12. See also Container elements parentheses () in DTDs, 53 parsed character data (PCDATA), 50–51 parsers definition, 6 and well-formed documents, 26 patterns in XML Schema, 58–59 PCDATA (parsed character data), 50–51 PDF files and XSL FO, 85 period (dot) in element names, 9 in XPath, 70 physical entities in XOBIS, 139 platform neutrality, 38–39 PLoS (Public Library of Science), 97 plus sign (+) in DTDs, 52 post-coordination, 132–33
pound sign (#) in XPointer, 73 pre-coordination of elements in MARC records, 112 name-title entries, 131–33 in RELAX NG, 63 in subject cataloging, 132–33 prefixes namespace prefixes, 45 and URIs, 32 preservation libraries’ role in, 136–37 and XML neutrality, 40 “preserve” attribute, 16 Principal Elements in XOBIS, 138–42 processing instructions CSS, 85–86 definition, 6, 26 “for-each” statement in XSLT, 79, 82 prolog in XML structure, 7, 23, 26 ProML (Protein Markup Language), 39 proprietary formats and data longevity, 40 Protein Markup Language (ProML), 39 pseudonyms in XOBIS, 140 PUBLIC DTD, 50 Public Library of Science (PLoS), 97 PubMed. See National Library of Medicine (NLM) punctuation. See also specific punctuation marks, e.g., Hyphen attributes, 14 conditional punctuation in MARC, 110 in content, 12 in element names, 9 ISBD, 133 reserved characters, alternatives for, 18 Q Qexo, 163–65 “qualified” attribute and namespaces, 55 qualifiers in authority records, 127 in descriptive cataloging, 131–32 in XOBIS, 143 query function and XML transformers, 159 XQuery, 163–65 question mark (?) in DTDs, 52 in processing instructions, 26 in XML declaration tag, 23 quotes, double (“”) or single (‘’) in attributes, 14 in text, 15
R RDF (Resource Description Framework), 169 RecordList element in XOBIS, 138 recursion in XOBIS, 142, 144 redundancy in library information, 99 in MARC records, 107–8 referencing in XPath, 68–70 in XPointer, 73 reformatting of data. See Repurposing of content Regular Language for XML Next Generation (RELAX NG) Schema. See RELAX NG (Regular Language for XML Next Generation) Schema related works cataloging of, 123–24 in XOBIS, 139–40 relationships in AACR, 116–17 added entries, 125 in alphabetical lists, 118 in authority records, 112 in cataloging, 120–25 between element groups, 94–95 in MARC, 110–13 series as, 119 in XML schemas, 112–13 in XOBIS, 139 relationships element in XOBIS, 142–43 relator codes in MARC, 112 relator terms, 110, 112 RELAX NG (Regular Language for XML Next Generation) Schema, 59–67 advantages, 59 and data validation, 49 reuse of schema components, 58 vs. XML Schema, 56–57 and XOBIS, 144 repurposing of content, 40–41 marked-up information, 4 punctuation, 12 sharing of information, 38 standards, 41 stylesheets, 33 XML design, 12–13 reserved characters, alternatives for, 18. See also Entity references Resource Description Framework (RDF), 169 retrieval of documents in schema development, 96
and semantics in XML, 4 string searches with XPointer, 75–76 reuse of information. See Repurposing of content review of related schemas, 94 “role” attribute in XOBIS, 142 root element definition, 7, 8 namespace attribute in, 32 naming of, 11 and well-formedness, 26 in XOBIS, 138 S SAX (Simple API for XML), 44 Saxon transformation engine, 159–61 Scalable Vector Graphics (SVG) format, 167, 168, 173, 176–78 schema development, 93–96 schema languages choice of, 48–49 and librarians, 92 “schemaLocation” attribute, 55 schemas, definition. See Document structures scholarly journal publishing, 97 Scholarly Publishing and Academic Resources Coalition (SPARC), 97 semantic markup, 4, 5 semicolon (;) in entity references, 18–20 “Separation of Concerns” (SoC) design, 161–62 sequences, definition, 52 serializer in Cocoon, 162 serials cataloging of, 119–20 linking entries for, 118 varying titles, 127–28 vs. series, 121 serials lists, 180–83 series in AACR, 119–20 cataloging of, 121–22 relationships in, 111 as uniform titles, 121 Server Side Includes, 174 server-side display, 76 SGML (Standard Generalized Markup Language), 2 “show” attribute in XLink, 71–72 sibling elements, definition, 12 Simple API for XML (SAX), 44 sitemap in Cocoon, 162 “skip” attribute, 15
slash (/) in empty elements, 17 in end tags, 5 in XPath, 69 SoC (“Separation of Concerns”) design, 161–62 sound implementations, 179–80 spaces ( ). See also Blank (keyboard space) character; White space and attributes, 14 in element names, 10 formatting of, 22 SPARC (Scholarly Publishing and Academic Resources Coalition), 97 square brackets ([ ]) in CDATA, 21 and DOCTYPE, 31 “standalone” attribute, 24, 50 Standard Generalized Markup Language (SGML), 2 standard numbers and relationships to other works, 111 standards, 44–90 adoption of XML-based, 39 Annotations, 167, 169–70 in browsers, 166–67 MathML, 167 and schema development, 92 word processors, 180–81 “start” element in RELAX NG, 60 statements of responsibility in contents notes, 123 in edition and series fields, 119 in title field, 118 string searches with XPointer, 75–76 structure of documents. See Document structures stylesheets, 75–90 CSS stylesheets, 85–89 and processing instructions, 26 and punctuation, 134 in schema development, 95 use of, 33 and XML text editors, 147–48 XSL FO stylesheets, 82–85 XSLT, 77–82 subject cataloging, 132–33 subordinate bodies, 127, 130–31 Substantive Elements in XOBIS, 139–41, 142 “substitute” attribute in XOBIS, 142–43 successive entry in serial cataloging, 128 SVG (Scalable Vector Graphics) format, 167, 168, 173, 176–78
syntax of XML, 3–4, 5–6 SYSTEM DTD, definition, 50 T tab character, 22 tags in HTML, 1–2 and well-formedness, 26 in XML, 3, 5 technical infrastructure, shared, 42 technical manuals, applications for, 179 TEI (Text Encoding Initiative), 27–29, 179 template element in XSL FO, 84 testing of schemas, 95 text editors. See XML text editors Text Encoding Initiative (TEI), 27–29, 179 text-centric documents. See also Documentcentric XML standards choice of text editor, 156 definition, 6 marked-up example, 27–28 mixed content in, 17–18 root elements in, 11 TIFF images, 176 titles cataloging of, 118–19 in contents notes, 123 transcription of, 125–26 tools, choice of, 95. See also Transformers; XML browsers; XML text editors transformers, 159–65 Cocoon and Xalan, 161–63 Kawa and Qexo, 163–65 MARCUTL, 182, 183–86 Saxon, 159–61 translations, linking entries for, 124–25 tree structure in XML, 7–8 trends, 96–101, 173–81 “type” attribute use of, 15–16 in XOBIS, 143 U UCS (Universal Character Set), 25, 41–42. See also Unicode underscore (_) in element names, 9 Unicode numeric character entities for, 19–20 and standards, 41–42 support in text editors, 147 vs. MARC 21 repertoire, 115 XML designated character set, 22–23
Unicode Transformation Format 8-bit (UTF-8), 24–26 Uniform Resource Identifier (URI) MARC coding for, 112 and namespaces, 32, 46–47 and relationship to other works, 111 and XPointer, 73 uniform title headings, 120–21, 128–29 Universal Character Set (UCS), 25, 41–42. See also Unicode URI (Uniform Resource Identifier). See Uniform Resource Identifier (URI) “use” attributes, 56 UTF-8 (Unicode Transformation Format 8-bit), 24–26 V validation definition of, 30–31 in RELAX NG, 59 and schema development, 93 and standards, 48–67 using schema languages, 31–32 Varia element in XOBIS, 142–43 Version element in XOBIS, 139–40 vertical bar (|) as Boolean operator in DTDs, 53 Visual Resources Association, 102 visually impaired users. See Sound implementations VoiceXML, 39, 173, 179–80 W W3C (World Wide Web Consortium). See World Wide Web Consortium (W3C) Web inconsistency of description, 127 information management on, 37 web browsers. See also Internet Explorer Amaya browser, 165, 167–70 and BitFlux Editor, 153–55 choice of, 166 Mozilla, 35, 165–67 Netscape 7 browser, 35, 165–67 and XML support, 35–36 well-formedness, 26–27, 147 white space, 16, 21–22. See also Blank (keyboard space) character word-processing programs, 180–81 World Wide Web Consortium (W3C) history, 3 and standards, 44 and XML Schema, 54
X Xalan stylesheet engine, 161–63 XForm (Nanoworks), 174–75 XForms (W3C), 173, 174–76 XHTML and browsers, 165 empty elements, 17 history, 3 stylesheets, 77 vs. HTML, 33–35 XInclude, 173, 174 XLink (XML Linking Language), 70–73 and linking, 41 uses of, 67 vs. XInclude, 174 as XML tool, 36 XML applications. See Applications; Document structures XML browsers, 165–70. See also Web browsers XML Database Application Programming Interface (XMLDB API), 189 XML declaration, 23–26 XML eXchaNGer (XNGR), 156–58 XML (Extensible Markup Language) advantages, xi–xiii definition, 1–2 and difficulties with information access, 98–99 features, 4–27 generic aspects of, 39–42 history, 3 and problems with digital resources, 37 summary of types, 27–30 XML grammars. See Document structures XML Linking Language (XLink). See XLink (XML Linking Language) XML Organic Bibliographic Information Schema (XOBIS). See XOBIS (XML Organic Bibliographic Information Schema) XML Schema, 49, 54–59 XML Stylesheet Language Transformations (XSLT). See XSLT (XML Stylesheet Language Transformations) XML syntax, 3–4, 5–6 XML text editors, 146–58 Amaya as, 170 BitFlux Editor, 153–56 choice of, 156 encoding settings in, 24 features of, 147–48 JEdit, 148–49 proprietary editors, 171
and well-formedness, 26 XML Spy, 171 XML XNGR, 147, 156–58 XMLMind Editor, 179 XMLOperator, 150–53 XNGR, 156–58 XML4LIB electronic discussion list, 96 XMLDB API (XML Database Application Programming Interface), 189 “xml:lang” attribute, 16, 41 XMLMARC program, 184–86 XMLMind Editor, 179 “xmlns” attributes, 45 XMLOperator, 150–53 “xml:space” attribute, 16 “xml-stylesheet” keyword, 26 “xml:stylesheet” tag in XSL FO, 82 XNGR (XML eXchaNGer), 156–58 XOBIS (XML Organic Bibliographic Information Schema), 136–44 background, 136–37 conversion from MARC, 188–90 as namespace, 47–48 overview, 137–44 XPath and linking, 41 uses of, 67, 68–70
vs. XQuery, 163 as XML tool, 36 XPointer and linking, 41 use of, 67–68, 73–76 as XML tool, 36 XQuery and Qexo, 163–65 as XML tool, 36 XSL FO (Extensible Stylesheet Language Formatting Objects), 77, 82–85 XSLT processor (Oracle), 171 XSLT (XML Stylesheet Language Transformations), 77–82 editors for, 159 mapping from interim schemas, 94 and stylesheets, 33 vs. XQuery, 164–65 “xsl:template” tag, 80 Z Z39.47-1985, 24 Z39.50 connectivity, 100 “ZeroOrMore” element in RELAX NG, 62, 64–65
Dick R. Miller is the head of technical services at the Lane Medical Library at the Stanford University Medical Center. His extensive information systems experience led him to promote using XML in libraries, notably in “XML: Libraries’ Strategic Opportunity,” published in the summer 2000 issue of Library Journal NetConnect. He also led the development of XOBIS, an experimental schema for bibliographic and authority information, and he has advocated an XML replacement for MARC. Miller was formerly an associate librarian at the Northeastern Ohio Universities College of Medicine. He earned his M.L.S. degree from the University of Oklahoma. Kevin S. Clarke is a digital information systems developer at the Lane Medical Library at the Stanford University Medical Center. He was formerly a digital information systems programmer at Lane and a cataloging assistant at the University of North Carolina at Chapel Hill. He was a co-presenter on “XML for Librarians,” a continuing education course at the Medical Library Association meeting in Florida in 2001, and he wrote “Updating MARC Records with XMLMARC” in XML in Libraries (2002). He received an M.S.I.S. degree from the University of North Carolina at Chapel Hill.