01 543555 Ch01.qxd 11/5/03 9:40 AM Page 1

1 Xerces

XML parsing is the foundational building block for every other tool we’ll be looking at in this book. You can’t use Xalan, the XSLT engine, without an XML parser because the XSLT stylesheets are XML documents. The same is true for FOP and its input XSL-FO, Batik and SVG, and all the other Apache XML tools.
❑ Java API for XML Processing (JAXP) 1.2
❑ XML Schema 1.0 (Structures and Datatypes)

The current release of Xerces (2.4.0) also has experimental support for:

❑ XML 1.1 Candidate Recommendation
❑ XML Namespaces 1.1 Candidate Recommendation
❑ DOM Level 3 (Core, Load/Save)

A word about experimental functionality: one of the goals of the Xerces project is to provide feedback to the various standards bodies regarding specifications that are under development.
process of defining XML 1.1. When they have finished their work, you will be able to supply 1.1 in addition to 1.0 for the version number. If there is no encoding declaration, then the document must be encoded using UTF-8 or UTF-16. If you omit the encoding declaration for a document in any other encoding, or specify an incorrect encoding declaration, your XML parser will usually report a fatal error. We’ll have more to say about fatal errors later in the chapter.
default namespace. There is no way to get one "automatically." If you don’t define a default namespace, and then you write an unprefixed element or attribute, that element or attribute is in no namespace at all. Namespace prefixes can be declared on any element in a document, not just the root element. This includes changing the default namespace.
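To see these scoping rules in action, here is a minimal sketch that uses the JAXP DocumentBuilder bundled with the JDK rather than the Xerces classes directly. The document, the namespace URI, and the class name NamespaceScopeDemo are made up for illustration. The default namespace is declared on book and then cancelled on note with xmlns="".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Reports which namespace an element ends up in (illustrative sketch). */
class NamespaceScopeDemo {

    // Hypothetical namespace URI, used only for this example.
    static final String BOOK_NS = "http://example.invalid/book";

    // <book> declares a default namespace; <note> changes it back to "no namespace".
    static final String XML =
        "<book xmlns='" + BOOK_NS + "'>"
      + "<title>XML</title>"
      + "<note xmlns=''>plain</note>"
      + "</book>";

    /** Returns the namespace URI of the first element with this local name, or null. */
    static String namespaceOf(String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP factories are not namespace aware by default
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(XML)));
            // "*" matches elements in any namespace, including no namespace.
            Node element = doc.getElementsByTagNameNS("*", localName).item(0);
            return element.getNamespaceURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("title -> " + namespaceOf("title"));
        System.out.println("note  -> " + namespaceOf("note"));
    }
}
```

The title element inherits the default namespace from its parent, while note, which redeclares the default namespace as empty, reports no namespace at all.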
DTDs

The XML 1.0 specification describes a grammar using a document type definition (DTD). The language for writing a DTD is taken from SGML and doesn’t look anything like XML. DTDs can’t deal with namespaces and don’t allow you to say anything about the data between a start and end tag.
There’s a separate mechanism for associating a schema with a document that has no namespace (xsi:noNamespaceSchemaLocation). For completeness, here’s the XML Schema document that describes book.xml.

A parser API makes the various parts of an XML document available to your application. You’ll be seeing the SAX and DOM APIs in most of the other Apache XML tools, so it’s worth a brief review to make sure you’ll be comfortable during the rest of the book. Let’s look at a simple application to illustrate the use of the parser APIs. The application uses a parser API to parse the XML book description and turn it into a JavaBean that represents a book.
    /*
     *
     * SAXMain.java
     *
     * Example from "Professional XML Development with Apache Tools"
     *
     */
    package com.sauria.apachexml.ch1;

    import java.io.IOException;

    import org.apache.xerces.parsers.SAXParser;
    import org.xml.sax.SAXException;
    import org.xml.sax.
     * Example from "Professional XML Development with Apache Tools"
     *
     */
    package com.sauria.apachexml.ch1;

    import java.util.Stack;

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXParseException;
    import org.xml.sax.helpers.
The startElement callback basically sets things up for new data to be collected each time it sees a new element. It creates a new currentText StringBuffer for collecting this element’s text content and pushes it onto the textStack. It also pushes the element’s name on the elementStack for placekeeping. This method must also do some processing of the attributes attached to the element, because the attributes aren’t available to the endElement callback.
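The bookkeeping described here can be sketched in a few lines. This is not the book's BookHandler: it is a reduced handler, assuming a JAXP SAXParserFactory from the JDK, that keeps only a text stack (a Deque of StringBuilder instead of Stack and StringBuffer) and stores results in a map rather than a Book bean. The class name TextStackHandler and the "element/@attribute" key format are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Collects element text and attributes with a per-element text stack (sketch). */
class TextStackHandler extends DefaultHandler {

    private final Deque<StringBuilder> textStack = new ArrayDeque<>();
    final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A fresh buffer for this element's own text content.
        textStack.push(new StringBuilder());
        // Attributes are only delivered here, so they must be handled now.
        for (int i = 0; i < atts.getLength(); i++) {
            values.put(localName + "/@" + atts.getLocalName(i), atts.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so always append.
        textStack.peek().append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = textStack.pop().toString().trim();
        if (!text.isEmpty()) {
            values.put(localName, text);
        }
    }

    /** Parses the given XML text and returns the collected values. */
    static Map<String, String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for localName to be filled in
            XMLReader reader = factory.newSAXParser().getXMLReader();
            TextStackHandler handler = new TextStackHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.values;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect(
            "<book version='1.0'><title>XML</title><year>2003</year></book>"));
    }
}
```

Pushing a new buffer in startElement and popping it in endElement is what keeps the text of nested elements from running together.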
    public void characters(char[] ch, int start, int length)
        throws SAXException {
        currentText.append(ch, start, length);
    }

The remainder of BookHandler implements the three public methods of the ErrorHandler callback interface, which controls how errors are reported by the application. In this case, you’re just printing an extended error message to System.out.
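As a rough illustration of the three ErrorHandler methods, the sketch below records messages in a list instead of printing them, which makes the behavior easy to inspect. It assumes the JAXP parser in the JDK; the class name CollectingErrorHandler and the message format are made up for this example and are not taken from BookHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

/** Records every warning, error, and fatal error with its location (sketch). */
class CollectingErrorHandler implements ErrorHandler {

    final List<String> messages = new ArrayList<>();

    private void record(String level, SAXParseException e) {
        messages.add(level + " at line " + e.getLineNumber()
            + ", column " + e.getColumnNumber() + ": " + e.getMessage());
    }

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        record("fatal", e);
        throw e; // a fatal error ends the parse
    }

    /** Parses the XML text and returns whatever the handler recorded. */
    static List<String> check(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already recorded by fatalError; nothing more to do.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        check("<a><b></a>").forEach(System.out::println);
    }
}
```

A well-formed document produces an empty list; a document with a mismatched tag produces a single entry that starts with "fatal".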
DOM

Let’s look at how you can accomplish the same task using the DOM API. The DOM API is a tree-based API. The parser provides the application with a tree-structured object graph, which the application can then traverse to extract the data from the parsed XML document. This process is more convenient than using SAX, but you pay a price in performance because the parser creates a DOM tree whether you’re going to use it or not.
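A small sketch of the tree-based style, again assuming the JAXP DocumentBuilder from the JDK instead of the Xerces DOMParser class. It walks the children of the root element and copies each child element's text into a map. The namespace URI, the sample document, and the class name DomWalkDemo are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Extracts child element text by traversing a DOM tree (sketch). */
class DomWalkDemo {

    // Hypothetical namespace URI for the example document.
    static final String BOOK_NS = "http://example.invalid/book";

    /** Returns local name -> text for each element child of the first book element. */
    static Map<String, String> fields(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList books = doc.getElementsByTagNameNS(BOOK_NS, "book");
            if (books.getLength() == 0) {
                return result; // no book element in the expected namespace
            }
            Element book = (Element) books.item(0);
            NodeList children = book.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace text nodes, comments, and so on.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    result.put(child.getLocalName(), child.getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fields(
            "<book xmlns='" + BOOK_NS + "'>\n  <title>XML</title>\n  <year>2003</year>\n</book>"));
    }
}
```

Compared with the SAX version, no state has to be carried between callbacks, because the whole document is already in memory when the traversal starts.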
        } catch (IOException ioe) {
            System.out.println("I/O Error during parsing " +
                ioe.getMessage());
            ioe.printStackTrace();
        }
    }

The dom2Book function creates the Book object:

    private static Book dom2Book(Document d) throws SAXException {
        NodeList nl = d.getElementsByTagNameNS(bookNS, "book");
        Element bookElt = null;
        Book book = null;
        try {
            if (nl.
    }
        book.setAuthor(text);
    } else if (e.getTagName().equals("isbn")) {
        book.setIsbn(text);
    } else if (e.getTagName().equals("month")) {
        book.setMonth(text);
    } else if (e.getTagName().equals("year")) {
        int y = 0;
        try {
            y = Integer.parseInt(text);
        } catch (NumberFormatException nfe) {
            throw new SAXException("Year must be a number");
        }
        book.setYear(y);
    } else if (e.
❑ data: A directory containing sample XML files.
❑ docs: A directory containing all the documentation.
❑ Readme.html: The jump-off point for the Xerces documentation; open it with your Web browser.
❑ samples: A directory containing the source code for the samples.
❑ xercesImpl.jar: A jar file containing the parser implementation.
❑ xercesSamples.jar: A jar file containing the sample applications.
❑ xml-apis.
Xerces uses the SAX features and properties mechanism to control all configuration settings. This is true whether you’re using Xerces as a SAX parser or as a DOM parser. The class org.apache.xerces.parsers.DOMParser provides the methods setFeature, getFeature, setProperty, and getProperty, which are also available on the interface org.xml.sax.XMLReader. These methods all accept a String as the name of the feature or property.
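The feature mechanism is easy to try through the standard XMLReader interface. The following sketch assumes the parser that ships with the JDK, obtained through SAXParserFactory, and uses only the two core SAX feature URIs; the class name FeatureDemo and the deliberately bogus feature URI are invented for illustration.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

/** Sets and queries parser features by their URI names (sketch). */
class FeatureDemo {

    // Core SAX2 feature names.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String VALIDATION = "http://xml.org/sax/features/validation";

    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Sets a feature and reads it back; the name is just a String. */
    static boolean setAndGet(String feature, boolean value) {
        try {
            XMLReader reader = newReader();
            reader.setFeature(feature, value);
            return reader.getFeature(feature);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns false if the parser does not know or cannot report the feature. */
    static boolean recognizes(String feature) {
        try {
            newReader().getFeature(feature);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("namespaces = " + setAndGet(NAMESPACES, true));
        System.out.println("validation = " + setAndGet(VALIDATION, false));
        System.out.println("bogus recognized? "
            + recognizes("http://example.invalid/features/no-such-feature"));
    }
}
```

Because feature names are plain strings, a misspelled name is only caught at run time, as a SAXNotRecognizedException. Keeping the names in constants, as here, is one way to limit that risk.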
Error-Reporting Features

The next set of features controls the kinds of errors that Xerces reports. The feature http://apache.org/xml/features/warn-on-duplicate-entitydef generates a warning if an entity definition is duplicated. When validation is turned on, http://apache.org/xml/features/validation/warn-on-duplicate-attdef causes Xerces to generate a warning if an attribute declaration is repeated. Similarly, http://apache.
Xerces uses the SAX ErrorHandler interface to handle errors while parsing using the DOM API. You can register your own ErrorHandler and customize your error reporting, just as with SAX. However, you may want to access the DOM node that was under construction when the error condition occurred. To do this, you can use the http://apache.
Deferred DOM

One of the primary difficulties with using the DOM API is performance. This issue manifests itself in a number of ways. The DOM’s representation of an XML document is very detailed and involves a lot of objects. This has a big impact on performance because of the time it takes to create all those objects, and because of the amount of memory those objects use.
Here’s the SAXMain program, enhanced to perform schema validation:

    /*
     *
     * SchemaValidateMain.java
     *
     * Example from "Professional XML Development with Apache Tools"
     *
     */
    package com.sauria.apachexml.ch1;

    import java.io.
                e.printStackTrace();
            }
        }
    }

Additional Schema Checking

The feature http://apache.org/xml/features/validation/schema-full-checking turns on additional checking for schema documents. This doesn’t affect documents using the schema but does more thorough checking of the schema document itself, in particular unique particle attribution constraint checking and particle derivation restriction checking.
used a different or buggy version of the schema you’re using. Worse, the author of the incoming document may intentionally specify a different version of the schema in an attempt to subvert your application. The second reason you may choose to ignore these hints is that you might want to provide a local copy of the schema so the validator doesn’t have to perform a network fetch of the schema document every time it has to validate a document.
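One way to experiment with an application-supplied schema is the JAXP validation API (javax.xml.validation) in the JDK. Note that this is a different mechanism from the Xerces properties discussed in this section; it is used here only because it is self-contained. The inline schema, the sample documents, and the class name LocalSchemaValidator are all made up for the sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Validates documents against a schema chosen by the application (sketch). */
class LocalSchemaValidator {

    // Hypothetical schema held locally; nothing is fetched over the network.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='book'>"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name='title' type='xs:string'/>"
      + "<xs:element name='year' type='xs:int'/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    /** Compiles the local schema once; a broken schema is a programming error. */
    static Schema loadSchema() {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("local schema is invalid", e);
        }
    }

    /** Returns true if the document is valid according to the local schema. */
    static boolean isValid(String xml) {
        Schema schema = loadSchema();
        try {
            // The default error handling throws on the first validation error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<book><title>XML</title><year>2003</year></book>"));
        System.out.println(isValid("<book><title>XML</title><year>soon</year></book>"));
    }
}
```

A real application would compile the Schema object once and reuse it across documents; loadSchema is called per document here only to keep the sketch short.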
    import java.io.IOException;

    import org.apache.xerces.parsers.SAXParser;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXNotRecognizedException;
    import org.xml.sax.SAXNotSupportedException;
    import org.xml.sax.XMLReader;

    public class PassiveSchemaCache {

        public static void main(String[] args) {
            System.setProperty(
                "org.apache.xerces.xni.parser.XMLParserConfiguration",
                "org.apache.xerces.parsers.
            }
        }
    }

Although passive caching is easy to use, it has one major drawback: You can’t specify which grammars Xerces can cache. When you’re using passive caching, Xerces happily caches any grammar it finds in any document.
    static final String GRAMMAR_POOL =
        Constants.XERCES_PROPERTY_PREFIX +
        Constants.
Now you’re ready to associate a grammar pool with the preparser. This is done using the preparser’s setProperty method and supplying the appropriate values (line 45). XMLGrammarPreparser provides a feature/property API like the regular SAX and DOM parsers in Xerces.
    }

    try {
        if (reader == null)
            reader = new SAXParser(parserConfiguration);

Something else is going on here: each instance of ActiveCache has a single SAXParser instance associated with it. You create an instance of SAXParser only if one doesn’t already exist. This cuts down on the overhead of setting up and tearing down parser instances all the time. One other detail.
If you’re using the grammar-caching mechanism to cache DTDs, be aware that it can only cache external DTD subsets (DTDs in an external file). In addition, any definitions in an internal DTD subset (DTD within the document) will be ignored.

Entity Handling

Earlier in the chapter we mentioned that we’d be looking at a mechanism that can do the same job as the Xerces properties for xsi:schemaLocation and xsi:noNamespaceSchemaLocation.
    r.setContentHandler(bookHandler);
    r.setErrorHandler(bookHandler);
    EntityResolver bookResolver = new BookResolver();
    r.setEntityResolver(bookResolver);

The EntityResolver interface originated in SAX, but it’s also used by the Xerces DOM parser and by the JAXP DocumentBuilder. All you need to do to make it work is create an instance of a class that implements the org.xml.sax.
        public InputSource resolveEntity(String publicId,
                                         String systemId)
            throws SAXException, IOException {
            if (systemId.equals(schemaURI)) {
                FileReader r = new FileReader("book.xsd");
                return new InputSource(r);
            } else
                return null;
        }
    }

The general flow of a resolveEntity method is to look at the publicId and/or systemId arguments and decide what you want to do.
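The same flow can be tried without any files on disk. The sketch below is not the book's BookResolver: it resolves an external DTD rather than a schema, serves it from a string, and records every system ID it is asked about. It assumes the JAXP parser in the JDK; the URI, the DTD text, and the class name LocalDtdResolver are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/** Serves a known external DTD from a local copy (sketch). */
class LocalDtdResolver implements EntityResolver {

    // Hypothetical system ID and DTD content, for illustration only.
    static final String DTD_URI = "http://example.invalid/book.dtd";
    static final String LOCAL_DTD = "<!ELEMENT book (#PCDATA)>";

    final List<String> requested = new ArrayList<>();

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        requested.add(systemId);
        if (DTD_URI.equals(systemId)) {
            // Hand the parser our own copy instead of letting it fetch the URI.
            InputSource local = new InputSource(new StringReader(LOCAL_DTD));
            local.setSystemId(systemId);
            return local;
        }
        return null; // null means "resolve this one the normal way"
    }

    /** Parses the XML text and returns the system IDs the parser asked for. */
    static List<String> parseWithResolver(String xml) {
        try {
            LocalDtdResolver resolver = new LocalDtdResolver();
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(resolver);
            reader.parse(new InputSource(new StringReader(xml)));
            return resolver.requested;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseWithResolver(
            "<!DOCTYPE book SYSTEM \"" + DTD_URI + "\"><book>text</book>"));
    }
}
```

Comparing with DTD_URI.equals(systemId), rather than systemId.equals(...), is a small defensive choice in this sketch so that a null system ID cannot cause a NullPointerException.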
Xerces provides two features that cause startEntity and endEntity to report the beginning and end of these two classes of entity references. The feature http://apache.org/xml/features/scanner/notify-builtin-refs causes startEntity and endEntity to report the start and end of one of the built-in entities, and the feature http://apache.
If you look closely at the diagram, you see that the part of the DOM tree for element c has been omitted. Here’s the rest of it, starting at the Element node for c.

[Figure: DOM subtree rooted at Element c. Node labels, in extracted order: Element c; Text "[lf]text in c but"; Entity Reference; Text "[lf]"; Element d; Text "[lf]"; Text "insert this here".]

Xerces created an EntityReference node as a child of the Element node (and in the correct order among its siblings).
Serialization

Most of the classes included with Xerces focus on taking XML documents, extracting information out of them, and passing that information on to your application via an API. Xerces also includes some classes that help you with the reverse process: taking data you already have and turning it into XML. This process is called serialization (not to be confused with Java serialization).
Using the serializer classes is fairly straightforward. The serialization classes live in the package org.apache.xml.serialize. All the serializers are constructed with two arguments: The first argument is an OutputStream or Writer that is the destination for the output, and the second argument is an OutputFormat object that controls the details of how the serializer formats its input.
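The org.apache.xml.serialize classes need the Xerces jars on the classpath. If you only want to see a DOM tree turned back into text, the sketch below substitutes a different technique that is available in the JDK: a JAXP identity Transformer, with output properties playing roughly the role that OutputFormat plays for the Xerces serializers. The document content and the class name SerializeDemo are invented for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Builds a small DOM tree and writes it out as XML text (sketch). */
class SerializeDemo {

    /** Creates a hypothetical book document in memory. */
    static Document buildBook() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                                                 .newDocumentBuilder()
                                                 .newDocument();
            Element book = doc.createElement("book");
            book.setAttribute("version", "1.0");
            Element title = doc.createElement("title");
            title.setTextContent("XML & Java"); // the ampersand must be escaped on output
            book.appendChild(title);
            doc.appendChild(book);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the document with an identity transform. */
    static String serialize(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildBook()));
    }
}
```

As with the Xerces serializers, the destination (here a StringWriter) and the formatting choices are supplied separately from the data being written.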
                snse.printStackTrace();
            }
            try {
                r.parse(args[0]);
            } catch (IOException ioe) {
                ioe.printStackTrace();
            } catch (SAXException se) {
                se.printStackTrace();
            }
        }
    }

Note that you set up the serializer (in this case, an XMLSerializer) and then plug it into the XMLReader as the callback handler for ContentHandler, DTDHandler, DeclHandler, and LexicalHandler.
    }
        ioe.printStackTrace();
    }

    Document d = p.getDocument();
    OutputFormat format =
        new OutputFormat(Method.XML,"UTF-8",true);
    format.setPreserveSpace(true);
    XMLSerializer serializer =
        new XMLSerializer(System.out, format);
    try {
        serializer.serialize(d);
    } catch (IOException ioe) {
        ioe.
The following set of methods deals with the DOCTYPE declaration:

Method                                               Description
String getDoctypePublic()                            Gets the public ID of the current DOCTYPE.
String getDoctypeSystem()                            Gets the system ID of the current DOCTYPE.
void setDoctype(String publicId, String systemId)    Sets the public ID and system ID of the current DOCTYPE.
XNI Basics

An XNI-based parser contains two pipelines that do all the work: the document pipeline and the DTD pipeline. The pipelines consist of instances of XMLComponent that are chained together via interfaces that represent the streaming information set. Unlike SAX, which has a single pipeline, XNI divides the pipeline in two: one pipeline for the content of the document and a separate pipeline for dealing with the information in the DTD.
XMLString, XNIException, Augmentations, QName, XMLAttributes, XMLLocator, XMLResourceIdentifier, and NamespaceContext are all used by one of the four major interfaces (XMLDocumentHandler, XMLDocumentFragmentHandler, XMLDTDHandler, and XMLDTDContentModelHandler). If you look at the XMLComponent interface, you’ll see that it really just defines methods for setting configuration settings on a component.
supports a particular feature or property. XMLParserConfiguration adds APIs that let you do several categories of tasks:

❑ Configuration: This API provides methods to tell configuration clients the set of supported features and properties. It also adds methods for changing the values of features and properties.
Document Scanner

The document scanner knows how to take an XML document and fire the callbacks for elements (and attributes), characters, and anything else you might encounter in an XML document. This is the workhorse component for any XNI application that is going to work with an XML document. Applications that just work with the DTD or schema may end up not using this class. The document scanner is implemented by the class org.apache.xerces.impl.
Error Reporter

The parser configuration needs a single mechanism that all components can use to report errors. The Xerces2 error reporter provides a single point for all components to report errors. It also provides some support for localizing the error messages and calling the XNI XMLErrorHandler callback. Localization works as follows. Each component is given a domain designated by a URI. The component then implements the org.apache.xerces.util.
If you’re working with SAX, the first place to go is to the SAX Counter sample. This sample parses your document and prints some statistics based on what it finds. To invoke Counter, type

    java sax.Counter

There are command-line options to turn on and off namespace processing, validation, and schema validation, and to turn on full checking of the schema document.
that converts XNI events into the SAX events that Jing already understands. This wrapped version of Jing is then inserted into the appropriate spot in the XNI pipeline within an XMLParserConfiguration called JingConfiguration. For ease of use, Andy has again provided convenience classes that work just like the Xerces SAX and DOM parser classes. For a Relax-NG-aware SAX parser, use org.cyberneko.relaxng.parsers.SAXParser; for a DOM parser, use org.
     * Example from "Professional XML Development with Apache Tools"
     *
     */
    package com.sauria.apachexml.ch1;

    import java.io.IOException;
    import java.util.Stack;

    import org.apache.xerces.xni.XMLAttributes;
    import org.apache.xerces.xni.XNIException;
    import org.apache.xerces.xni.parser.XMLInputSource;
    import org.cyberneko.pull.
    XMLEvent evt;
    while ((evt = pullParser.nextEvent()) != null) {
        switch (evt.type) {
            case XMLEvent.ELEMENT :
                ElementEvent eltEvt = (ElementEvent) evt;
                if (eltEvt.start) {
                    textStack.push(new StringBuffer());
                    String localPart = eltEvt.element.localpart;
                    if (localPart.equals("book")) {
                        XMLAttributes attrs = eltEvt.attributes;
                        String version =
                            attrs.getValue(null, "version");
                        if (version.
                    } else if (localPart.equals("publisher")) {
                        book.setPublisher(text);
                    } else if (localPart.equals("address")) {
                        book.setAddress(text);
                    }

When you see a CharactersEvent, you’re appending the characters in the event to the text you’re keeping for this element:

                }
            }
            break;
        case XMLEvent.
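The listing above is cut off, so here is a separate, complete sketch of the same pull-style bookkeeping. It does not use the NekoPull classes: it substitutes the StAX XMLStreamReader from javax.xml.stream in the JDK, which is a different pull API with different event names. The text-stack idea carries over unchanged. The class name StaxBookDemo and the sample document are made up.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

/** Pulls events from the parser and gathers element text (sketch). */
class StaxBookDemo {

    /** Returns local name -> text for every element that has text of its own. */
    static Map<String, String> pull(String xml) {
        Map<String, String> fields = new LinkedHashMap<>();
        Deque<StringBuilder> textStack = new ArrayDeque<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            // The application drives the loop; the parser never calls back.
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        textStack.push(new StringBuilder());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        // Append to the buffer of the element currently open.
                        if (!textStack.isEmpty()) {
                            textStack.peek().append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT: {
                        String text = textStack.pop().toString().trim();
                        if (!text.isEmpty()) {
                            fields.put(reader.getLocalName(), text);
                        }
                        break;
                    }
                    default:
                        break; // comments, processing instructions, and so on
                }
            }
            reader.close();
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pull(
            "<book><publisher>Wrox</publisher><address>Indianapolis</address></book>"));
    }
}
```

The difference from SAX is who owns the loop: here the application asks for the next event when it is ready, so the parsing logic can live in ordinary methods instead of callbacks.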
document. This saves the overhead of creating all the internal data structures for each document. When you combine this with grammar caching, you can get some nice improvements in performance relative to creating a parser instance over and over again.

Common Problems

This section addresses some common problems that people encounter when they use Xerces.
❑ Mismatched encoding declaration: The character encoding used in a file and the encoding name specified in the encoding declaration must match. The encoding declaration is the encoding="name" that appears in the XML declaration, <?xml version="1.0" encoding="name"?>, at the start of an XML document. If the encoding of the file and the declared encoding don’t match, you may see errors about invalid characters.
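The mismatch is easy to reproduce. The sketch below, which assumes the JAXP parser in the JDK, writes the same small document with different combinations of declared and actual encoding and reports whether it parses. The class name EncodingMismatchDemo and the sample text are invented. Depending on the parser, the failure may surface as a SAXException or as an IOException, so the sketch treats both as "did not parse".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Shows what happens when declared and actual encodings disagree (sketch). */
class EncodingMismatchDemo {

    /** Builds the document bytes: 'declared' goes in the XML declaration, 'actual' encodes the bytes. */
    static byte[] document(String declared, Charset actual) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declared + "\"?>"
                   + "<name>Caf\u00e9</name>"; // contains one non-ASCII character
        return xml.getBytes(actual);
    }

    /** Returns true if the bytes parse as a well-formed document. */
    static boolean parses(byte[] bytes) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing anything.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("declared ISO-8859-1, encoded ISO-8859-1: "
            + parses(document("ISO-8859-1", StandardCharsets.ISO_8859_1)));
        System.out.println("declared UTF-8, encoded ISO-8859-1: "
            + parses(document("UTF-8", StandardCharsets.ISO_8859_1)));
    }
}
```

Pure ASCII content would parse in both cases, which is why this problem often goes unnoticed until the first accented character shows up in a document.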
XML parsers have a place as a document and schema development tool. They provide the means for you to create XML documents and grammars in many forms (DTDs, XML Schema, and Relax-NG) and verify that the grammars you’ve written do what you want and that your documents conform to those grammars. The reality is that most developers are doing less with XML parsers directly.