Datasheet

Serialization

Most of the classes included with Xerces focus on taking XML documents, extracting information out of them, and passing that information on to your application via an API. Xerces also includes some classes that help you with the reverse process: taking data you already have and turning it into XML. This process is called serialization (not to be confused with Java serialization). The Xerces serialization API can take a SAX event stream or a DOM tree and produce an XML 1.0 or 1.1 document. One major improvement in XML 1.1 is that many more Unicode characters can appear in an XML 1.1 document; however, this makes it necessary to have a separate serializer for XML 1.1. There are also serializers that can take an XML document and serialize it using rules for HTML, XHTML, or even text files.
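To make the idea of turning a DOM tree back into markup concrete, here is a minimal sketch. It deliberately uses the DOM Level 3 Load and Save API that ships with the JDK rather than the org.apache.xml.serialize classes, so it runs without Xerces on the classpath. The class name DomToXmlSketch and its serialize method are hypothetical names chosen for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper: builds a one-element DOM tree and serializes it.
// Uses the JDK's DOM Load and Save API as a stand-in for a Xerces serializer.
class DomToXmlSketch {
    static String serialize(String rootName, String text) {
        try {
            // Build the data we "already have" as a DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement(rootName);
            root.setTextContent(text);
            doc.appendChild(root);

            // Ask the DOM implementation for a serializer and write the tree out.
            DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
            LSSerializer serializer = ls.createLSSerializer();
            // Leave out the XML declaration so only the element markup is returned.
            serializer.getDomConfig().setParameter("xml-declaration", false);
            return serializer.writeToString(doc);
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }
}
```

Calling DomToXmlSketch.serialize("greeting", "hello") returns markup containing the greeting element; characters such as & in the text are escaped by the serializer.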

The org.apache.xml.serialize package includes five different serializers. All of them implement the interfaces org.apache.xml.serialize.Serializer and org.apache.xml.serialize.DOMSerializer, as well as the ContentHandler, DocumentHandler, and DTDHandler interfaces from org.xml.sax and the DeclHandler and LexicalHandler interfaces from org.xml.sax.ext. The five serializers are as follows:
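Because a serializer implements ContentHandler, it can sit at the end of a SAX event stream and write out whatever events it receives. The sketch below illustrates that pattern with the JDK's TransformerHandler, which is also a ContentHandler; it is a stand-in chosen so the example runs without Xerces, not the Xerces serializer itself. The class name SaxEventsToXmlSketch and its write method are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: fires SAX events at a ContentHandler that writes XML.
class SaxEventsToXmlSketch {
    static String write(String elementName, String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            // TransformerHandler implements ContentHandler, LexicalHandler, and DTDHandler.
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // The SAX event stream: one element with character content.
            handler.startDocument();
            handler.startElement("", elementName, elementName, new AttributesImpl());
            char[] chars = text.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", elementName, elementName);
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }
}
```

The calling code never builds a tree; it only reports events, and the handler is responsible for escaping and tag syntax.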

❑ XMLSerializer is used for XML 1.0 documents and, of course, obeys all the rules for XML 1.0.

❑ XML11Serializer outputs all the new Unicode characters allowed by XML 1.1. If the XML that you’re outputting happens to be HTML, then you should use either the HTMLSerializer or the XHTMLSerializer.

❑ HTMLSerializer is used to output a document as HTML. It knows which HTML tags can get by without an end tag.

❑ XHTMLSerializer is used to output a document as XHTML. It serializes the document according to the XHTML rules.

❑ TextSerializer outputs the element names and the character data of elements. It doesn’t output the DOCTYPE, DTD, or attributes.

Here are some of the differences in formatting when outputting HTML:

❑ The HTMLSerializer defaults to an ISO-8859-1 output encoding.

❑ An empty attribute value is output as an attribute name with no value at all (not even the equals sign). Also, attributes that are supposed to be URIs, as well as the content of the SCRIPT and STYLE tags, aren’t escaped (embedded ", ', <, >, and & are left alone).

❑ The content of A and TD tags isn’t line-broken.

❑ Most importantly, the HTMLSerializer knows that not all tags are closed in HTML. HTMLSerializer’s list of the tags that do not require closing is as follows: AREA, BASE, BASEFONT, BR, COL, COLGROUP, DD, DT, FRAME, HEAD, HR, HTML, IMG, INPUT, ISINDEX, LI, LINK, META, OPTION, P, PARAM, TBODY, TD, TFOOT, TH, THEAD, and TR.
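Two of the HTML rules above are easy to picture as code: the lookup of tags that do not need an end tag, and the bare attribute name written for an empty value. The following is an illustrative sketch only, not the way HTMLSerializer is implemented internally. The class HtmlRulesSketch and its methods are hypothetical, and escaping of attribute values is left out to keep the example short.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of two HTML output rules described in the text.
class HtmlRulesSketch {
    // The tags listed in the text as not requiring an end tag.
    private static final Set<String> END_TAG_NOT_REQUIRED = new HashSet<>(Arrays.asList(
            "AREA", "BASE", "BASEFONT", "BR", "COL", "COLGROUP", "DD", "DT", "FRAME",
            "HEAD", "HR", "HTML", "IMG", "INPUT", "ISINDEX", "LI", "LINK", "META",
            "OPTION", "P", "PARAM", "TBODY", "TD", "TFOOT", "TH", "THEAD", "TR"));

    // HTML tag names are case-insensitive, so normalize before the lookup.
    static boolean endTagRequired(String tagName) {
        return !END_TAG_NOT_REQUIRED.contains(tagName.toUpperCase(Locale.ROOT));
    }

    // An empty value is written as the attribute name alone, with no equals sign.
    // Escaping of the value is omitted in this sketch.
    static String attribute(String name, String value) {
        if (value.isEmpty()) {
            return name;
        }
        return name + "=\"" + value + "\"";
    }
}
```

With these rules, a BR element is written as a start tag only, and an attribute such as checked with an empty value is written as the single word checked.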

The XHTMLSerializer outputs a document according to the rules for XHTML. These rules are:

❑ Element and attribute names are written in lowercase because XHTML is case-sensitive.

❑ An attribute’s value is always written, even if the value is the empty string.

❑ Empty elements are written as an empty tag with a trailing slash (for example, <br />).

❑ The content of the SCRIPT and STYLE elements is serialized as CDATA.
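The first three XHTML rules can be sketched as small formatting functions. This is a hedged illustration of the rules as stated, not the XHTMLSerializer source; the class XhtmlRulesSketch and its methods are hypothetical names, and the escaping shown covers only &, <, and the double quote.

```java
import java.util.Locale;

// Hypothetical illustration of the XHTML output rules described in the text.
class XhtmlRulesSketch {
    // Minimal escaping for a double-quoted attribute value.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace("\"", "&quot;");
    }

    // Names are lowercased, and the value is always written, even when empty.
    static String attribute(String name, String value) {
        return name.toLowerCase(Locale.ROOT) + "=\"" + escape(value) + "\"";
    }

    // An empty element is written as a single tag with a trailing slash.
    static String emptyElement(String name) {
        return "<" + name.toLowerCase(Locale.ROOT) + " />";
    }
}
```

For example, XhtmlRulesSketch.emptyElement("BR") yields <br />, and an empty SELECTED attribute is written as selected="" rather than as a bare name.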

Chapter 1
