Question 6 :
What's a Document Type Definition (DTD) and where do I get one?
A DTD is a description in XML Declaration Syntax of a particular type or class of document. It sets out what names are to be used for the different types of element, where they may occur, and how they all fit together. (A question C.16, Schema does the same thing in XML Document Syntax, and allows more extensive data-checking.)
For example, if you want a document type to be able to describe Lists which contain Items, the relevant part of your DTD might contain something like this:
<!ELEMENT List (Item)+> <!ELEMENT Item (#PCDATA)>
This defines a list as an element type containing one or more items (that's the plus sign); and it defines items as element types containing just plain text (Parsed Character Data or PCDATA). Validators read the DTD before they read your document so that they can identify where every element type ought to come and how each relates to the other, so that applications which need to know this in advance (most editors, search engines, navigators, and databases) can set themselves up correctly. The example above lets you create lists like:
<List> <Item>Chocolate</Item> <Item>Music</Item> <Item>Surfingv</Item> </List>
(The indentation in the example is just for legibility while editing: it is not required by XML.)
A DTD provides applications with advance notice of what names and structures can be used in a particular document type. Using a DTD and a validating editor means you can be certain that all documents of that particular type will be constructed and named in a consistent and conformant manner.
DTDs are not required for processing the tip in question Bwell-formed documents, but they are needed if you want to take advantage of XML's special attribute types like the built-in ID/IDREF cross-reference mechanism; or the use of default attribute values; or references to external non-XML files ('Notations'); or if you simply want a check on document validity before processing.
There are thousands of DTDs already in existence in all kinds of areas (see the SGML/XML Web pages for pointers). Many of them can be downloaded and used freely; or you can write your own (see the question on creating your own DTD. Old SGML DTDs need to be converted to XML for use with XML systems: read the question on converting SGML DTDs to XML, but most popular SGML DTDs are already available in XML form.
The alternatives to a DTD are various forms of question C.16, Schema. These provide more extensive validation features than DTDs, including character data content validation.
Question 7 :
How do I write my own DTD?
You need to use the XML Declaration Syntax (very simple: declaration keywords begin with
<!ELEMENT Shopping-List (Item)+> <!ELEMENT Item (#PCDATA)>
It says that there shall be an element called Shopping-List and that it shall contain elements called Item: there must be at least one Item (that's the plus sign) but there may be more than one. It also says that the Item element may contain only parsed character data (PCDATA, ie text: no further markup).
Because there is no other element which contains Shopping-List, that element is assumed to be the 'root' element, which encloses everything else in the document. You can now use it to create an XML file: give your editor the declarations:
<?xml version="1.0"?> <!DOCTYPE Shopping-List SYSTEM "shoplist.dtd">
(assuming you put the DTD in that file). Now your editor will let you create files according to the pattern:
<Shopping-List> <Item>Chocolate</Item> <Item>Sugar</Item> <Item>Butter</Item> </Shopping-List>
It is possible to develop complex and powerful DTDs of great subtlety, but for any significant use you should learn more about document systems analysis and document type design. See for example Developing SGML DTDs: From Text to Model to Markup (Maler and el Andaloussi, 1995): this was written for SGML but perhaps 95% of it applies to XML as well, as XML is much simpler than full SGML—see the list of restrictions which shows what has been cut out.
Incidentally, a DTD file never has a DOCTYPE Declaration in it: that only occurs in an XML document instance (it's what references the DTD). And a DTD file also never has an XML Declaration at the top either. Unfortunately there is still software around which inserts one or both of these.
Question 8 :
I keep hearing about alternatives to DTDs. What's a Schema?
The W3C XML Schema recommendation provides a means of specifying formal data typing and validation of element content in terms of data types, so that document type designers can provide criteria for checking the data content of elements as well as the markup itself. Schemas are written in XML Document Syntax, like XML documents are, avoiding the need for processing software to be able to read XML Declaration Syntax (used for DTDs).
There is a separate Schema FAQ at http://www.schemavalid.comFAQ. The term 'vocabulary' is sometimes used to refer to DTDs and Schemas together. Schemas are aimed at e-commerce, data control, and database-style applications where character data content requires validation and where stricter data control is needed than is possible with DTDs; or where strong data typing is required. They are usually unnecessary for traditional text document publishing applications.
Unlike DTDs, Schemas cannot be specified in an XML Document Type Declaration. They can be specified in a Namespace, where Schema-aware software should pick it up, but this is optional:
<invoice id="abc123" xmlns="http://example.org/ns/books/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://acme.wilycoyote.org/xsd/invoice.xsd"> ... </invoice>
More commonly, you specify the Schema in your processing software, which should record separately which Schema is used by which XML document instance.
In contrast to the complexity of the W3C Schema model, Relax NG is a lightweight, easy-to-use XML schema language devised by James Clark (see http://relaxng.org/) with development hosted by OASIS. It allows similar richness of expression and the use of XML as its syntax, but it provides an additional, simplified, syntax which is easier to use for those accustomed to DTDs.
Question 9 :
How will XML affect my document links?
The linking abilities of XML systems are potentially much more powerful than those of HTML, so you'll be able to do much more with them. Existing href-style links will remain usable, but the new linking technology is based on the lessons learned in the development of other standards involving hypertext, such as TEI and HyTime, which let you manage bidirectional and multi-way links, as well as links to a whole element or span of text (within your own or other documents) rather than to a single point. These features have been available to SGML users for many years, so there is considerable experience and expertise available in using them. Currently only Mozilla Firefox implements XLink.
The XML Linking Specification (XLink) and the XML Extended Pointer Specification (XPointer) documents contain the details. An XLink can be either a URI or a TEI-style Extended Pointer (XPointer), or both. A URI on its own is assumed to be a resource; if an XPointer follows it, it is assumed to be a sub-resource of that URI; an XPointer on its own is assumed to apply to the current document (all exactly as with HTML).
An XLink may use one of #, ?, or |. The # and ? mean the same as in HTML applications; the | means the sub-resource can be found by applying the link to the resource, but the method of doing this is left to the application. An XPointer can only follow a #.
The TEI Extended Pointer Notation (EPN) is much more powerful than the fragment address on the end of some URIs, as it allows you to specify the location of a link end using the structure of the document as well as (or in addition to) known, fixed points like IDs. For example, the linked second occurrence of the word 'XPointer' two paragraphs back could be referred to with the URI (shown here with linebreaks and spaces for clarity: in practice it would of course be all one long string):
This means the first link element within the second paragraph within the answer in the element whose ID is hypertext (this question). Count the objects from the start of this question (which has the ID hypertext) in the XML source:
1. the first child object is the element containing the question ();
2. the second child object is the answer (the element);
3. within this element go to the second paragraph;
4. find the first link element.
Eve Maler explained the relationship of XLink and XPointer as follows:
XLink governs how you insert links into your XML document, where the link might point to anything (eg a GIF file); XPointer governs the fragment identifier that can go on a URL when you're linking to an XML document, from anywhere (eg from an HTML file).
[Or indeed from an XML file, a URI in a mail message, etc…Ed.]
David Megginson has produced an xpointer function for Emacs/psgml which will deduce an XPointer for any location in an XML document. XML Spy has a similar function.
Question 10 :
How do I include one XML file in another?
This works exactly the same as for SGML. First you declare the entity you want to include, and then you reference it by name:
<?xml version="1.0"?> <!DOCTYPE novel SYSTEM "/dtd/novel.dtd" [ <!ENTITY chap1 SYSTEM "mydocs/chapter1.xml"> <!ENTITY chap2 SYSTEM "mydocs/chapter2.xml"> <!ENTITY chap3 SYSTEM "mydocs/chapter3.xml"> <!ENTITY chap4 SYSTEM "mydocs/chapter4.xml"> <!ENTITY chap5 SYSTEM "mydocs/chapter5.xml"> ]> <novel> <header> ...blah blah... </header> &chap1; &chap2; &chap3; &chap4; &chap5; </novel>
The difference between this method and the one used for including a DTD fragment (see question D.15, 'How do I include one DTD (or fragment) in another?') is that this uses an external general (file) entity which is referenced in the same way as for a character entity (with an ampersand).
The one thing to make sure of is that the included file must not have an XML or DOCTYPE Declaration on it. If you've been using one for editing the fragment, remove it before using the file in this way. Yes, this is a pain in the butt, but if you have lots of inclusions like this, write a script to strip off the declaration (and paste it back on again for editing).