CS99I XML INFO

Abstract by Gio Wiederhold.

XML briefly

We describe only a few basic constructs of the eXtensible Markup Language (XML). The current common version is XML 1.0. XML-formatted data are intended for processing, not just browsing. Browsing of XML is currently supported mainly by Internet Explorer 5.0.

Conventions

XML is defined by a W3C (World Wide Web Consortium) standard. XML uses embedded tags to indicate semantic structures, while leaving the interpretation of the tags to the client's application programs. The tags are bracketed by less-than (<) and greater-than (>) symbols. To enable this note to be printed in an HTML format we entered these symbols internally as &lt; for < and &gt; for >. XML also uses square brackets.

The character set for XML is Unicode, a standard covering the characters used for HTML as well as other alphabets. In their fundamental form, characters can be written as numeric references that start with an ampersand and a number sign (&#), are denoted by an integer, and end with a semicolon (;). However, the ASCII character code we use is an accepted default subset.
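As a small illustration of numeric character references, the sketch below lets Python's standard xml.etree.ElementTree parser resolve one. The element name title and the sample text are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# "&#233;" is the decimal reference for U+00E9 (e with acute accent);
# "&#xE9;" is the hexadecimal spelling of the same character.
# The <title> element and its text are invented for this illustration.
decimal_doc = "<title>Caf&#233;</title>"
hex_doc = "<title>Caf&#xE9;</title>"

decimal_text = ET.fromstring(decimal_doc).text
hex_text = ET.fromstring(hex_doc).text

# The parser replaces each reference by the Unicode character it denotes.
print(decimal_text)
print(ord(decimal_text[-1]))      # code point of the last character
print(decimal_text == hex_text)   # both spellings give the same string
```

Plain ASCII text needs no such references, which is why the examples in these notes rarely show them.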

General layout

Each document should start with an XML declaration, which has the form of a processing instruction,
<?xml version="1.0"?>,
here indicating that the document conforms to XML version 1.0, followed by the tagged root element of the document. A simple example, an XML document with 4 early Hitchcock movies (requires Microsoft Internet Explorer 4.+), uses tags suitable for movies; its root element is hence called <movies>.
Every start tag has a corresponding end tag; for instance, there should be an end tag </movies> at the end of the document. If the content is empty, an abbreviation is allowed that combines both tags: <no-content/>.
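The layout rules above can be checked with any XML parser. The following is a minimal sketch using Python's standard library; only the root name movies comes from these notes, while the child element names (movie, title, year, remarks) and the sample entry are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the movies document. Only <movies> is taken
# from the notes; the other element names are assumed for this sketch.
doc = """<?xml version="1.0"?>
<movies>
  <movie>
    <title>The Lodger</title>
    <year>1927</year>
    <remarks/>
  </movie>
</movies>"""

root = ET.fromstring(doc)
print(root.tag)                       # name of the root element

# <remarks/> is the abbreviated form of <remarks></remarks>: no content.
remarks = root.find("movie/remarks")
print(remarks.text, len(remarks))     # no text and no child elements

# A start tag without its end tag makes the document ill-formed.
try:
    ET.fromstring("<movies><movie></movies>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```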

Besides nested elements, the only content permitted in XML is character strings (character data, which a DTD declares as #PCDATA for element text and as CDATA for attribute values). No quotes are needed around the text inside an element. An example of a large document, publicly available, is Shakespeare's Taming of the Shrew. Internet Explorer will simply show it with all the markup, since we have not defined a conversion for presentation for it. But you can see how the meaningful tags now allow searching for items such as the characters (<PERSONA>) in the play. All the tags used are listed in the accompanying Data Description.
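To suggest how meaningful tags support searching, here is a sketch that collects every PERSONA element from a play. The excerpt is hand-written for this example and is not copied from the actual file; only the tag name PERSONA is taken from the notes, and the enclosing PLAY, TITLE, and PERSONAE tags are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the play; the real document is far larger.
# PLAY, TITLE and PERSONAE are assumed tag names for this sketch.
play = """<PLAY>
  <TITLE>The Taming of the Shrew</TITLE>
  <PERSONAE>
    <PERSONA>KATHARINA</PERSONA>
    <PERSONA>PETRUCHIO</PERSONA>
  </PERSONAE>
</PLAY>"""

root = ET.fromstring(play)

# iter() walks the whole tree, so the search works at any nesting depth.
characters = [persona.text for persona in root.iter("PERSONA")]
print(characters)
```

The program needs to know only the tag name, not where in the document the characters are listed.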

The Processing Schema -- DTD

It is important to document what elements can appear in an XML document, and how they are to be arranged. That notation is called a Document Type Definition (DTD). We show a sample DTD used for plays of Shakespeare. It uses some of the symbols (*, +, ?) encountered when presenting regular expressions in the notes. Notice that the tags are now related to the subject matter, and semantically meaningful for people who understand plays. However, <FM> and <P+> are mysteries until one looks at the contents; they denote 1 or more lines of comments entered by the people who created the XML version of the play.
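The occurrence symbols carry the same meaning in a DTD as in regular expressions: * is zero or more, + is one or more, and ? is optional. The sketch below makes that connection explicit by translating a flat content model into a Python regular expression over the sequence of child tag names. It is not a DTD validator; the helper names model_to_regex and conforms are hypothetical, the model (P+) for FM is assumed from the description above, and nested groups or alternatives are not handled.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Translate a flat content model such as "(TITLE, P+, NOTE?)"
    into a regex over child tag names, each followed by one space."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        # Separate the element name from an optional occurrence symbol.
        if item[-1] in "*+?":
            name, suffix = item[:-1], item[-1]
        else:
            name, suffix = item, ""
        parts.append("(?:" + re.escape(name) + " )" + suffix)
    return re.compile("".join(parts))

def conforms(element, model):
    """True if the element's children, in order, satisfy the model."""
    child_tags = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(child_tags) is not None

# Assumed declaration for this sketch: FM contains one or more P.
fm = ET.fromstring("<FM><P>First comment line.</P><P>Second line.</P></FM>")
empty_fm = ET.fromstring("<FM/>")

print(conforms(fm, "(P+)"))        # two P children satisfy "one or more"
print(conforms(empty_fm, "(P+)"))  # no children do not
print(conforms(empty_fm, "(P*)"))  # but they satisfy "zero or more"
```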

Looking at an XML document

Since XML does not specify how documents are to be presented, that task is left up to the applications that read XML files. Without any such program, our browsers (currently Microsoft Internet Explorer 4.0 or, better, 5.0) simply show the XML source document as a simple list, with all the tags explicitly shown. (See the list of 4 Hitchcock movies in XML form, converted manually to an HTML representation.)

To obtain well-formatted visible output, XML data must be converted to HTML, and this should be done automatically. Such a conversion could be done for any specific XML file by any program that understands the tags and accordingly creates a suitable HTML file. Such a program can be written in Java, which allows its execution on the client computer (see Notes for meeting 3). Such a Java applet is available in Microsoft Internet Explorer as XMLDSO (XML Data Source Objects). You can see a simple example of its use on the 4 Hitchcock movies and on the full list. Such a longish list gets to be awkward.
For more detail, see Microsoft Data Binding Documentation (Jan. 2000).
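The kind of conversion program described above can be sketched in a few lines. This version is written in Python rather than as a Java applet, and it is not how XMLDSO works internally. The element names movie, title, and year, the function name movies_to_html, and the two sample entries are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical input; only the root name <movies> comes from the notes.
doc = """<movies>
  <movie><title>The Lodger</title><year>1927</year></movie>
  <movie><title>Blackmail</title><year>1929</year></movie>
</movies>"""

def movies_to_html(xml_text):
    """Build an HTML table with one row per <movie> element."""
    root = ET.fromstring(xml_text)
    rows = ["<TR><TH>Title</TH><TH>Year</TH></TR>"]
    for movie in root.iter("movie"):
        # escape() keeps characters such as < and & from breaking the HTML.
        title = escape(movie.findtext("title", default=""))
        year = escape(movie.findtext("year", default=""))
        rows.append(f"<TR><TD>{title}</TD><TD>{year}</TD></TR>")
    return "<TABLE>\n" + "\n".join(rows) + "\n</TABLE>"

html_table = movies_to_html(doc)
print(html_table)
```

The program has to understand the tags: it knows that title and year are the fields worth showing, which the XML file alone does not say.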

To create prettier outputs, we have to generate fancier HTML output. For instance, to split the long table for all of Hitchcock's films by category we defined a table in the program. We can also rearrange fields.

Long tables take long to load and are hard to manipulate. We can limit the size of the table to be shown with a
<TABLE DATAPAGESIZE=8 ID=table WIDTH=100% datasrc=#xmldso>
specification. To allow paging through that table we add a button,
<INPUT TYPE="button" VALUE="Next" ONCLICK="table.nextPage();">
whose click handler refers to that table's ID. Now we can look at Hitchcock's films page by page. This example shows all Hitchcock movies, but not the director's heading (file Hitch0.xml).
A test version that includes the director's heading uses a two-level HTML table constructed by hand (file Hitch.xml), again showing Hitchcock's films page by page.
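The effect of a page size of 8 combined with a Next button can be modelled as a window that slides over the list of rows. The sketch below is only such a model, not the browser's implementation; the class PagedTable, its method names, and the 20 placeholder rows are all hypothetical.

```python
class PagedTable:
    """Model of a table that shows page_size rows at a time."""

    def __init__(self, rows, page_size=8):
        self.rows = list(rows)
        self.page_size = page_size
        self.start = 0                      # index of the first visible row

    def current_page(self):
        return self.rows[self.start:self.start + self.page_size]

    def next_page(self):
        # Advance only if at least one row remains beyond this page.
        if self.start + self.page_size < len(self.rows):
            self.start += self.page_size
        return self.current_page()

# Placeholder rows standing in for the film list.
table = PagedTable([f"film {i}" for i in range(1, 21)], page_size=8)

first = table.current_page()    # rows 1 to 8
second = table.next_page()      # rows 9 to 16
third = table.next_page()       # rows 17 to 20, a short last page
again = table.next_page()       # no further page, so the view stays put
print(len(first), len(second), len(third), again == third)
```

Only the visible slice has to be rendered at any moment, which is what keeps a long list manageable.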

This formatting is created by the XML client application, here the HTML program we have written. An alternative is to have the formatting provided by a server, for which a style sheet can be specified.

XSL style sheets

With a suitable style sheet, fancy formatting can be generated. Style sheets were common in SGML, so that a publisher could determine the printing layout of a book. For XML files the W3C has proposed a general meta language, the Extensible Stylesheet Language (XSL).

All tags that match the XSL specification are then not shown. Examples show a formatted table with
four Hitchcock movies as well as all Hitchcock movies, formatted in multiple tables, using an XSL style sheet source. The XSL source has been converted to HTML for readability under any browser.

An XSL interpreter is included in Microsoft IE version 5.0. If no XSL is indicated, it will use a default style sheet that is book-oriented. It recognizes the following tags:

    book?

(book format).

In a style sheet, layout styles, relative sizes, and colors can be indicated.

Cross References

The ability to go to other documents exists also in XML, as in HTML, but differs in format.

XML Checkers

Notes

Contents of: Charles F. Goldfarb and Paul Prescod: The XML Handbook, 3rd edition; Prentice Hall, 2001.
XSL information
[W3C98]W3C: Extensible markup language. 1998.
[McGrath98] Sean McGrath: XML by Example: Building E-Commerce Applications; Prentice-Hall, Charles F. Goldfarb Series on Open Information Management, 1998.
See also the CS99I references.
For limitations see Madnick paper.