Create Your First XML Document

This is the second in a series of articles on XML, designed as a tutorial. If you’re doing the exercises as you read, by the time you finish this section you will have written a simple XML “family tree” document. Good luck! - Dwight Baer

A. Introduction and Review

B. Create Your First XML Document

C. Appendices

Appendix 1 – The MSXML parser

Appendix 2 – The Strengths and Limitations of SGML

Appendix 3 – The Advantages of XML Compared to HTML

Appendix 4 – XML and The World Wide Web Consortium

Appendix 5 – Glossary

A. Introduction and Review

To benefit from this tutorial, you should have a bit of an HTML background. You should also have read my Article #1, “XML Alphabet Soup” unless you already have a good idea of what XML is.

A few minutes from now, you will have:

• Reviewed the highlights of Article 1, “XML Alphabet Soup”.

• Created a 5-line XML document by pasting text from this article into Notepad and displaying it using Internet Explorer.

• Read as much as you care to of a number of appendices on topics such as “The Advantages of XML” and “XML vs. HTML”.

If you prefer to learn by doing, I’ve tried to meet your needs in Section B by separating out the practical steps you need to take to create an XML document.

Actually, if all you want to do is create a simple XML document, this tutorial will be quite short. Most of this article consists of appendices (listed in the Table of Contents) that give you background information you might want in order to fully understand the importance of XML in the world of Information Technology today. The Glossary is included in this article and will be referenced in coming articles.

Review of a Few Highlights of “XML Alphabet Soup”:

XML is, in fact, “the next big thing” in computer technology. Some people compare it to the invention of relational databases. I have even heard of XML being compared to the invention of the PC!

Here are a few acronyms you probably already know:

• SGML – Standard Generalized Markup Language – the predecessor of XML.

• HTML – Hypertext Markup Language – a “distant cousin” of XML.

• XSL and XSLT – Extensible Stylesheet Language and XSL Transformations – Use XSLT to re-purpose XML documents.

• DTD – Document Type Definition. This is the “dictionary” of an XML document. When two people or applications both have access to a DTD, they can be certain of the meaning of the data contained in an XML document that uses that DTD.

• XML was created to address the limitations of SGML and HTML.

It is predicted that, before long, technologies using XML will permeate the way we build applications, as well as the way we store and communicate data.

B. Create Your First XML Document

An XML document can be created in many ways:

• You can use a normal text editor, for example Notepad. Don’t use a word processor, such as Word, unless you plan to save your document as plain text. One of the main strengths of XML is that any XML document is readable by any text editor. (A normal Word document is only readable using Word.)

• You can use an XML editor, such as XML Spy. A 30-day free trial version of XML Spy is available at xmlspy.com. Recommendation: For a “lightweight” text editor, try Textpad.

• Computer applications (software tools) can output XML. For example, the latest version of Adobe’s FrameMaker (FrameMaker 7) is capable of producing XML.

• An XML document can also be created from another XML document, or from a database.
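To make the last two bullets a little more concrete, here is a minimal sketch of a program that outputs XML. It uses Python and its standard xml.etree.ElementTree module, which are my choice for illustration and not something this tutorial requires. The members list is hypothetical data standing in for rows you might read from a database.

```python
import xml.etree.ElementTree as ET

# Hypothetical data, standing in for rows read from a database.
members = [
    ("father", "Michael Eksamel"),
    ("mother", "Hannah Eksamel"),
    ("child", "Emily Eksamel"),
]

def build_family(rows):
    """Build a <family_root> element with one child element per (tag, name) row."""
    root = ET.Element("family_root")
    for tag, name in rows:
        ET.SubElement(root, tag).text = name
    return root

# Serialize the element tree to a string of XML text.
xml_text = ET.tostring(build_family(members), encoding="unicode")
print(xml_text)
```

The result is the same kind of family tree document you are about to type by hand.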

So, open your text or XML editor, and type or copy/paste the following lines:

<?xml version="1.0" ?>
<family_root>
  <father>Michael Eksamel</father>
  <mother>Hannah Eksamel</mother>
  <child>Emily Eksamel</child>
</family_root>

Save your document as “family1.xml”.

Now, open the document in Internet Explorer.

Your output should look something like:

Try clicking on the minus sign to the left of the <family_root> tag. Click again. This is Internet Explorer’s default style sheet, interpreting your XML document. It’s allowing you to collapse or expand the XML “paragraph” you’ve created.

What do I mean by “style sheet”? Well, in very simple terms, one of XML’s main distinctions is that it separates the description of the data from its format. XML describes the data. In the example above, there is no question that Michael Eksamel is the father in our family tree, because the markup around his name says it clearly: <father>. XML relies on other technologies, such as Cascading Style Sheets (CSS) or Extensible Stylesheet Language (XSL) to do the display. Internet Explorer has been built with a default style sheet to display your XML document as seen above.
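If you would like to see this separation outside the browser, here is a small sketch in Python (my choice of language for illustration, using only the standard library). The two functions as_plain_text and as_html_list are hypothetical names of my own; they play the role of two very simple “style sheets” applied to the same data.

```python
import xml.etree.ElementTree as ET

FAMILY_XML = """<family_root>
  <father>Michael Eksamel</father>
  <mother>Hannah Eksamel</mother>
  <child>Emily Eksamel</child>
</family_root>"""

def as_plain_text(root):
    """Format the data as one 'role: name' line per child element."""
    return "\n".join(f"{el.tag}: {el.text}" for el in root)

def as_html_list(root):
    """Format the very same data as an HTML bulleted list."""
    items = "".join(f"<li><b>{el.tag}</b> {el.text}</li>" for el in root)
    return f"<ul>{items}</ul>"

# The XML only describes the data; each function decides how to display it.
root = ET.fromstring(FAMILY_XML)
print(as_plain_text(root))
print(as_html_list(root))
```

Notice that the XML itself never changes. Only the code that formats it does.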

Now let’s clarify a few things about this XML document, line by line:

1. <?xml version="1.0" ?>

This is the declaration. In theory, according to the W3C Recommendation, the declaration is optional. However, there is no practical reason to leave it out. It must be the first line of your XML document, with no white space or other characters above it.
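You can check both claims (the declaration is optional, but nothing may come before it) with any XML parser. The sketch below uses Python’s standard xml.etree.ElementTree module as a stand-in for the parser inside Internet Explorer, which is an assumption on my part; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

with_declaration = '<?xml version="1.0" ?><family_root/>'
without_declaration = '<family_root/>'
blank_line_first = '\n<?xml version="1.0" ?><family_root/>'

print(parses(with_declaration))     # declaration on the first line
print(parses(without_declaration))  # declaration left out
print(parses(blank_line_first))     # white space above the declaration
```

Only the third document is rejected, because of the blank line above the declaration.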

2. <family_root> This is an opening tag. A tag begins with a left angle bracket and ends with a right angle bracket. The difference between an opening tag and a closing tag is that the closing tag begins with a forward slash after the left angle bracket (notice the closing tag </family_root>). Notice that every opening tag in this document has a corresponding closing tag.

Everything from the <family_root> opening tag through the </family_root> closing tag, including the tags themselves, is called an element.

<family_root> is also the root element. There must always be exactly one root element in an XML document. Notice that I called this root element “family_root”, just to clarify that this is the root element of our family tree example. However, I could just as well have called it <family> or even <f>.
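Here is a short sketch of the root element rule, again using Python’s standard xml.etree.ElementTree parser for illustration (my assumption; any conforming XML parser should behave the same way). The helper name root_name is hypothetical.

```python
import xml.etree.ElementTree as ET

one_root = "<family_root><father>Michael Eksamel</father></family_root>"
two_roots = "<father>Michael Eksamel</father><mother>Hannah Eksamel</mother>"
renamed_root = "<f><father>Michael Eksamel</father></f>"

def root_name(text):
    """Return the root element's name, or None if the text is not well-formed."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError:
        return None

print(root_name(one_root))      # a single root element is accepted
print(root_name(two_roots))     # two top-level elements are rejected
print(root_name(renamed_root))  # the root's name is up to you
```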

3. <father>, <mother> and <child> tags – Notice that each of these “child” tags is completely contained (opened and closed) within its “parent” tag, <family_root>.

Every tag that is contained within another tag is called a “child” tag, whereas the tag that contains the “child” tag is called the “parent” tag.

In this example, I have thoughtfully used the analogy of the family tree. I hope you don’t become confused by too many analogies going on:

• In a family tree, we can visualize children being born from their parents. Similarly, in XML, every element that is contained within another element is called a “child” element. The containing element is called the “parent” element.

• From the analogy of the tree, we can visualize branches springing from the root of the tree. Similarly, in XML, there is one root element within which all other elements are contained.
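The parent and child relationship is exactly what a program sees once the document has been parsed. The following is a minimal sketch in Python, using the standard xml.etree.ElementTree module (my choice for illustration). The parent_of dictionary is a helper I build myself, because this particular module does not store a link from a child back to its parent.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""<family_root>
  <father>Michael Eksamel</father>
  <mother>Hannah Eksamel</mother>
  <child>Emily Eksamel</child>
</family_root>""")

# Iterating over an element yields its child elements, in document order.
children = [el.tag for el in root]
print(children)

# Map every child element back to the element that contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}
emily = root.find("child")
print(parent_of[emily].tag)
```

Note that, in XML terms, <father>, <mother> and <child> are all children of <family_root>, whatever their roles in the family.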

4. I am going to say a lot more about elements in my next article, entitled “Create a Well-Formed XML Document”. For now, though, I’d just like to list the rules for tag (or element) names:

• An XML tag consists of a string delimited by a less-than sign (<) and a greater-than sign (>).

• The start tag and the end tag define the beginning and the end of each XML element.

• The name of an XML tag must conform to the following rules:

  ◦ A tag name must start with a letter (upper- or lower-case) or an underscore ( _ ).

  ◦ A tag name may contain letters, numbers, underscores, dashes or periods.

  ◦ Tag names are case sensitive. <Mother> and <mother> are two different XML tags.

  ◦ A tag name must not begin with the letters XML, in any combination of upper and lower case.

  ◦ A tag name may contain the colon character, in theory, but this is not a good idea in practice. Why? Because of a special XML construct called a namespace, which uses the colon.
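The naming rules above can be sketched as a small checking function. This is a simplification under stated assumptions: it is written in Python (my choice), it only considers unaccented English letters, whereas the XML Recommendation also permits many other characters, and it treats the colon as not allowed, following the advice above. The name is_recommended_tag_name is hypothetical.

```python
import re

# First character: a letter or underscore.
# Remaining characters: letters, digits, underscores, periods or dashes.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def is_recommended_tag_name(name):
    """Check a tag name against the simplified rules listed in this article."""
    # Names beginning with "xml" (in any mix of case) are reserved.
    if name.lower().startswith("xml"):
        return False
    # The colon is left out of the pattern on purpose (namespaces use it).
    return _NAME.fullmatch(name) is not None

for candidate in ["family_root", "_note", "first-name", "1st_child", "XmlData", "ns:tag"]:
    print(candidate, is_recommended_tag_name(candidate))
```

The first three names pass; the last three fail because of the leading digit, the reserved prefix, and the colon, respectively.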

Wrapping Up: So … Did you successfully create an XML document? Did it display in Internet Explorer? Let me know if you have any questions, or even if you don’t, I’d be happy to hear from you. Dwight Baer

C. Appendices

Appendix 1 – The MSXML parser

What is MSXML?

MSXML is a set of services that provide functions for working with XML documents. A primary use of MSXML is to parse, generate, validate and transform XML documents so that the information can be displayed, stored, or manipulated.

How can you make your Internet Explorer browser become an XML parser and validator? The short answer is: Download Internet Explorer 6, then download MSXML Core Services 4 from the Microsoft site.

Procedure to install MSXML Core Services 4 (I’m taking you by a somewhat indirect route so you get to read the background information):

Click on the “MSXML Core Services 4” link above to go to the “What's New in the October 2001 Microsoft XML Core Services (MSXML) 4.0 Release” page.

Click on “MSDN Downloads”.

Click on “Installer”. Click Yes to accept the terms.

Save the file on your hard disk.

Run the executable you just downloaded.

Note 1: For an easy-to-follow procedure to install the MSXML parser including the download files, visit the Front-Runner site. Note: Front-Runner doesn’t necessarily promise you the latest version of the parser, but it does work with Internet Explorer 5.5 and 6.0.

Note 2: You must uninstall previous versions of MSXML before you install MSXML 4.

Note 3: You must have Windows Installer 2.0 in order for this install to work. If you need to install it, here is the overview, the download for Windows 98 and ME and the download for Windows NT and 2000.

Follow these steps to verify the version of the Windows Installer that is on your computer:

• Locate Msi.dll on your computer. By default, this file is in the \Windows\System folder on Windows 95 and Windows 98 computers, or the \Winnt\System32 folder on Windows NT 4.0 computers.

• Right-click Msi.dll and then click Properties.

• Click the Version tab.

Note 4: In the end, if you’re not sure what version of MSXML you’re using, check out Microsoft Knowledge Base article Q296647. You will need to download “filemon.exe”, which is a very interesting utility to help you determine file dependencies.

Here’s a mystery to solve: After I installed the latest (as of April, 2002) MSXML4 Core Services, I found that MSXML4.dll was indeed installed on my system but there was a problem. I tried this on both Windows 98 and Windows 2000 Server, and on both systems, filemon.exe (described above) reported MSXML3.dll as the file that was being used when I would open an XML file in Internet Explorer. I had tried filemon.exe before and it showed that MSXML4.dll was indeed being used, but Internet Explorer seems to insist on using MSXML3.dll with the latest version of IE and MSXML Core Services. Please e-mail me to let me know if you find an answer to this question.

Microsoft has been a strong supporter of XML since even before it became “official” in 1998. However, the extent of XML support in Internet Explorer has been a story of leapfrog. Each iteration of Internet Explorer included support for an older version of the XML parser, MSXML. As a result, even Internet Explorer 5.5 has “incomplete” XML support, compared to our current requirements. For most people, all you need to know is in the short answer above (download Internet Explorer 6 and MSXML Core Services 4).

Appendix 2 – The Strengths and Limitations of SGML

SGML (Standard Generalized Markup Language) became an ISO standard in 1986 and, like XML, allows developers to create their own markup languages. SGML is like a father to HTML, and more like a big brother to XML. It offers a high degree of control over the structure of a document.

Some of the strengths of SGML, all of which are also strengths of XML:

• It is platform independent. You are not tied to using Unix, Windows, MacOS, or any other operating system.

• It uses a text-based markup, i.e. tags, which are readable by any standard text editor.

• There is no pre-defined set of markup instructions. SGML is a meta-language, which allows you to create your own markup language.

• SGML separates appearance and content. By doing this, the information marked up using SGML is reusable. One document can be re-formatted for many different outputs.

The limitations of SGML:

• SGML is too powerful and too big to be used in a Web browser.

• It is so generalized and the specification is so difficult to implement that SGML applications are prohibitively expensive to build and maintain.

• There is a joke that the acronym “SGML” stands for “Sounds Good, Maybe Later”!

• XML was written specifically in order to be a pared-down, lightweight version of SGML.

Appendix 3 – The Advantages of XML Compared to HTML

XML	HTML
You can sort information in the browser, without accessing a database	Inability to sort on the fly
You can separate parts of a document or pull parts of it from external sources	Little ability to separate portions of a document
XSL, XSLT and CSS are used for formatting	No capabilities for advanced formatting (e.g. chemical symbols)
Content- and structure-based	Display- or format-based
Unlimited types of tags, created and defined by the developer of the XML document	Limited set of tags, defined by W3C, implemented haphazardly by browser creators.
Strict rules governing syntax	Lax syntax rules
Hierarchical structure in which each element is nested properly	Proper nesting is not required
Attributes must be surrounded by quotes	Quoted attributes are not required
Entities must be declared in a DTD	DTDs are not part of the HTML specification. (HTML itself could be seen as a DTD).
Tags are case-sensitive	Tags are not case-sensitive
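Several rows of this table (strict syntax, proper nesting, quoted attributes, case-sensitive tags) can be checked directly with an XML parser. Here is a minimal sketch using Python’s standard xml.etree.ElementTree module, which I am using as a stand-in for any conforming parser. The helper name is_well_formed and the sample snippets are my own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Markup that a lenient HTML browser would typically display anyway,
# next to the stricter form that XML requires.
samples = {
    "properly nested": "<p><b>bold</b></p>",
    "overlapping tags": "<p><b>bold</p></b>",
    "quoted attribute": '<a href="page.html">link</a>',
    "unquoted attribute": "<a href=page.html>link</a>",
    "case mismatch": "<Mother>Hannah</mother>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

Only the “properly nested” and “quoted attribute” samples are accepted.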

Appendix 4 – XML and The World Wide Web Consortium

The World Wide Web Consortium, otherwise known as the W3C, is the organization that created what is known as the XML Recommendation, which has experienced near universal acceptance since its first publication in 1998.

The W3C was established in 1994 as a result of the effort of Tim Berners-Lee, who in 1990 had written HTML (Hypertext Markup Language). The W3C has published more than 35 Recommendations since its inception. Each Recommendation not only builds on the previous one, but is designed so that it may be integrated with future specifications as well. The W3C is transforming the architecture of the initial Web (essentially HTML, URIs, and HTTP) into the architecture of tomorrow's Web, built atop the solid foundation provided by XML.

Here is a diagrammatic representation of the relationship of XML to a few of the other W3C recommendations:

Source: About the World Wide Web Consortium.

The W3C's mission is to lead the Web to its full potential. Some of its operating principles are: Universal access, trust, interoperability, and decentralization. See W3C in Seven Points.

Appendix 5 – Glossary

Sources:

http://www.xmlmag.com/upload/free/features/xml/1999/01win99/glwin99/glwin99.asp and

http://www.gca.org/whats_xml/whats_xml_glossary.htm

attribute: An attribute is a property of an element. Often attributes are used to pass information about the element and hence can be said to provide metadata for the element. The attribute name, a value indicator (=) and the attribute value are specified within an XML tag (e.g. <a href="http://www.w3.org/TR">).

authXML: AuthXML is a vendor-neutral standard that enables integration of Web security, network security, B2B infrastructures and applications. AuthXML is named as such because it comprises 2 primary components: Authentication and Authorization and is designed to ease integration of transactions between trading partner sites that may be using different security systems and within a given site that may be deploying multiple applications that need integrated security.

BizTalk: An industry initiative started by Microsoft and supported by a wide range of organizations. BizTalk is a community of standards users, whose goal is the adoption of XML in electronic commerce and application integration through the BizTalk Framework, a set of guidelines for how to publish schemas in XML.

character: A character is an atomic unit of text as specified by ISO/IEC 10646. A character is a single letter, digit, or punctuation mark.

character data: All text characters that are not markup characters make up the character data of the document.

child element: An element contained within another element is known as a child element. The element containing other elements is known as the parent element.

CSS (Cascading Style Sheets): A means of defining certain document elements (paragraphs, headings, fonts, colors, positioning, backgrounds) with style rules instead of additional markup tags.

declaration: XML documents should begin with an XML declaration which specifies the version of XML being used.

DOCTYPE (Document Type Declaration): The declaration that ties an XML document to its Document Type Definition (DTD). The purpose of a DTD is to define the legal building blocks of an XML document. It defines the legal structure with a list of legal elements. A DTD can be declared inline in your XML document, or as an external reference.

DOM (Document Object Model): A platform- and language-neutral interface that allows scripts and programs to access and update dynamically the content, structure, and style of documents. It provides a standard set of objects for representing HTML and XML documents, a model of how these objects can be combined, and an interface for accessing and manipulating them.

delimiter: A delimiter is a special character that marks the beginning and end of a string or text field.

document: A document is a class of data object. A document in XML may be the text of a printed work. It may also be a set of database records.

DTD (Document Type Definition): The rules that define the tags that can be used in an XML file and their valid values.

EDI (Electronic Data Interchange): The electronic communication of business transactions between organizations. XML complements EDI because it can be used to exchange e-commerce information.

element: An element is a logical data structure within an XML document. In XML, start tags and end tags show the beginning and end of an element.

empty element: Not all elements have content. Those elements that do not have content are empty elements and in XML may be noted with a special empty element tag that ends with a slash directly preceding the closing angle bracket of the tag (<empty/>).

entity: An entity in XML is a virtual storage unit. An XML entity is often a separate file, but may be a string or even a database record.

generic identifier: A generic identifier, often called the "GI" is the XML tag name. So <head> has a generic identifier equal to "head". A generic identifier is unique in its namespace.

gXML (Guideline XML): A file structure supported by EDI software company Edifecs Commerce that allows the open exchange of electronic commerce guidelines.

HTML (HyperText Markup Language): A nonproprietary methodology for creating Web pages. HTML defines the page layout, fonts, graphic elements, and hypertext links to other Web documents by embedding tags (codes) within the text.

IETF (Internet Engineering Task Force): An organization of working groups that identifies problems and proposes technical solutions for the Internet. They publish XML-related RFCs (Requests for Comments) and specifications.

LT XML: An integrated set of C++ and Java-based XML tools from the Language Technology Group for processing XML documents.

Layman-Bray: A proposal for XML namespaces (groups of names defined according to some naming convention) that ensures that names remain unambiguous even if chosen by more than one author.

markup characters: A markup character is a text character that identifies the storage and logical structures of the data. Tags and entities are markup characters of an XML document.

MathML (Mathematical Markup Language): An XML methodology for describing mathematical notations on the Web, just as HTML does for ordinary text.

Metadata: Data that describes other data. Metadata about an XML document is described in the DTD or in the XML document itself, enabling other applications to interact with it.

Metalanguage: A language that describes other languages. SGML and XML can be considered metalanguages because they define markup languages.

namespace: A namespace is a set of unique identifiers.

notation: Notations identify by name the format of unparsed entities, the format of elements which bear a notation attribute, or the application to which a processing instruction is addressed.

notation declaration: Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation. http://www.w3.org/TR/REC-xml#dt-notation

OASIS (Organization for the Advancement of Structured Information Standards): A consortium of companies and individuals that collects and publishes XML specifications, DTDs, and schemas. By standardizing specifications, OASIS hopes to advance the open interchange of documents and structured information objects.

OSD (Open Software Description) Format: An XML-based specification designed by Microsoft and Marimba to automate software distribution. OSD uses unique XML tags to describe software packages.

parent element: An element containing other elements is known as the parent element. The elements contained within the parent element are known as child elements.

parser, XML: An XML parser is a processor that reads an XML document and determines the structure and properties of the data. If the parser goes beyond the XML rules for well-formedness and validates the document against an XML DTD, the parser is said to be a "validating" parser.

Processing instructions (PIs) allow documents to contain instructions for applications. PIs are not part of the document's character data, but must be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter entity references are not recognized within processing instructions.

RDF (Resource Description Framework): A model for describing and interchanging metadata. It allows a Web site to describe its dynamic (user-created) content without having to store static pages that contain that content.

RFC (Request for Comments): A document used by the IETF to describe the specifications for a recommended technology.

root element: Every XML document has one element that contains all other elements of the document. The root element is also called the document element.

Schema: A system of representing a data model that defines the data's elements and attributes, and the relationship among elements.

SGML (Standard Generalized Markup Language): The "mother of all markup languages," it's a metalanguage used to construct other markup languages. XML is designed to be "an extremely simple dialect of SGML" (per the W3C XML specs) for the Web.

SMIL (Synchronized Multimedia Integration Language): A language designed to integrate multimedia objects into a synchronized presentation.

SVG (Scalable Vector Graphics): A vector graphics language written in XML. Using SVG, graphics can be coded directly into an XML document. Benefits of SVG include:

• smaller file size than regular bitmapped graphics such as GIF and JPEG files

• resolution independence, so that the image can scale down or up to fit proportionally into any size display on any type of Net device

• text labels and descriptions for searchability

• ability to link to parts of an image

• complex animation

Stylesheet: A stylesheet is a rule or sequence of rules that affect the appearance and/or structure of a document.

Tags: Tags are text structures that mark the beginning and end of elements within the XML document. Tags are markup characters.

Unicode: A superset of the ASCII character set. Originally designed as a 16-bit character encoding scheme, it includes not only the standard Roman and Greek alphabets, but also mathematical symbols, special punctuation, and non-Roman character sets (Hebrew, Chinese, etc.).

URI (Uniform Resource Identifier): The general addressing scheme of which URLs (Uniform Resource Locators) are one kind. Technically, addresses that begin with http:// or ftp:// are specific kinds of URI.

valid: An XML document with an associated document type declaration that follows all the rules of that declaration is valid.

well-formed: A well-formed XML document follows all the rules of the XML specification but is not necessarily valid according to an associated document type declaration.

XFRML (Extensible Financial Reporting Markup Language): The new "digital language of business" supported and proposed by the American Institute of Certified Public Accountants, which allows the financial community to exchange and analyze a variety of financial reports. Still a work in progress.

XHTML (Extensible HyperText Markup Language): The "XML-ization of HTML"—essentially the "newest version" of HTML, which extends its functionality to support a wider range of devices and applications.

XLink: A package of hyperlinking functionality that comes in two parts. "XLink" governs how links are inserted into an XML document; "XPointer" determines the identifier that goes on a URL when linking to an XML document from somewhere else, such as another Web page. Formerly known as XLL (Extensible Linking Language).

XLL (Extensible Linking Language): The standard for describing links among objects in XML documents. (See XLink.)

XMI (XML Metadata Interchange): An open information interchange model intended to give developers working with object technology the ability to exchange programming data over the Internet in a standardized way, bringing consistency and compatibility to applications created in collaborative environments. XMI is intended to be either stored in a traditional file system or streamed across the Internet from a database or repository.

XML (Extensible Markup Language): A data format for structured document interchange that is more flexible than HTML. While HTML's tags are predefined, XML allows tags to be defined by the developer of the page. Thus, XML-defined Web pages can function like database records.

XML aware: Any software application that recognizes the XML data format and understands XML concepts. Often XML aware software contains an embedded XML parser.

XML declaration: An XML declaration is an optional declaration at the top of an XML document that specifies the version of XML and an encoding declaration.

XML dialect: Any "flavor" of XML defined by a DTD that is designed to support a specialized purpose, such as BIOML (BIOpolymer Markup Language), CML (Chemical Markup Language), MathML, CDF, TalkML (an experimental XML for voice browsers), XFRML, etc.

XML editors: Software that allows basic data/metadata editing functions and explicit control over XML markup. Products run the gamut from simple editors for small documents, such as Language Technology Group's XED, to more full-featured XML "word processors," such as Icon's XML Spy, Vervet Logic's XML Pro, and SoftQuad's XMetaL.

XML entities: Special sets of characters that help expand document content without increasing the overall character count. Internal entities act as typing shortcuts or macros; external entities incorporate content from outside sources into the main document.

XML namespace: A way of defining each element type and attribute name in an XML document unambiguously (through associations with specific URIs) so that two or more XML-based languages may be used in that document without creating a conflict.

XML processor: A software module that reads XML documents and provides access to their content and structure. The processor does its work on behalf of another module, called the application. The processor reads the XML data and provides the application with the information.

XML vocabulary: An XML vocabulary is an XML tag set with a specific functionality. SMIL, WIDL, MathML, and ICE are all examples of XML vocabularies.

XML-QL (XML Query Language): A query language for XML, which, like SQL, has a SELECT-WHERE construct and uses features of query languages developed for semi-structured data. XML-QL is a competing proposal to XPath, but is not likely to be adopted as a recommendation by the W3C.

XMOP (XML Metadata Object Persistence): A set of components that allows the interoperation between object technologies such as Java, Microsoft COM, and CORBA. This means that objects can be transported between different object systems (COM and Java) and different Java VMs (Microsoft's and Sun's).

XPath (XML Path Language): A way of referencing information within an XML document, intended as a bridge between XPointer and XSLT. XPath uses a directory notation to perform queries through the selectNodes architecture and lets you determine which elements within an XML document satisfy a given set of criteria.

XSL (Extensible Stylesheet Language): The style standard for XML. Like CSS, it specifies the presentation and appearance of an XML document.

XSLT (XSL Transformations): A language used to transform (reformat) XML documents into other XML documents. XSLT supports both push and pull transformations and is designed to be used independently of XSL; however, it is not intended to function as a general-purpose XML transformation language.