Create Your First XML Document
This is the
second of a series of articles on XML, designed as a tutorial. If you’re doing the exercises as you read,
by the time you finish this section you will have written a simple XML “family
tree” document. Good luck! - Dwight Baer
B. Create Your First XML Document
Appendix 2 – The Strengths and Limitations of SGML
Appendix 3 – The Advantages of XML Compared to HTML
Appendix 4 – XML and The World Wide Web Consortium
To benefit from this
tutorial, you should have a bit of an HTML background. You should also have read my Article #1,
“XML Alphabet Soup” unless you already have a good idea of what XML is.
A few minutes from now, you will have:
• Reviewed the highlights of Article 1, “XML Alphabet Soup”.
• Created a short XML document by pasting text from this article into Notepad and displaying it using Internet Explorer.
• Read as much as you care to of a number of appendices on topics such as “The Advantages of XML” and “XML vs. HTML”.
If you are a person who prefers to learn simply by doing, I’ve tried to meet your needs in Section B by separating out the practical steps you need to follow in order to create an XML document.
Actually, if all you want to do is create a simple XML document, this tutorial will be quite short. Most of this article consists of appendices (listed in the Table of Contents) that give you the background information you might want in order to fully understand the importance of XML in the world of Information Technology today. The Glossary is included in this article and will be referenced in coming articles.
Review of a Few Highlights of “XML Alphabet Soup”:
XML is, in fact,
“the next big thing” in computer technology.
Some people compare it to the invention of relational databases. I have even heard of XML being compared to
the invention of the PC!
Here are a few acronyms you probably already know:
• SGML – Standard Generalized Markup Language – the predecessor of XML.
• HTML – Hypertext Markup Language – a “distant cousin” of XML.
• XSL and XSLT – Extensible Stylesheet Language and XSL Transformations – Use XSLT to re-purpose XML documents.
• DTD – Document Type Definition. This is the “dictionary” of an XML document. When two people or applications both have access to a DTD, they can be certain of the meaning of the data contained in an XML document that uses that DTD.
• XML was created to address the limitations of SGML and HTML.
Before long, it is
predicted that technologies using XML will permeate the way we build
applications, as well as the way we store and communicate data.
An XML document can be created in many ways:
• You can use a normal text editor, for example Notepad. Don’t use a word processor, such as Word, unless you plan to save your document as a text (.txt) file. One of the main strengths of XML is that any XML document is readable by any text editor. (A normal Word document is only readable using Word.)
• You can use an XML editor, such as XML Spy. A 30-day free trial version of XML Spy is available at xmlspy.com. Recommendation: For a “lightweight” text editor, try Textpad.
• Computer applications (software tools) can output XML. For example, the latest version of Adobe’s FrameMaker (FrameMaker 7) is capable of producing XML.
• An XML document can also be created from another XML document, or from a database.
So, open your text
or XML editor, and type or copy/paste the following lines:
<?xml version="1.0" ?>
<family_root>
  <father>Michael Eksamel</father>
  <mother>Hannah Eksamel</mother>
  <child>Emily Eksamel</child>
</family_root>
Save your document
as “family1.xml”.
Now, open the
document in Internet Explorer.
Your output should look something like the following:
[Figure: family1.xml displayed in Internet Explorer as a tree of tags]
Try clicking on the
minus sign to the left of the <family_root> tag. Click again. This is Internet Explorer’s default style sheet, interpreting
your XML document. It’s allowing you to
collapse or expand the XML “paragraph” you’ve created.
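If you don't have Internet Explorer handy, you can get a similar tree view from a few lines of Python. This is only a sketch, and it is not part of the original exercise: it assumes Python's standard xml.etree.ElementTree module, and the function name outline is my own invention.

```python
import xml.etree.ElementTree as ET

# The same document you saved as family1.xml, held in a string here
# so the sketch runs without any files.
FAMILY_XML = """<?xml version="1.0" ?>
<family_root>
  <father>Michael Eksamel</father>
  <mother>Hannah Eksamel</mother>
  <child>Emily Eksamel</child>
</family_root>"""


def outline(xml_text):
    """Return one indented line per element: the tag name, then its text."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    lines = []

    def walk(element, depth):
        text = (element.text or "").strip()
        label = element.tag + (": " + text if text else "")
        lines.append("  " * depth + label)
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(outline(FAMILY_XML)))
```

If the document parses without an error, the parser agrees that it is a properly formed XML document.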
What do I mean by
“style sheet”? Well, in very simple
terms, one of XML’s main distinctions is that it separates the description of
the data from its format. XML describes
the data. In the example above, there
is no question that Michael Eksamel is the father in our family tree, because
the markup around his name says it clearly:
<father>. XML relies on
other technologies, such as Cascading Style Sheets (CSS) or Extensible
Stylesheet Language (XSL) to do the display.
Internet Explorer has been built with a default style sheet to display
your XML document as seen above.
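To make the idea of separating data from format a little more concrete, here is a minimal Python sketch that takes the same family data and presents it two different ways. It stands in for what a real style sheet (CSS or XSL) would do; the function names as_text and as_html are hypothetical, and the only library assumed is Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET
from html import escape

FAMILY_XML = """<?xml version="1.0" ?>
<family_root>
  <father>Michael Eksamel</father>
  <mother>Hannah Eksamel</mother>
  <child>Emily Eksamel</child>
</family_root>"""


def as_text(root):
    """Present each family member as a plain 'Role: Name' line."""
    return "\n".join(
        member.tag.capitalize() + ": " + (member.text or "").strip()
        for member in root
    )


def as_html(root):
    """Present the very same data as an HTML bulleted list."""
    items = "".join(
        "<li><b>" + escape(member.tag) + "</b>: "
        + escape((member.text or "").strip()) + "</li>"
        for member in root
    )
    return "<ul>" + items + "</ul>"


family = ET.fromstring(FAMILY_XML)

if __name__ == "__main__":
    print(as_text(family))
    print(as_html(family))
```

Notice that the XML itself never changes. Only the code that decides how to display it does.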
Now let’s clarify a
few things about this XML document, line by line:
1. <?xml version="1.0" ?> This is the declaration. In theory, according to the W3C Recommendation, the declaration is optional. However, there is no practical reason to leave it out. If present, it must be the first line of your XML document, with no white space or other characters above it.
2. <family_root> This is an opening tag. A tag begins with a left angle bracket and ends with a right angle bracket. The difference between an opening tag and a closing tag is that the closing tag has a forward slash immediately after the left angle bracket. (Notice the closing tag </family_root>.) Notice that every opening tag in this document has a corresponding closing tag.
The <family_root> opening tag, the </family_root> closing tag, and everything between them together make up an element.
<family_root> is also the root element. There must be exactly one root element in an XML document. Notice that I called this root element “family_root”, just to clarify that this is the root element of our family tree example. However, I could just as well have called it <family> or even <f>.
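The rules about the declaration and the root element can be tried out with a parser. The following is a sketch only, assuming Python's standard xml.etree.ElementTree module; the helper root_name and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" ?>'


def root_name(xml_text):
    """Return the root element's name, or None if the parser rejects the text."""
    try:
        return ET.fromstring(xml_text).tag
    except ET.ParseError:
        return None


samples = {
    # The declaration is the very first thing in the document: accepted.
    "declaration first": DECLARATION + "\n<family_root></family_root>",
    # The declaration is optional, so leaving it out is accepted too.
    "no declaration": "<family_root></family_root>",
    # White space above the declaration: rejected.
    "blank line above declaration": "\n" + DECLARATION + "\n<family_root></family_root>",
    # The root element can have any legal name.
    "renamed root": DECLARATION + "\n<f></f>",
    # More than one root element: rejected.
    "two root elements": "<family_root></family_root><family_root></family_root>",
    # No root element at all: rejected.
    "no root element": DECLARATION,
}

results = {label: root_name(text) for label, text in samples.items()}

if __name__ == "__main__":
    for label, name in results.items():
        print(label, "->", name if name is not None else "rejected")
```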
3. <father>,
<mother> and <child> tags – Notice that each of these “child”
tags is completely contained (opened and closed) within its “parent” tag,
<family_root>.
Every tag that is contained within another tag is called a “child” tag, whereas
the tag that contains the “child” tag is called the “parent” tag.
In this example, I have thoughtfully used the analogy of the family tree. I hope you don’t become confused by too many
analogies going on:
• In a family tree, we can visualize children being born from their parents. Similarly, in XML, every element that is contained within another element is called a “child” element. The containing element is called the “parent” element.
• From the analogy of the tree, we can visualize branches springing from the root of the tree. Similarly, in XML, there is one root element within which all other elements are contained.
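The parent and child relationships described above can be listed by a program. Here is a small sketch, assuming Python's standard xml.etree.ElementTree module. The function parent_names is a hypothetical helper, and it relies on every tag name in our little document being different.

```python
import xml.etree.ElementTree as ET

FAMILY_XML = """<?xml version="1.0" ?>
<family_root>
  <father>Michael Eksamel</father>
  <mother>Hannah Eksamel</mother>
  <child>Emily Eksamel</child>
</family_root>"""


def parent_names(root):
    """Map each contained element's tag name to the tag name of its parent."""
    return {
        child.tag: parent.tag
        for parent in root.iter()   # visit every element, root included
        for child in parent         # the elements directly inside it
    }


root = ET.fromstring(FAMILY_XML)
relations = parent_names(root)

if __name__ == "__main__":
    for child_tag, parent_tag in relations.items():
        print("<" + child_tag + "> is a child of <" + parent_tag + ">")
```

The root element never appears as a child, because nothing contains it.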
4. I am going to say a lot more about elements in my next article, entitled “Create a Well-Formed XML Document”. For now, though, I’d just like to list the rules for tag (or element) names:
• An XML tag consists of a string delimited by a less-than sign (<) and a greater-than sign (>).
• The start tag and the end tag define the beginning and the end of each XML element.
• The name of an XML tag must conform to the following rules:
  o A tag name must start with a letter (upper- or lower-case) or an underscore ( _ ).
  o A tag name may contain letters, numbers, underscores, dashes or periods.
  o Tag names are case sensitive. <Mother> and <mother> are two different XML tags.
  o A tag name must not begin with the letters XML, in any combination of upper and lower case.
• A tag name may contain the colon character, in theory, but this is not a good idea in practice. Why? Because of a special XML construct called a namespace, which uses the colon.
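The naming rules listed above can be written as a small checker. Treat this as a simplified sketch: it only considers unaccented English letters, whereas the XML Recommendation allows letters from many other alphabets, and the function check_tag_name is my own hypothetical helper rather than part of any XML library.

```python
import re

# Simplification: ASCII letters only. Real XML names may also use
# letters from other alphabets.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def check_tag_name(name):
    """Return a list of the rules that the proposed tag name breaks."""
    problems = []
    if ":" in name:
        problems.append("contains a colon, which is best left to namespaces")
    # Judge the remaining characters with any colons set aside.
    if not NAME_PATTERN.fullmatch(name.replace(":", "")):
        problems.append(
            "must start with a letter or underscore and contain only "
            "letters, numbers, underscores, dashes or periods"
        )
    if name.lower().startswith("xml"):
        problems.append("must not begin with the letters XML")
    return problems


if __name__ == "__main__":
    for candidate in ["family_root", "_f", "1child", "XmlData", "my:tag"]:
        print(candidate, "->", check_tag_name(candidate) or "OK")
```

Case sensitivity needs no special code: the strings "Mother" and "mother" are simply different names.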
Wrapping Up: So … Did you successfully create an XML document? Did it display in Internet Explorer? Let me know if you have any questions, or even if you don’t, I’d be happy to hear from you. Dwight Baer
What is MSXML?
MSXML is a set of
services that provide functions for working with XML documents. A primary use
of MSXML is to parse, generate, validate and transform XML documents so that
the information can be displayed, stored, or manipulated.
How can you make your Internet Explorer browser become an XML parser and validator? The short answer is: Download Internet Explorer 6, then download MSXML Core Services 4 from the Microsoft site.
Procedure to install MSXML Core Services 4 (I’m taking you by a somewhat indirect route so you get to read the background information):
Click on the “MSXML Core Services 4” link above to go to
the “What's New in the October 2001 Microsoft XML Core Services (MSXML) 4.0
Release” page.
Click on “MSDN Downloads”.
Click on “Installer”.
Click Yes to accept the terms.
Save the file on your hard disk.
Run the executable you just downloaded.
Note 1: For an easy-to-follow procedure to install the MSXML parser including the download files, visit the Front-Runner site. Note: Front-Runner doesn’t necessarily promise you the latest version of the parser, but it does work with Internet Explorer 5.5 and 6.0.
Note 2: You must uninstall previous versions of MSXML before you install MSXML 4.
Note 3: You must have Windows Installer 2.0 in order for this install to work. If you need to install it, here is the overview, the download for Windows 98 and ME and the download for Windows NT and 2000.
Follow these steps to verify the version of the Windows Installer that is on your computer:
• Locate Msi.dll on your computer. By default, this file is in the \Windows\System folder on Windows 95 and Windows 98 computers, or the \Winnt\System32 folder on Windows NT 4.0 computers.
• Right-click Msi.dll and then click Properties.
• Click the Version tab.
Note 4: In the end, if you’re not sure what version of MSXML you’re using, check out Microsoft Knowledge Base article Q296647. You will need to download “filemon.exe”, which is a very interesting utility to help you determine file dependencies.
Here’s a mystery to solve: After I installed the latest (as of April, 2002) MSXML4 Core Services, I found that MSXML4.dll was indeed installed on my system, but there was a problem. I tried this on both Windows 98 and Windows 2000 Server, and on both systems filemon.exe (described above) reported MSXML3.dll as the file being used whenever I opened an XML file in Internet Explorer. I had tried filemon.exe on an earlier occasion, and at that time Internet Explorer was indeed using MSXML4.dll, but it seems to insist on using MSXML3.dll with the latest versions of IE and MSXML Core Services. Please e-mail me to let me know if you find an answer to this question.
Microsoft has been a strong supporter of XML since even before it became “official” in 1998. However, the extent of XML support in Internet Explorer has been a story of leapfrog. Each iteration of Internet Explorer included support for an older version of the XML parser, MSXML. As a result, even Internet Explorer 5.5 has “incomplete” XML support, compared to our current requirements. For most people, all you need to know is the short answer given above (download Internet Explorer 6 and MSXML Core Services 4).
Appendix 2 – The Strengths and Limitations of SGML
SGML (Standard Generalized Markup Language) became an ISO standard in 1986 and, like XML, allows developers to create their own markup languages. SGML is like a father to HTML, and more like a big brother to XML. It offers a high degree of control over the structure of a document.
Some of the strengths of SGML, all of which are also strengths of XML:
• It is platform independent. You are not tied to using Unix, Windows, MacOS, or any other operating system.
• It uses a text-based markup, i.e. tags, which are readable by any standard text editor.
• There is no pre-defined set of markup instructions. SGML is a meta-language, which allows you to create your own markup language.
• SGML separates appearance and content. By doing this, the information marked up using SGML is reusable. One document can be re-formatted for many different outputs.
The limitations of SGML:
• SGML is too powerful and too big to be used in a Web browser.
• It is so generalized and the specification is so difficult to implement that SGML applications are prohibitively expensive to build and maintain.
• There is a joke that the acronym “SGML” stands for “Sounds Good, Maybe Later”!
• XML was written specifically in order to be a pared-down, lightweight version of SGML.
Appendix 3 – The Advantages of XML Compared to HTML
XML | HTML
You can sort information in the browser, without accessing a database | Inability to sort on the fly
You can separate parts of a document or pull parts of it from external sources | Little ability to separate portions of a document
XSL, XSLT and CSS are used for formatting | No capabilities for advanced formatting (e.g. chemical symbols)
Content- and structure-based | Display- or format-based
Unlimited types of tags, created and defined by the developer of the XML document | Limited set of tags, defined by W3C, implemented haphazardly by browser creators
Strict rules governing syntax | Lax syntax rules
Hierarchical structure in which each element is nested properly | Proper nesting is not required
Attributes must be surrounded by quotes | Quoted attributes are not required
Entities must be declared in a DTD | DTDs are not part of the HTML specification. (HTML itself could be seen as a DTD.)
Tags are case-sensitive | Tags are not case-sensitive
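Several of the “strict syntax” rows in this comparison can be demonstrated with an XML parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper is_well_formed and the sample fragments (including the age attribute) are invented for illustration. A typical HTML browser would quietly tolerate the kinds of sloppiness that the XML parser rejects here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


fragments = {
    # Tags are case-sensitive: <Mother> is not closed by </mother>.
    "mismatched case": "<Mother>Hannah Eksamel</mother>",
    # Attribute values must be surrounded by quotes.
    "unquoted attribute": "<child age=5>Emily Eksamel</child>",
    "quoted attribute": '<child age="5">Emily Eksamel</child>',
    # Elements must be nested properly.
    "improper nesting": "<family><child>Emily</family></child>",
    "proper nesting": "<family><child>Emily</child></family>",
}

verdicts = {label: is_well_formed(text) for label, text in fragments.items()}

if __name__ == "__main__":
    for label, accepted in verdicts.items():
        print(label, "->", "accepted" if accepted else "rejected")
```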
Appendix 4 – XML and The World Wide Web Consortium
The World Wide Web Consortium, otherwise known as the W3C, is the organization that created the XML Recommendation, which has enjoyed near-universal acceptance since its first publication in 1998.
The W3C was established in 1994 as
a result of the effort of Tim
Berners-Lee, who in 1990 had written HTML (Hypertext Markup Language). The W3C has published more than 35 Recommendations since its
inception. Each Recommendation not only
builds on the previous one, but is designed so that it may be integrated with
future specifications as well. The W3C is transforming the architecture of the
initial Web (essentially HTML, URIs, and HTTP) into the architecture of
tomorrow's Web, built atop the solid foundation provided by XML.
Here is a diagrammatic representation of the relationship of XML to a few of the other W3C recommendations:
Source: About the World Wide Web
Consortium.
The W3C's mission is to lead the Web to its full
potential. Some of its operating principles are: Universal
access, trust, interoperability, and decentralization. See W3C in Seven Points.
Glossary
Sources:
http://www.xmlmag.com/upload/free/features/xml/1999/01win99/glwin99/glwin99.asp and
http://www.gca.org/whats_xml/whats_xml_glossary.htm
attribute: An attribute is a property of an element. Attributes are often used to pass information about the element, and hence can be said to provide metadata for the element. The attribute name, a value indicator (=), and the attribute value are specified within an XML start tag (e.g. <a href="http://www.w3.org/TR">).
authXML: AuthXML is a vendor-neutral standard that enables integration of Web security, network security, B2B infrastructures and applications. AuthXML is named as such because it comprises 2 primary components: Authentication and Authorization and is designed to ease integration of transactions between trading partner sites that may be using different security systems and within a given site that may be deploying multiple applications that need integrated security.
BizTalk: An industry initiative started by Microsoft and supported by a wide range of organizations. BizTalk is a community of standards users, whose goal is the adoption of XML in electronic commerce and application integration through the BizTalk Framework, a set of guidelines for how to publish schemas in XML.
character: A character is an atomic unit of text as specified by ISO/IEC 10646. A character is a single letter, digit, or punctuation mark.
character data: All text characters that are not markup characters make up the character data of the document.
child element: An element contained within another element is known as a child element. The element containing other elements is known as the parent element.
CSS (Cascading Style Sheets): A means of defining certain document elements (paragraphs, headings, fonts, colors, positioning, backgrounds) with style rules instead of additional markup tags.
declaration: XML documents should begin with an XML declaration which specifies the version of XML being used.
DOCTYPE (document type declaration): The declaration that associates an XML document with a Document Type Definition (DTD). The purpose of a DTD is to define the legal building blocks of an XML document. It defines the legal structure with a list of legal elements. A DTD can be declared inline in your XML document, or as an external reference.
DOM (Document Object Model): A platform- and language-neutral interface that allows scripts and programs to access and update dynamically the content, structure, and style of documents. It provides a standard set of objects for representing HTML and XML documents, a model of how these objects can be combined, and an interface for accessing and manipulating them.
delimiter: A delimiter is a special character that marks the beginning and end of a string or text field.
document: A document is a class of data object. A document in XML may be the text of a printed work. It may also be a set of database records.
DOM: Document Object Model. See W3C Standards.
DTD (Document Type Definition): The rules that define the tags that can be used in an XML file and their valid values.
EDI (Electronic Data Interchange): The electronic communication of business transactions between organizations. XML complements EDI because it can be used to exchange e-commerce information.
element: An element is a logical data structure within an XML document. In XML, start tags and end tags show the beginning and end of an element.
empty element: Not all elements have content. Those elements that do not have content are empty elements and in XML may be noted with a special empty element tag that ends with a slash directly preceding the closing angle bracket of the tag (<empty/>).
entity: An entity in XML is a virtual storage unit. An XML entity is often a separate file, but may be a string or even a database record.
generic identifier: A generic identifier, often called the "GI" is the XML tag name. So <head> has a generic identifier equal to "head". A generic identifier is unique in its namespace.
gXML (Guideline XML): A file structure supported by EDI software company Edifecs Commerce that allows the open exchange of electronic commerce guidelines.
HTML (HyperText Markup Language): A nonproprietary methodology for creating Web pages. HTML defines the page layout, fonts, graphic elements, and hypertext links to other Web documents by embedding tags (codes) within the text.
IETF (Internet Engineering Task Force): An organization of working groups that identifies problems and proposes technical solutions for the Internet. They publish XML-related RFCs (Requests for Comments) and specifications.
LT XML: An integrated set of C++ and Java-based XML tools from the Language Technology Group for processing XML documents.
Layman-Bray: A proposal for XML namespaces (groups of names defined according to some naming convention) that ensures that names remain unambiguous even if chosen by more than one author.
markup characters: A markup character is a text character that identifies the storage and logical structures of the data. Tags and entities are markup characters of an XML document.
MathML (Mathematical Markup Language): An XML methodology for describing mathematical notations on the Web, just as HTML does for ordinary text.
Metadata: Data that describes other data. Metadata about an XML document is described in the DTD or in the XML document itself, enabling other applications to interact with it.
Metalanguage: A language that describes other languages. SGML and XML can be considered metalanguages because they define markup languages.
namespace: A namespace is a set of unique identifiers.
notation: Notations identify by name the format of unparsed entities, the format of elements which bear a notation attribute, or the application to which a processing instruction is addressed. Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation. Source: http://www.w3.org/TR/REC-xml#dt-notation
OASIS (Organization for Advancement of Structured Information Systems): A consortium of companies and individuals that collects and publishes XML specifications, DTDs, and schemas. By standardizing specifications, OASIS hopes to advance the open interchange of documents and structured information objects.
OSD (Open Software Description) Format: An XML-based specification designed by Microsoft and Marimba to automate software distribution. OSD uses unique XML tags to describe software packages.
parent element: An element containing other elements is known as the parent element. The elements contained within the parent element are known as child elements.
parser, XML: An XML parser is a processor that reads an XML document and determines the structure and properties of the data. If the parser goes beyond the XML rules for well-formedness and validates the document against an XML DTD, the parser is said to be a "validating" parser.
processing instruction (PI): Processing instructions (PIs) allow documents to contain instructions for applications. PIs are not part of the document's character data, but must be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardization in this or future versions of the XML specification. The XML Notation mechanism may be used for formal declaration of PI targets. Parameter entity references are not recognized within processing instructions.
RDF (Resource Description Framework): A model for describing and interchanging metadata. It allows a Web site to describe its dynamic (user-created) content without having to store static pages that contain that content.
RFC (Request for Comments): A document used by the IETF to describe the specifications for a recommended technology.
root element: Every XML document has one element that contains all other elements of the document. The root element is also called the document element.
Schema: A system of representing a data model that defines the data's elements and attributes, and the relationship among elements.
SGML (Standard Generalized Markup Language): The "mother of all markup languages," it's a metalanguage used to construct other markup languages. XML is designed to be "an extremely simple dialect of SGML" (per the W3C XML specs) for the Web.
SMIL (Synchronized Multimedia Integration Language): A language designed to integrate multimedia objects into a synchronized presentation.
SVG (Scalable Vector Graphics): A vector graphics language written in XML. Using SVG, graphics can be coded directly into an XML document. Benefits of SVG include:
• smaller file size than regular bitmapped graphics such as GIF and JPEG files
• resolution independence, so that the image can scale down or up to fit proportionally into any size display on any type of Net device
• text labels and descriptions for searchability
• ability to link to parts of an image
• complex animation
Stylesheet: A stylesheet is a rule or sequence of rules that affect the appearance and/or structure of a document.
Tags: Tags are text structures that mark the beginning and end of elements within the XML document. Tags are markup characters.
Unicode: A superset of the ASCII character set, this character encoding standard (originally designed around 16-bit codes) includes not only the standard Roman and Greek alphabets, but also mathematical symbols, special punctuation, and non-Roman character sets (Hebrew, Chinese, etc.).
URI (Uniform Resource Identifier): The general addressing scheme of which URLs (Uniform Resource Locators) are one kind. Technically, addresses beginning with http:// and ftp:// are specific subsets of URIs.
valid: An XML document with an associated document type declaration that follows all the rules of that declaration is valid.
well-formed: A well-formed XML document follows all the rules of the XML specification but is not necessarily valid according to an associated document type declaration.
XFRML (Extensible Financial Reporting Markup Language): The new "digital language of business" supported and proposed by the American Institute of Certified Public Accountants, which allows the financial community to exchange and analyze a variety of financial reports. Still a work in progress.
XHTML (Extensible HyperText Markup Language): The "XML-ization of HTML"—essentially the "newest version" of HTML, which extends its functionality to support a wider range of devices and applications.
XLink: A package of hyperlinking functionality that comes in two parts. "XLink" governs how links are inserted into an XML document; "XPointer" determines the identifier that goes on a URL when linking to an XML document from somewhere else, such as another Web page. Formerly known as XLL (Extensible Linking Language).
XLL (Extensible Linking Language): The standard for describing links among objects in XML documents. (See XLink.)
XMI (XML Metadata Interchange): An open information interchange model intended to give developers working with object technology the ability to exchange programming data over the Internet in a standardized way, bringing consistency and compatibility to applications created in collaborative environments. XMI is intended to be either stored in a traditional file system or streamed across the Internet from a database or repository.
XML (Extensible Markup Language): A data format for structured document interchange that is more flexible than HTML. While HTML's tags are predefined, XML allows tags to be defined by the developer of the page. Thus, XML-defined Web pages can function like database records.
XML aware: Any software application that recognizes the XML data format and understands XML concepts. Often XML aware software contains an embedded XML parser.
XML declaration: An XML declaration is an optional declaration at the top of an XML document that specifies the version of XML and an encoding declaration.
XML dialect: Any "flavor" of XML defined by a DTD that is designed to support a specialized purpose, such as BIOML (BIOpolymer Markup Language), CML (Chemical Markup Language), MathML, CDF, TalkML (an experimental XML for voice browsers), XFRML, etc.
XML editors: Software that allows basic data/metadata editing functions and explicit control over XML markup. Products run the gamut from simple editors for small documents, such as Language Technology Group's XED, to more full-featured XML "word processors," such as Icon's XML Spy, Vervet Logic's XML Pro, and SoftQuad's XMetaL.
XML entities: Special sets of characters that help expand document content without increasing the overall character count. Internal entities act as typing shortcuts or macros; external entities incorporate content from outside sources into the main document.
XML namespace: A way of defining each element type and attribute name in an XML document unambiguously (through associations with specific URIs) so that two or more XML-based languages may be used in that document without creating a conflict.
XML processor: A software module that reads XML documents and provides access to their content and structure. The processor does its work on behalf of another module, called the application. The processor reads the XML data and provides the application with the information.
XML vocabulary: An XML vocabulary is an XML tag set with a specific functionality. SMIL, WIDL, MathML, and ICE are all examples of XML vocabularies.
XML-QL (XML Query Language): A query language for XML, which, like SQL, has a SELECT-WHERE construct and uses features of query languages developed for semi-structured data. XML-QL is a competing proposal to XPath, but is not likely to be adopted as a recommendation by the W3C.
XMOP (XML Metadata Object Persistence): A set of components that allows the interoperation between object technologies such as Java, Microsoft COM, and CORBA. This means that objects can be transported between different object systems (COM and Java) and different Java VMs (Microsoft's and Sun's).
XPath (XML Path Language): A way of referencing information within an XML document, intended as a bridge between XPointer and XSLT. XPath uses a directory notation to perform queries through the selectNodes architecture and lets you determine which elements within an XML document satisfy a given set of criteria.
XSL (Extensible Stylesheet Language): The style standard for XML. Like CSS, it specifies the presentation and appearance of an XML document.
XSLT (XSL Transformations): A language used to transform (reformat) XML documents into other XML documents. XSLT supports both push and pull transformations and is designed to be used independently of XSL; however, it is not intended to function as a general-purpose XML transformation language.
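A few of the glossary terms (attribute, empty element, XML namespace and XPath) are easier to grasp when you see them at work in one small document. The following is only a sketch, assuming Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The born attributes, the note:private element and the namespace URI are all hypothetical additions to our family tree.

```python
import xml.etree.ElementTree as ET

NOTES_NS = "http://example.com/notes"  # hypothetical namespace URI

SAMPLE = """<family_root xmlns:note="http://example.com/notes">
  <father born="1960">Michael Eksamel</father>
  <child born="1990">Emily Eksamel</child>
  <note:private/>
</family_root>"""

root = ET.fromstring(SAMPLE)

# attribute: a name="value" pair inside the start tag.
father_born = root.find("father").get("born")

# empty element: <note:private/> has no text and no child elements.
# XML namespace: the parser replaces the "note:" prefix with the full URI,
# written in braces in front of the local name.
private = root.find("{" + NOTES_NS + "}private")
private_is_empty = private.text is None and len(private) == 0

# XPath: select only the <child> elements whose born attribute is 1990.
born_1990 = [element.text for element in root.findall("./child[@born='1990']")]

if __name__ == "__main__":
    print("father born:", father_born)
    print("expanded name:", private.tag)
    print("empty element:", private_is_empty)
    print("born in 1990:", born_1990)
```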