XML notes and tips

One way to look at XML is that it's a replacement for delimited data. With the advent of inexpensive storage and faster data connections, there is no longer a general need to conserve space in the transmission and storage of data.

An oversimplified example to illustrate some of the advantages of XML

A tab delimited data file may look like this:
1    2003    12.20    4
2    2004    12.10    3
3    2003    111.10    6

The data is compact: a single tab character delimits the fields, and a line feed separates each record.  In the past, when storage was expensive and transmission rates were slow, it was hard to justify the extra space needed to describe the data being sent or stored.

With XML it is now possible to formally include much more information, both inside the data and external to it.  With delimited data, a prior arrangement must be made between the sending and receiving programs and their programmers.  Most commonly, someone produces a human readable specification, and the programmers use it to decide how to read and write the data.  There is no global standard on how to do this.
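
To make the "prior arrangement" concrete, here is a minimal Python sketch of a reader for the tab delimited file above.  The column names in FIELDS are assumptions made for illustration; the file itself carries none, so they can only live in the program (or in a specification handed to the programmer).

```python
import csv
import io

# The meaning of each column is NOT in the data. This tuple is the
# "prior arrangement"; the names are assumed for illustration only.
FIELDS = ("transaction_id", "account_id", "amount", "employee_id")

# The three sample records from above, with real tab characters.
SAMPLE = "1\t2003\t12.20\t4\n2\t2004\t12.10\t3\n3\t2003\t111.10\t6\n"

def read_records(text):
    """Return one dict per line, pairing each value with its agreed-upon name."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader]

records = read_records(SAMPLE)
print(records[0]["amount"])  # prints 12.20
```

If the sender ever reorders or adds a column, nothing in the data says so; the reader keeps running and quietly mislabels the values.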

One advantage of XML is a program readable data specification called a DTD (Document Type Definition).  This is usually stored outside the data, but the data can have a pointer to it.  It is possible to write a program that uses the DTD to make good guesses as to what goes in the fields and how the fields should be arranged.  The DTD is used in the reading and writing of valid XML documents, that is, documents that are well formed and also follow the DTD.  It can be read by humans, but is more likely to be read by a general purpose program.

Another advantage is that a human has a chance at reading and understanding the data, since every item is either a tag or surrounded by tags.
<pettycash>
<transaction id="1"><employee id="4">Joe</employee><account id="2003">Food</account><amount dem="usd" qty="12.20">$12.20</amount></transaction>
<transaction id="2"><employee id="3">Jane</employee><account id="2004">Supplies</account><amount dem="usd" qty="12.10">$12.10</amount></transaction>
<transaction id="3"><employee id="6">Jim</employee><account id="2003">Food</account><amount dem="usd" qty="111.10">$111.10</amount></transaction>
</pettycash>

The same data with the tags hidden (for human consumption):
Joe     Food        $12.20
Jane    Supplies    $12.10
Jim     Food        $111.10
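
The "tags hidden" view can be produced by a general purpose parser.  Below is a small sketch using Python's standard xml.etree.ElementTree module.  The sample is a copy of the petty cash data, written so that every tag is closed; the function name visible_text is made up for this example.

```python
import xml.etree.ElementTree as ET

PETTYCASH = """<pettycash>
<transaction id="1"><employee id="4">Joe</employee><account id="2003">Food</account><amount dem="usd" qty="12.20">$12.20</amount></transaction>
<transaction id="2"><employee id="3">Jane</employee><account id="2004">Supplies</account><amount dem="usd" qty="12.10">$12.10</amount></transaction>
<transaction id="3"><employee id="6">Jim</employee><account id="2003">Food</account><amount dem="usd" qty="111.10">$111.10</amount></transaction>
</pettycash>"""

def visible_text(transaction):
    """Collect only the human readable text of each tag inside a transaction."""
    return [child.text for child in transaction]

root = ET.fromstring(PETTYCASH)
rows = [visible_text(t) for t in root.findall("transaction")]
for row in rows:
    print("\t".join(row))
```

The program never needs to know what the columns mean in order to display them; the tag names and attributes are still there for any program that does.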

Tags can also let you specify exactly what an item is, so it can be converted to other locales or units.  The amount tag, for example, names the currency in its dem attribute and carries the quantity as a plain number in qty, which is enough for a program to convert the amount to any other currency once it has an exchange rate.  You may even want to add a date attribute or tag so the currency conversions can be done based on time.  You may also want to know when these transactions happened.
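
As a rough sketch of such a conversion, the Python below reads only the dem and qty attributes and never touches the "$12.20" display text.  The exchange rates are invented for illustration, and the function name convert is hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical rates (units of each currency per 1 USD), for illustration only.
RATES_PER_USD = {"usd": Decimal("1"), "eur": Decimal("0.80")}

def convert(amount_tag, target):
    """Convert an amount tag to the target currency using its attributes."""
    source = amount_tag.get("dem")          # currency of the stored quantity
    qty = Decimal(amount_tag.get("qty"))    # plain number, no symbol to strip
    in_usd = qty / RATES_PER_USD[source]
    return (in_usd * RATES_PER_USD[target]).quantize(Decimal("0.01"))

amount = ET.fromstring('<amount dem="usd" qty="12.20">$12.20</amount>')
print(convert(amount, "eur"))  # prints 9.76 with the made-up rate above
```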

Yet another of many advantages is that the data order can be flexible.  Unless the DTD imposes one, there is no need for the items inside of the transaction tag to be in any particular order.
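
A short Python sketch of why order does not matter to the reader: each item is looked up by tag name, not by position.  The two sample strings and the function name summarize are made up for this example.

```python
import xml.etree.ElementTree as ET

IN_ORDER = ('<transaction id="1"><employee id="4">Joe</employee>'
            '<account id="2003">Food</account>'
            '<amount dem="usd" qty="12.20">$12.20</amount></transaction>')

# Same items, different order inside the transaction tag.
SHUFFLED = ('<transaction id="1"><amount dem="usd" qty="12.20">$12.20</amount>'
            '<employee id="4">Joe</employee>'
            '<account id="2003">Food</account></transaction>')

def summarize(xml_text):
    """Find each child by tag name, so its position is irrelevant."""
    t = ET.fromstring(xml_text)
    return {
        "employee": t.find("employee").text,
        "account": t.find("account").get("id"),
        "qty": t.find("amount").get("qty"),
    }

print(summarize(IN_ORDER) == summarize(SHUFFLED))  # prints True
```

Compare this with the tab delimited file, where swapping two columns changes the meaning of every record.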

Before I go on, let me explain the anatomy of an XML document in simple terms

  • An XML document is a text document where human readable information is encapsulated inside of tags.
  • It is possible for tags to be empty, or not to contain any human readable content.
  • It is possible for tags to contain more tags.

A tag is surrounded by the less-than and greater-than symbols (< and >).  Within a tag you must have a name, followed by any number of optional attributes.

For example, in <firstname>Jim</firstname> the tag name is 'firstname' and it contains the human readable text 'Jim'.  <img src="tick.jpg" type="image/jpeg" /> is an example of a tag that does not have any text; it is named 'img' and has the 'src' attribute set to 'tick.jpg' and the 'type' attribute set to 'image/jpeg'.  The external specification, or DTD, can not only define the tags but can also contain the default attributes of any tag.  These attributes don't need to be given in the XML document when they are to take the default value.
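
The name, text and attributes described above are exactly what a parser hands back.  A minimal sketch with Python's standard xml.etree.ElementTree module (the variable names are only for this example):

```python
import xml.etree.ElementTree as ET

name = ET.fromstring("<firstname>Jim</firstname>")
img = ET.fromstring('<img src="tick.jpg" type="image/jpeg" />')

# A tag with text and no attributes.
print(name.tag, name.text, name.attrib)   # prints: firstname Jim {}

# A tag with attributes and no text.
print(img.tag, img.text, img.attrib["src"], img.attrib["type"])
# prints: img None tick.jpg image/jpeg
```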



The most common XML design mistake I see is creating the need to re-parse the data in an XML document

That is, once you have the output of an XML parser, you need to write more code to interpret the text within a tag.

An example of this is a link tag that looks like this:

<link>http://www.jazd.com/blog?page=1</link>

With the above example, where the URL is in the text portion of the tag, you are forced to detect the 'http', or pull it out of the string of text, to get the protocol.

It would be better to use <link href="http://www.jazd.com/blog?page=1"/> instead.  Or you can get carried away and use <link protocol="http" host="www.jazd.com" path="/blog" parameters="page=1"/>

With the second form, any good XML parser will give you the protocol and each of the other attributes directly, with no need for further parsing.
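
The difference shows up in the reading program.  In this Python sketch, the text style link needs a second parser (urlsplit from the standard library stands in for whatever re-parsing code you would otherwise write), while the attribute style link does not.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

text_style = ET.fromstring("<link>http://www.jazd.com/blog?page=1</link>")
attr_style = ET.fromstring(
    '<link protocol="http" host="www.jazd.com" path="/blog" parameters="page=1"/>'
)

# Text style: the XML parser returns one opaque string, so the protocol
# has to be dug out of it by a second parsing step.
protocol_from_text = urlsplit(text_style.text).scheme

# Attribute style: the XML parser has already separated the pieces.
protocol_from_attr = attr_style.get("protocol")

print(protocol_from_text, protocol_from_attr)  # prints: http http
```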

In some cases it may make sense to include actual data in the text portion of the tag.  For example, the date-time tag <datetime epoc="1165005271">12/01/2006 12:34:31</datetime>.  But even here I could not resist the temptation to use a standard time as an attribute.  The text portion of the tag is for human consumption only and is ignored by a reading program.  Without the epoc attribute, a more complicated date parsing program is required.

In the petty cash example above, I would add the epoc attribute to both the transaction and amount tags.  This allows a fixed date for currency conversion while also tracking the individual transaction times.
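
Here is a sketch of how a reading program might use the datetime tag shown above.  The epoc value is read as seconds since 1970-01-01 UTC.  The UTC-8 offset used to reproduce the display text is an assumption; the text itself never states a time zone, which is one more reason not to parse it.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

tag = ET.fromstring('<datetime epoc="1165005271">12/01/2006 12:34:31</datetime>')

# One int() call replaces a date parser: no format or time zone to guess.
moment = datetime.fromtimestamp(int(tag.get("epoc")), tz=timezone.utc)
print(moment.isoformat())  # prints 2006-12-01T20:34:31+00:00

# The display text only matches under an assumed UTC-8 offset.
local = moment.astimezone(timezone(timedelta(hours=-8)))
print(local.strftime("%m/%d/%Y %H:%M:%S"))  # prints 12/01/2006 12:34:31
```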



Another mistake I see is encapsulating tags that are unneeded

<feed>
  <channel>
    <title/>
    <link/>
    <item><title/></item>
    <item><title/></item>
  </channel>
</feed>

If only one channel is allowed, then why not do this instead:

<feed>
  <title/>
  <link/>
  <item><title/></item>
  <item><title/></item>
</feed>

Though descriptive, the channel tag is not needed and can even be confusing when only one channel is actually allowed.
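
For the reading program, the extra tag is just one more level to walk through.  In this Python sketch the title text ("Blog", "First", "Second") is placeholder content added so there is something to print, and the function name item_titles is made up for the example.

```python
import xml.etree.ElementTree as ET

NESTED = ("<feed><channel><title>Blog</title><link/>"
          "<item><title>First</title></item>"
          "<item><title>Second</title></item></channel></feed>")

FLAT = ("<feed><title>Blog</title><link/>"
        "<item><title>First</title></item>"
        "<item><title>Second</title></item></feed>")

def item_titles(feed, path):
    """Return the title text of every item found at the given path."""
    return [item.findtext("title") for item in feed.findall(path)]

# The nested layout makes every lookup carry the extra "channel" step.
nested_titles = item_titles(ET.fromstring(NESTED), "channel/item")
flat_titles = item_titles(ET.fromstring(FLAT), "item")

print(nested_titles == flat_titles)  # prints True
```

Both layouts carry the same information; the flat one simply has less for the reader, human or program, to get wrong.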