Monday, June 23, 2008

Is XML worth it?

Every now and then I meet people who hate XML (some of my lab mates are also in the 'XML sucks' crowd). As someone interested in XML-related technologies, I thought I would look into this issue and see whether XML really sucks or whether it's just that you cannot satisfy everybody.

One criticism is that XML is slow to parse compared to flat files such as CSV. The claim about speed is true. However, the comparison is flawed; it is like comparing apples with oranges. XML provides a parsable definition of a document's structure, while CSV and similar approaches do not. We need to recognize that this benefit comes at an extra cost!

XML-sucks web sites deliver the message that XML is hard to deal with in code. IMO, this was true in the early days, about a decade ago when XML was invented and there were only a few libraries and tools to work with. Now the situation has changed; there are many libraries (both DOM and SAX) for various languages, which are not that hard to master once you spend some time learning the basics.
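
To give a feel for the two styles, here is a minimal sketch using Python's standard library (xml.dom.minidom for DOM and xml.sax for SAX). The sample document and the TitleHandler class are made up for illustration; other languages offer similar libraries with different names.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical sample document, invented for this illustration.
DOC = "<library><book id='1'>XML Basics</book><book id='2'>SAX in Depth</book></library>"

# DOM style: load the whole tree into memory, then query it.
dom = xml.dom.minidom.parseString(DOC)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("book")]


# SAX style: react to events while the parser streams through the document.
class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <book> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_book = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "book":
            self._in_book = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_book:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "book":
            self.titles.append("".join(self._buffer))
            self._in_book = False


handler = TitleHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_titles = handler.titles

print(dom_titles)
print(sax_titles)
```

Both approaches yield the same two titles. DOM is usually the easier one to start with, while SAX tends to be preferred for documents too large to hold in memory.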

Another criticism is that XML is verbose (it uses more bandwidth than required to do the job). In the multimedia age, I don't think the added markup is going to make a huge difference in bandwidth usage (provided that we use namespaces judiciously). Further, compression techniques for XML are already available that help reduce the message size. If you are really concerned about message sizes, and you don't need the interoperability or any of the other cool things about XML, I suggest you don't use XML. In my opinion, the gain we get is well worth the price we pay for the additional bandwidth, which is after all textual information.

I agree that while XML itself is quite simple, it gets complicated with the proliferation of related technologies. However, complexity is partly under our control as we decide what we want to use along with XML. The trick is to use only those technologies that are absolutely necessary and have matured to become standards.

XML is for hierarchically structured data and can be overkill for data with no hierarchy. For example, if a property-value file (like in Java) is sufficient for the job, you will gain nothing by using XML and will lose on performance. XML-sucks groups take data of this flat type, compare its XML form with the available alternatives, and give a false impression. If your data is not hierarchically structured and you have no requirement to use a specific document format, I would go for a flat file format.
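
A rough sketch of the distinction, in Python: flat settings fit a property-value format that a few lines of code can read, while nested, repeated data is where XML starts to earn its keep. The sample data and the parse_properties helper are invented for illustration and are not a full implementation of Java's properties format.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat settings: no nesting, so key=value lines are enough.
FLAT = "host=localhost\nport=8080\n"


def parse_properties(text):
    """Very small key=value reader (illustrative, not Java-compatible)."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


flat_settings = parse_properties(FLAT)

# Hypothetical hierarchical data: servers that each own a list of ports.
NESTED = (
    "<servers>"
    "<server name='web'><port>8080</port><port>8443</port></server>"
    "<server name='db'><port>5432</port></server>"
    "</servers>"
)

ports_by_server = {
    server.get("name"): [int(port.text) for port in server.findall("port")]
    for server in ET.fromstring(NESTED).findall("server")
}

print(flat_settings)
print(ports_by_server)
```

Wrapping the first data set in XML would add markup without adding information; squeezing the second into key=value lines would need an ad hoc naming convention to express the nesting.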

Some of the cool features of XML that make it stand out from other approaches:

Addresses the internationalization (i18n for short) issue
If you don't know what encoding the data is in, the data is virtually useless. With the global economy becoming one single market, internationalization (which allows software to adapt to various languages and regions without engineering changes) of data becomes necessary. So, the encoding becomes an important factor. An XML document knows what encoding it's in: it can name its encoding in the XML declaration, and it is treated as UTF-8 or UTF-16 otherwise. There is no ambiguity, because an XML document is a sequence of characters; these characters are assigned code points and then encoded in a specific character encoding. The encoding decides the number of bytes required for each character (for example, UTF-8 uses 1 to 4 bytes per character, with 1 byte for ASCII). Every conforming XML parser must support UTF-8 and UTF-16, so a document in either encoding can be read no matter what software you use, which makes XML ideal for data exchange.
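
As a small illustration in Python (standard library only), the sketch below stores the same word in two encodings, each named in the XML declaration, and lets the parser recover identical characters from different bytes. The sample text and the serialize helper are hypothetical.

```python
import xml.etree.ElementTree as ET

text = "caf\u00e9"  # hypothetical sample word containing a non-ASCII letter


def serialize(value, encoding):
    """Build a tiny XML document whose declaration names its own encoding."""
    doc = '<?xml version="1.0" encoding="%s"?><word>%s</word>' % (encoding, value)
    return doc.encode(encoding)


latin1_bytes = serialize(text, "iso-8859-1")
utf8_bytes = serialize(text, "utf-8")

# The bytes differ, but the parser reads the declaration and decodes
# both documents to the same characters.
decoded = [ET.fromstring(data).text for data in (latin1_bytes, utf8_bytes)]

# UTF-8 is variable-width: one to four bytes depending on the character.
utf8_widths = {ch: len(ch.encode("utf-8")) for ch in "a\u00e9\u4e2d\U0001F600"}

print(latin1_bytes != utf8_bytes, decoded)
print(list(utf8_widths.values()))  # [1, 2, 3, 4]
```

A plain text file carrying the same bytes would leave the reader to guess which of the two encodings was meant.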

Enforces creating well-formed documents
XML syntax requires properly nested elements, which lets you verify, to a certain degree, that the document is structurally correct.
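
A minimal sketch of such a check in Python, using the standard library parser; the is_well_formed helper and both sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


good = "<note><to>Alice</to><body>Hi</body></note>"
bad = "<note><to>Alice</body></to></note>"  # closing tags are crossed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Note that this only checks structure; whether the elements are the right ones is a matter for validation.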

Can enforce valid documents
In addition to proper nesting, you may need to make sure the document includes only certain elements and attributes, that their values have the right data types, etc. Such documents are declared to conform to a DTD or an XML Schema (the successor of DTD). The nice thing about XML Schema is that it is itself written in XML, and there are a whole bunch of tools out there to validate an XML document against the schema.

Promotes syntax-level interoperability
In the software industry, interoperability has been one of the hardest goals to achieve. People initially thought that having well-defined APIs was the way to go. However, this approach does not work across heterogeneous operating systems. If you look at the success of the Internet and its related technologies, one thing is common: they all have bits-on-the-wire definitions (i.e. they are specified at the syntactic level). XML takes the same route: it defines the syntax of the data being exchanged rather than an API for accessing it.

A safe way of keeping data long term
As we all know, information outlives the technologies we have today. Therefore, keeping data in proprietary formats may not be a good choice, especially if the data is expected to be used much later down the timeline. XML solves this problem, as it is an open specification and is not tied to any specific technology.

To sum up, if you use XML only for the tasks it was designed for, it can pay off; XML is not a solution to every data problem we encounter.

Main References:
Why XML Doesn't Suck

1 comment:

Kannan Goundan said...

This is a low-quality ripoff of Tim Bray's ancient "Why XML Doesn't Suck" article.

The primary problem with XML is that the data model is horrible. Most programming languages have data types that are simple, easy to understand, and easy to use. XML's data model is overly complex; my guess is that the internal model was lifted, without proper abstraction, from the SGML/XML syntax. This was a mistake.

There are better ways of representing hierarchies without having free-standing text nodes pervade your structures. The fact that attributes can't hold complex values cripples expressiveness.

XML being text-based and i18n-friendly and having strict validation doesn't make up for the fact that it's broken at the core. I can take any crappy data format and trivially make it text-based, i18n-friendly, and implement a strict validator.