Tuesday, May 15, 2007

XML: Implications and Opportunities

Everybody's heard of XML and all the hype that comes with it. But what is XML? And what does it mean for publishers?

Often, content is authored, edited and published in a form of markup that controls only style and presentation, not meaning. Examples of this are word processor files and web pages. XML offers the chance to capture the meaning of the content in the markup. Instead of this:

Allen, Thomas B. Vanishing Wildlife of North America. Washington, D.C.: National Geographic Society, 1974.

We have this:

<citation>
  <author>Allen, Thomas B.</author>
  <title>Vanishing Wildlife of North America</title>
  <publisher>
    <name>National Geographic Society</name>
    <place>Washington, D.C.</place>
    <date>1974</date>
  </publisher>
</citation>

From that XML we can generate the same citation, or the same information in any of a number of other citation formats. With this XML we can also find our content with field-specific searches such as "author=Allen".
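To make that concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. It rebuilds the original citation from the XML, produces a second short form, and answers an "author=Allen" style search. The function names (format_citation, format_short, author_matches) and the short "Allen (1974)" form are my own illustrations, not part of any standard.

```python
import xml.etree.ElementTree as ET

# The citation from above, as a string.
CITATION_XML = """
<citation>
  <author>Allen, Thomas B.</author>
  <title>Vanishing Wildlife of North America</title>
  <publisher>
    <name>National Geographic Society</name>
    <place>Washington, D.C.</place>
    <date>1974</date>
  </publisher>
</citation>
""".strip()


def format_citation(citation):
    """Rebuild the full bibliography-style citation from the elements."""
    author = citation.findtext("author")
    title = citation.findtext("title")
    place = citation.findtext("publisher/place")
    name = citation.findtext("publisher/name")
    date = citation.findtext("publisher/date")
    return f"{author} {title}. {place}: {name}, {date}."


def format_short(citation):
    """A second, illustrative output format: surname and year only."""
    surname = citation.findtext("author").split(",")[0]
    date = citation.findtext("publisher/date")
    return f"{surname} ({date})"


def author_matches(citation, surname):
    """A simple 'author=Allen' style search on the author element."""
    return citation.findtext("author", default="").startswith(surname)


citation = ET.fromstring(CITATION_XML)
print(format_citation(citation))
print(format_short(citation))
print(author_matches(citation, "Allen"))
```

The point is that both output formats and the search read from the same elements. None of this is possible with the plain formatted string without guessing where the author ends and the title begins.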

The key to understanding XML and its implications for publishers is "Content Use and Reuse." Let's look at a scenario from a project on which I'm currently working.

Current awareness (news) content is published daily. It was originally written in a proprietary markup format, for which there were lots of in-house custom tools. All the editorial staff knew that markup language, but new hires faced a long learning curve. We replaced the proprietary markup with an XML form, hid the XML behind a word-processor interface, and now publish in a variety of forms (old and new) from the XML. This is feasible because XML is built on open standards and the tools for managing it are freely available. But really, it's just economics: the change saved money. A publisher embarking on this kind of restructuring can use any of a number of standardized content models as a starting point: ODF, DocBook, DITA, etc.

The real opportunity comes in the new ways the content can be extended and reused. More can be done to describe the content, down to a very granular level. The XML form can contain not only the content itself, but also metadata describing it. This metadata might not appear in any one published version, but it can be used for searching and linking applications. The same content can then be published in many ways on different media through simple reformatting.
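Here is a rough sketch of that idea, again in Python with xml.etree.ElementTree. The element names (items, item, metadata, keyword, headline, body) and the two sample news items are invented for illustration; they are not the content model from the project described above. The keywords drive the search, but neither rendering prints them.

```python
import html
import xml.etree.ElementTree as ET

# Invented sample data: two news items, each with metadata keywords.
NEWS_XML = """
<items>
  <item id="n1">
    <metadata>
      <keyword>tax</keyword>
      <keyword>legislation</keyword>
    </metadata>
    <headline>New filing deadline announced</headline>
    <body>The deadline moves to the end of the month.</body>
  </item>
  <item id="n2">
    <metadata>
      <keyword>employment</keyword>
    </metadata>
    <headline>Court rules on overtime pay</headline>
    <body>The ruling applies to hourly staff.</body>
  </item>
</items>
""".strip()


def render_plain(item):
    """Plain-text rendering: headline and body only, no metadata."""
    return f"{item.findtext('headline')}\n{item.findtext('body')}"


def render_html(item):
    """HTML rendering of the same item, with text escaped for safety."""
    headline = html.escape(item.findtext("headline"))
    body = html.escape(item.findtext("body"))
    return f"<h2>{headline}</h2><p>{body}</p>"


def find_by_keyword(root, keyword):
    """Return the ids of items whose metadata carries the keyword."""
    return [
        item.get("id")
        for item in root.iter("item")
        if keyword in [k.text for k in item.findall("metadata/keyword")]
    ]


root = ET.fromstring(NEWS_XML)
first = root.find("item")
print(render_plain(first))
print(render_html(first))
print(find_by_keyword(root, "tax"))
```

Adding another output medium means adding another render function. The source XML and its metadata stay untouched.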

Semantic markup is the next big step. We capture not only the metadata, but also describe the "aboutness" of the content. From simple keywords to linked thesauri, to external, linked ontologies, we can capture many levels of domain meaning for the content.
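As a toy illustration of the step from simple keywords to a linked thesaurus, the Python sketch below maps each term to a broader term and uses that chain to decide whether a piece of content is "about" a topic. The thesaurus entries and the names BROADER, expand and is_about are all hypothetical. A real system would more likely link out to an externally maintained thesaurus or ontology than keep a hard-coded table.

```python
# Hypothetical thesaurus: each term maps to its single broader term.
BROADER = {
    "gray wolf": "wolves",
    "wolves": "mammals",
    "bald eagle": "birds",
    "mammals": "wildlife",
    "birds": "wildlife",
}


def expand(term):
    """Return the term followed by all of its broader terms, in order."""
    chain = [term]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain


def is_about(keywords, topic):
    """True if any keyword is the topic or falls under it in the thesaurus."""
    return any(topic in expand(keyword) for keyword in keywords)


print(expand("gray wolf"))
print(is_about(["gray wolf"], "mammals"))
print(is_about(["bald eagle"], "mammals"))
```

Content tagged only with "gray wolf" is now findable under "mammals" or "wildlife" without anyone having to add those keywords by hand.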