Extensible MarkUp Language (XML) is making sharing of information and maintenance of Web sites a much more automated process.
October 12, 2003: XML, short for eXtensible Markup Language, is a data formatting markup language, whose role is to simplify the exchange of data and information across systems and organizations.
XML is similar to the language that most of today's Web pages are written in, the Hypertext Markup Language (HTML) in that they both use markup symbols, or tags, to describe the contents of a document. However, while HTML describes both the content AND formatting of a document, XML describes ONLY the actual data contained within the document. In an XML document there is no information about HOW the document should look. The look or presentation of the document is applied in a separate step which may depend on the delivery medium and the type of device that the document is viewed on (i.e. web, print, PDA, phone, etc.).
XML authors can define their own tags and tag sets, and can publish those definitions so that others may use them to exchange data.
Why and how does OBPR use XML?
In 2001 when the Office of Biological and Physical Research (OBPR) became NASA's
5th Enterprise, we needed to do a complete makeover of a 4-year old web site (1) that had
become a tangled mess of ugly HTML. Fortunately, our job of dealing with legacy
content was minimized since the Enterprise was new and there was a significant
structural change in the organization. The Web site required not only a new look, but also
virtually all new content. This gave us a clean slate to start with when considering how to
build the new site. Developers working on smaller Web sites at NASA Headquarters
were starting to become aware of the potential benefits of Java and XML in their work (2,3)
and, in fact, at least two pilots were already underway.
Above: Figure 1. The Office of Life and Microgravity Science and Applications main page prior to the creation
of the Biological and Physical Research Enterprise. Note the mismatch of the graphic fonts, a visible
manifestation of the underlying ugliness of the site. Credit: Alex Pline.
One of the primary requirements of the new site was that we needed an automated way to
mange time sensitive news items to avoid one of the greatest Web faux pas, a "What's
New" page that is out of date. This typically happens when items are updated manually.
While there are many ways to automate this task, we began to research the benefits of
using XML after seeing the RSS-like, XML-based feed (4) of news items from the
Science@NASA group (5) at the Marshall Space Flight Center. Since there was no DTD to
describe the structure of this XML file, we created one and included some additional
elements to suit our needs. As we progressed, we began to see the benefits of using XML
to separate content and presentation.
As a temporary solution to the Web site update, we set up a short term HTML "mini-site"
to help us refine our graphical and management requirements. Armed with the XML
format and a DTD, we developed a method of creating HTML versions of the news
information by inserting the news content into HTML templates, using Open Source tools
such as Apache Tomcat (6) (a Java servlet container) and JDOM (7) (an API for representation
of an XML document as a Java class). The servlet implementation worked very well,
exceeded our expectations, and encouraged us to move to an all-XML native site. We
continued to extend our use of servlets for transforming XML to HTML as well as other
site management tasks.
This project has evolved from the initial idea of using XML for a specific task on a Web
site to using XML as the native format for the entire site, embracing the use of XSLT for
all aspects of creating and maintaining the Web site. The site and the underlying
architecture has been an ongoing and evolving process, with improvements in site
architecture and operations being added as requirements arise in a manner that is
characteristic of "prototype" or "pilot" program.
In the following sections we describe the overall project requirements, system
architecture, content operations and upgrades to the site content and XML administration
tools to illustrate the benefits (as well as the downside) of using XML as a native format.
Project Requirements
The basic requirements of the Web site upgrade fell into four main areas:
Content (the types of data structures);
Design (the layout of content and graphical elements);
Distribution of Content (media channels);
Maintainability (requirements for ease of content changes and site upgrades).
While good maintainability can be achieved with current HTML-based Web site
development and administration tools, the advantages that XML offered as a flexible
native format coupled with a rapidly growing selection of tools for processing and storing
XML-based content made XML an obvious choice.
Content: One of the major planning tasks that we had to undertake in using XML was the initial
modeling of the data structure. Significant planning had to be done a priori to insure that
the DTD or schema we created (or adopted) could accommodate all of the types of
content anticipated for the site. We also needed to consider how we were going to author
content in XML.
News items: The minimum set of required data elements for each news item included: title, category,
start date, end date, date of the item, associated image and its properties and descriptive
text. In order to avoid the "out of date What's New page" problem, and to enable us to
enter time-sensitive items well in advance (e.g., announcements of research
opportunities) we needed to create a mechanism that allowed for automated revealing and
hiding of news items based on their start and end dates. An item would not be shown
before its start date and would be archived to an "old news" page after its end date.
Web Content: The minimum set of required data elements for "general web content" (the types of
structures generally found on web sites) included, not surprisingly, many existing HTML
tags, such as paragraphs of text, page headings, tables, lists, images, captions etc.
However, we also had several requirements that included container-type data elements
such as page heading, title, sections and sub sections, and elements to hold meta data and
internal document links. This minimum set had to be expanded later to accommodate
other specific types of content, for example, a magazine article. This "richer" set of
elements included meta data such as authors, editors and source credit as well as more
sophisticated ways of handling images, image credits and captions.
Content Creation: At the planning phase of this project, XML native editors were at a primitive stage,
requiring the author to work in a tagged environment. We evaluated several software
packages and as a result were resigned to working in the tagged environment with a
minimum requirement that the author have the ability to preview the content locally in its
final transformed state prior to posting on the Web site (or other distribution channel).
Site Design: The requirements for the design of the site were developed as one would any HTMLbased
site. We chose to build the site around three sections aimed at specific audiences,
each with its own section navigation. However, we wanted to remain as true as possible
to the concept of separation of presentation (formatting and navigational elements) and
content (data) to insure that any piece of content could be used in any section. This meant
no "hardwiring" of section specific information into the XML content. We also wanted
the site to have common header and footer elements (text, links and graphics) that needed
to be added to each page.
Distribution of Content: Not only did we intend for the content to be used as HTML on a Web site, but we also
wanted the ability to repurpose the content to other distribution channels without having
to manually reformat the content for any specific channel. For example, the content might
be distributed as a printed document, plain text in an e-mail, or very simple HTML for
handheld devices. This also required having the content sans any channel specific
information.
Maintainability: As mentioned above, the maintenance of the news items in an automated way was
paramount. Moving items between HTML (or XML) pages manually was not practical.
We wanted to be able to enter the specific news item information via a Web form only
once, save it as XML, and have the item appear on the appropriate page(s) depending on
its date and category properties.
Along with the ability to reuse and redistribute the content without manual reformatting,
the ability to maintain or completely change the look and feel (e.g., fonts, color scheme,
etc.) and the common elements throughout the entire site, or that of any specific section,
without changing any of the XML content files, was a key requirement.
System Architecture
Content Types and Structure: We created two DTDs; Newsfeed.dtd (8) handles the data requirements for the news items
and OBPRWebContent.dtd (9) handles the data requirements of most of the Web content.
We picked many of the element names to be similar to their HTML counterparts, a
practice that helps understand the role they play, but that also may add to some
syntactical confusion when switching between XML and HTML. A complete list of the
elements can be found in appendices A and B.
Note that in the Newsfeed DTD the intro element, the descriptive text of the item,
allows for child elements that are identical to HTML elements, allowing an author to
embed HTML formatting in the description of the news item. This gives the author a lot
of flexibility in formatting the text of the news item but, on the downside, it has the
potential to cause problems when attempting to parse this data if it is not syntactically
correct.
Alternately, one could use the XHTML (10) specification if there were no requirements for
additional data elements, or a combination of XHTML and application specific elements
could be constructed. This may provide some benefit as the XHTML specification will be
very familiar to people used to working with HTML and potentially require less training
to implement. For example the new xml.nasa.gov site is XHTML compliant.
After attempting to create "feature articles" in the OBPRWebContent.dtd structure, we
determined that we did not have enough flexibility in the types of elements. For example
there were many pieces of data, both content and meta data, that were associated with
more complex content that we might want to process in different ways when
transforming from XML to another format. These meta data include author, editor,
synopsis, teaser, and sidebar. In the OBPRWebContent.dtd structure, we would have had
to treat these as paragraphs of text; without separate elements for these pieces of data we
would not be able to include, exclude or specially format them based on the distribution
or audience requirements. Instead of developing our own structure, we searched for one
that would suit our needs. We found that the Science@NASA organization had
developed a DTD to work with their "magazine article" content, so we adopted their
DTD for our own article content. Not only did adopting the Science@NASA article DTD
solve our problem of needing a more robust DTD for articles, it also facilitated the reuse
their content as well. See appendix C for a listing of the elements and their properties in
SCI-Story.dtd.
XSLT Infrastructure: In order to transform the XML to other formats we are using Extensible Style Sheet
Language Transformations (XSLT). (11) These transformations use an XSLT processor that
parses the XML file and applies a stylesheet written in the Extensible Style Sheet
Language (XSL, an XML dialect) to output a variety of formats. The format that is
ultimately output is based on the stylesheet and processor. There are many Open Source
XSLT processors to choose from for transforming XML, such as those offered by the
Apache XML Project (12) (for example, Xalan and FOP). We are currently using the Open
Source Saxon XSLT processor, developed by Michael Kay. (13)
In the creation of the HTML for the Web site, several transformations are performed.
News items are first transformed to the OBPR Web site format via an XML to XML
transformation (from the newsfeed.dtd format to the OBPRWebContent.dtd format).
Once all XML files are in this format a second process transforms all XML content to
HTML. The transformations are performed automatically via either batch or manual
processes, using compiled XSL stylesheets for optimization. Internal tracking of
timestamps assures that only those files that have changed since their last transformation
are processed.
News Items: Initially we did not use XSLT for the news items. Instead, the HTML pages were
rendered using compiled Java code (templates) that contained the HTML markup
embedded in System.out.println() statements. This crude approach posed several
problems and proved to be too difficult to maintain. We subsequently switched to XSLT
transformations, as described above, in which an intermediate XML to XML
transformation is done to recast the news items as "standard" site XML files valid to the
OBPRWebContent.dtd format. After this is done, these XML files are transformed to
HTML along with the rest of the site content.
Transitioning from the template to the XSLT approach for the news items raised an
interesting problem. As mentioned in the Content Types and Structure section above, the
intro element for each news item could potentially contain embedded HTML. If the
HTML markup is entered incorrectly the news item will not be well-formed XHTML and
the transformation will abort. We therefore had to add a step in the processing to check
for well-formed XHTML using the Open Source program JTidy (14) (a Java version of
Tidy). This step throws an exception when the transformation fails, bringing up another
form through which the author may chose to accept Tidy's corrections or to correct the
mistakes manually.
During the XML to XML news item transformation a variety of OBPRWebContent.dtd
formatted XML files are produced, based on the category and date elements in each news
item. This provides a mechanism for creating separate HTML files on the Web site for
each category of current news items (including an "all items" page) as well as archive
files of news items separated by year of publication. Currently we have more than 300
news items spread over 6 categories and 3 years. It would a be very menial and time
consuming job to manually move items between files as time progresses. Having the
ability to enter the content once and automate this process based on the meta data for
each news item, saves a tremendous amount of manual effort and assures that the news
pages will always be current. Below is a typical news page. Note the internal
subnavigation defined in the OBPRWebContent.dtd format allows for both links and
Javascript dropdown menu. All of the HTML code to accomplish this is created during
the XSLT process.
Above: Figure 2. A typical news page on spaceresearch.nasa.gov showing the category dropdown and
archive menus that select the various news pages. All of these HTML elements and pages are
generated via the XSLT process. Credit: Alex Pline.
Another file that is created during the news item XML to XML transformation is a "top
5" version of the newsfeed XML file. One the main page of the site a Flash application
displays the top 5 news items. One of the benefits of XML and Flash (v. 5 or higher) is
that the Flash application will read the XML directly. Once again the news information,
displayed in another format in a different location, is assured to be in sync with the rest of
the site since they share the same source file. Note that the main page is the only native
HTML page on the site. This was a conscious choice: there is only one instance of this
page and it would be even more work to maintain a separate XSL stylesheet, and its
associated transformation code, than to maintain the HTML directly.
Above: Figure 3.The spaceresearch.nasa.gov main page showing the Flash application news scroller. Credit: Alex Pline.
Main Site Content: Initially our achitecture included an XSL stylesheet for each of the three main section of
the site ("General Information", "Research & Projects", and "Fun & Learning"), in
which the stylesheet added the common site elements (header, footer etc.) as well as the
specific navigation for that section. There were several problems with this architecture.
While using XSLT to create HTML reduced the places we needed to make changes to the
three individual XSL stylesheets, we were still having to make modifications to other
files where the section navigation show up, for example, arrays in Javascript files for
dropdown menus in each section. Also, if we wanted to expand the number of sections,
we would have to create another XSL stylesheet.
Above: Figure 4. A typical page on spaceresearch.nasa.gov showing the various elements that are assembled
during the XSLT process. Credit: Alex Pline.
These issues drove us to create a much more modular XSLT infrastructure for many of
the components that are assembled in the XSLT process. By doing this, all of the
"source" information used to create the HTML files that was previously hard coded into
the XSL stylesheet, either in individual section stylesheets or in multiple places in a
single stylesheet, could now be placed in one set of configuration files used in the XSLT
process. This included a master XML configuration file containing all the relevant
information for each section and various XML (actually XHTML) files containing
snippets of code for the header, footer, each section navigation and top navigation used in
each section. This allowed us to create a single general XSL stylesheet, which would
include these code snippets based on the information in the configuration file and apply
them during the transformation. The XSLT also created the appropriate Javascript arrays
containing the section navigation items. These Javascript arrays are used in every page,
including the Web site's main page. Below is a snippet of the XML configuration file
showing the locations of the various XHTML objects that are assembled during the
XSLT process:
Above: A configuration file code snippet. Credit: Alex Pline.
This indicates that navigation items for the "General Information" section are controlled
by elements in the SectionNav_general_info.xml XHTML file:
Above: A configuration file code snippet for the left navigation. Credit: Alex Pline.
As a result, we only need to change information in a single place to affect changes
throughout the entire site. This includes, adding new major sections, section navigation
items, header, footer and section colors. Once the change is made, all the XML files are
transformed and the site HTML is updated.
The details of the main XSL stylesheet are available on the Web site. (15) In a further
attempt to modularize, this stylesheet references other stylesheets, located at
dtd.nasa.gov, which handle date/time, string, and file information issues.
Application Development:Java Framework: The synergy between Java and XML has been widely discussed. (16) It has been said that
Java provides "portable code" while XML provides "portable data." We recognized this
close connection very early in the design process and decided to base our implementation
on J2EE technology, in particular, Java servlets. Servlets provided a robust environment
in which to develop Web-based applications and the ready availability of mature Java
APIs for XML validation, parsing, and transforming greatly simplified the task at hand.
Using Java servlets in an application layer required a "servlet container" (engine). There
are many servlet containers available today. Many of these are bundled with expensive
middleware application server products, such as IBM's WebSphere and BEA System's
WebLogic. We chose the Tomcat servlet container, released under the Apache Software
License and used in the official Reference Implementation for the Java Servlet and
JavaServer Pages technologies. In doing so we were assured of having a high quality, low
cost (free), thoroughly tested, up-to-date platform upon which we could build our
application.
When we began this project we deployed Tomcat 3.2. At the time of this writing were are
using Tomcat 4.1.27 and will likely upgrade to Tomcat 5.0.x soon after its release. With
each new version, Tomcat improves upon prior versions in terms of security,
performance, remote access tools, scalability and reliability, integration with the
operating system, and session handling, to name just a few. Tomcat 4 implemented the
Servlet 2.3 and JavaServer Pages 1.2, and included many additional features that make it
a useful platform for developing and deploying web applications and web services.
Tomcat 5 implements the Servlet 2.4 and JavaServer Pages 2.0 specifications.
As mentioned, servlets provided a secure and stable platform for building server-based
Web applications. Because servlets are Java technology all of the Java class libraries that
have been developed for XML processing were immediately available for use with little
more effort than downloading the library and placing it in the proper Tomcat directory. In
our work to date the most important of these libraries have been JDOM for accessing,
manipulating, and outputting XML data, Saxon for XSLT processing, Ant (Apache), a
Java application build tool, Xerces (Apache) for XML parsing, JTidy for XHTML
validation, JavaMail (Sun) for e-mail messaging, and Velocity (Apache) for JSP-like
template processing.
OBPR XML Administration Tool Operations: All of the operations of the XML Administration tool running within the Tomcat
framework are performed from a Web-based administration panel on a Unix server. This
box also acts as our site staging server for previewing changes in a protected
environment. All transformations run daily as a cron job so that XML content added to
the server will update the site automatically. These processes can be run manually as
well, in case there is a need to add or change content immediately.
On the main administration page there are options to manage the news items, perform
manual transformations (news items, Web content or both) and preview the entire site or
individual XML/HTML pages. This interface makes the tool incredibly easy to use.
Although we usually use FTP to transfer files to the server, a file upload option was
added for transferring files when FTP is not an option, for example when using the
NASA HQ Secure Nomadic Access application for remote access through the firewall.
Above: Figure 5. The OBPR XML Administration Page used to manage news items and XSLT
transformations Credit: Alex Pline.
Above: Figure 6. Adding a news item is as easy as filling out this web form. Dave validation is performed in the
background and Tidy performs XHTML validation on the "Text of Item" field when the item is
submitted. The system enters this data into the newsfeed.xml file. The "Preview HTML" button allows
the author to see how the item will appear when rendered as HTML. Credit: Alex Pline.
Examples of Site Changes and Upgrades
In the three years that the site has been operational, we have made a number of site
changes that have been very easy to implement given the native XML format and XSLT
infrastructure. We routinely add or change the section navigation and update
header/footer links. To date, we have added one new major section. The new section was
for STS-107, an OBPR research mission. We were able to easily add a custom header and
section navigation and assign colors, all through the configuration files.
When OBPR created an e-mail list for distribution of news items and announcements, it
proved easy to transform the news items from XML to stylized text that includes
hyperlinks pointing to the full news items on the Web site. The default item text can be
edited in a Web form and previewed prior to e-mailing to the list. The system
simultaneously produces an RSS feed containing the news items that were e-mailed to the
list, providing a second method for distributing this content. This feed is syndicated
through several RSS collection sites, such as Syndic8.com.
Above: Figure 7. The STS-107section added midway through the site's life. Note the custom header image. Credit: Alex Pline.
Another recent addition to the site was a "printer friendly" format of all site pages. By
developing new XSL stylesheet (OBPRWebContentPrint.xsl), (17) we are able to produce
HTML pages that print gracefully through the browser. A small change to the DTD was
made to add an attribute to any inline link to specify whether the printer friendly XSL
should display the link URL. After this XSL was developed the OBPRWebContent.xsl
stylesheet was changed to add a link on each page to the printer friendly version. The
entire site was then transformed and updated automatically. "Printer friendly" pages are
now automatically produced for all site content.
Above: Figure 7. The STS-107section added midway through the site's life. Note the custom header image. Credit: Alex Pline.
We have been experimenting informally with using XSL Formatting Objects (XSL-FO)11
and an Open Source XSL-FO processor from Apache called FOP for producing PDF files
from our XML content. This experiment has been only moderately successful due
primarily to the limitations of this Open Source tool. There are a number of upcoming
tools, both Open Source and commercial products, that are addressing the shortcomings
of XSL-FO and we expect that we will achieve greater success with this in the near
future.
Science@NASA Content and the NASA Public Web Portal: When we adopted the Science@NASA format (SCI-Story.dtd) (18) for our "feature article"
content, there were two motivations. One, the format fits the OBPR article content better
than the OBPRWebContent.dtd format, and two, Science@NASA develops a significant
amount of content for OBPR and we wanted to be able to easily repurpose this content on
our site instead of linking to their standard publications on science.nasa.gov. We
developed two XSL stylesheets, OBPRArticles.xsl (19) and OBPRArticlesPrint.xsl, (20) normal and "printer-friendly," respectively, for use with our XSLT infrastructure. With
these stylesheets we can automatically transform the Science@NASA articles in a
manner analogous to the way we transform Web content using the OBPRWebContent.xsl
styleshheets.
When the NASA Public Web Portal (www.nasa.gov) came online, we immediately began
investigating how to supply content directly via XML rather than entering it through the
Portal content management system (CMS). The Portal project has published an XML
schema (21) for importing content into the CMS. The overview of the schema shows
elements for meta data and one large CDATA section for the content body. This content
body is HTML that must be formatted according to the Portal style guidelines.
Since the Science@NASA content is now fully developed in XML, and we have a
significant amount of experience with both the SCI-Story DTD and the Portal schema,
OBPR has offered our XSLT infrastructure to transform the Science@NASA content to
the Portal Import format on an ongoing basis. We have developed an XSL stylesheet
(OBPRArticlesPortalDetail.xsl) (22) that maps the elements in the SCI-Story format to meta
data elements in the Portal Import format and transforms the body content into HTML
that conforms to the Portal style guidelines. It is planned to automate this transformation
and subsequent importing into the CMS. To date, this transformation is in the testing
phase and will soon be put into production. For comparison, it takes a person familiar
with the CMS 2-3 hours to reformat HTML from science.nasa.gov for the Portal, whereas
our approach is fully automatic. Science@NASA publishes about 3 articles per week, so
the use of this transformation saves about 6-9 person hours per week and is implemented
in a significantly more accurate and consistent manner.
Conclusion
Benefits of XML as a Native Format:The basic tenant of XML, separation of content and presentation, has produced
significant efficiencies in the way we manage and maintain the OBPR Web site. Beyond
the savings in human maintenance time, using XML has allowed us to consolidate the
portions of source information that comprise the site into canonical locations which can
be read, processed, displayed and distributed in a variety of ways. Once this data is in an
"available" format, the information is truly "extensible" and will result in additional
efficiencies as new uses are developed, so the benefits go far beyond the first order goal
of increasing Web site maintainability. While, there is a significant learning curve in the
development of an XSLT infrastructure, this kind of approach produces a significant
return on investment through savings on maintenance costs, better data quality and
currency, and a greater number of potential uses for the XML-based data.
Continuing Problem - WYSIWYG XML Editing:
When starting this project, we knew that tools for creating well-formed and valid XML in
a familiar (e.g. WYSIWYG) environment would be a challenge. This is currently the
Achilles heel of creating XML content. At that time, the best tool we could find was
called XML Writer, a well designed "tagged environment" editor that also had the
capability of performing local XSLT transformations.
Above: Figure 9. An XML Writer screen shot. It makes working in a tagged environment easier than Notepad,
offers configurability for custom XSLT processors and project management, however is not appropriate
for not technical users. Credit: Alex Pline.
While this application is certainly appropriate for technical users, it is not appropriate for
the average non-technical content creator used to working in applications such as
Microsoft Word. We have looked into various Microsoft Word add-ons (23) which output
XML by constraining users to various Word styles and mapping those styles to XML
elements. Evaluations of the products show that they are quite limited and often require a
significant amount of customization of Word, depending on the structure of the XML.
Since our XML is "semi-structured" in that there is a large variability in kinds and
numbers of element, this was especially problematic. We did not believe that the benefit
of customizing Word with these plug-ins was worth the effort, especially since our use of
XML is currently limited to the Web application. We have also conducted extensive
analyses of the new Office 2003 suite, (24) which provides the most XML-aware versions of
the Office products yet. However, based on our evaluation of the beta product, using our
"semi-structured" types of content, it appears that Word 2003 will still require some
assistance from a plug-in.
As XML becomes more prevalent in our organization we will have to reevaluate the
available products. The field is rapidly changing and new products show great promise in
helping to solve this problem. Companies that are producing stand-alone XML editors,
such as Arbortext and Adobe, have promising products of various levels of complexity
and sophistication.
Web Links
Why Use XML For Web Content? (http://spaceresearch.nasa.gov/general_info/xmldriven.pdf)
-- A downloadable PDF of this article, which includes all indecies and links to references numbered in the article.
Authors:
Alex Pline, Bruce Altner, Nathan Shaw, Colin Enger
Editor: Alex Pline
(obpr@hq.nasa.gov) Find this page on the web at: