This article describes the conversion of a single web page from HTML to XHTML.
This is just one step towards a much larger goal of improving the accessibility of an entire website. Going through this exercise for a representative page will give a good indication of what will be required to upgrade the rest of the site to XHTML.
The doctype - the HTML coding standard to which the page is supposed to conform - for the selected page is HTML 4.0. Its doctype declaration is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
Just for fun, we're going to upgrade the page through a series of XHTML doctypes: XHTML 1.0 Transitional, then XHTML 1.0 Strict, and finally XHTML 1.1.
Validation is the process of checking that a webpage complies with the rules of its stated doctype. There are various tools available to assist with this; we tend to use the W3C MarkUp Validation Service, which can be found at http://validator.w3.org/.
The page chosen is the home page of the Friends of Strand School website. This is a website that we've looked after for some time.
The home page is a valid HTML 4.0 Transitional page.
To begin the exercise, we're going to make just one change to the page: change the doctype from HTML to XHTML 1.0 Transitional. The new doctype declaration is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-transitional.dtd">
Sending the resulting page to the validator returned a rather worrying 168 errors - plus a handful of warnings for good measure.
On closer inspection, the situation doesn't look so bad: many of the errors are of the same type.
Just for fun, we're going to make the necessary edits manually, and time how long it takes.
Start the clock.
Elements? Attributes? Elements are the building blocks of HTML - p, ul, li, img, a to list just a few. Elements may have attributes. For example, the element img has the attribute src; and the element a has the attribute href. In XHTML, both elements and attributes must be lower case:
Incorrect in XHTML:
<P> ... </P>
<A HREF="next_page.htm"> ... </A>
Correct in XHTML:
<p> ... </p>
<a href="next_page.htm"> ... </a>
Hunting down all the violations of this rule took about 15 minutes. Here's how the scorecard looks at this point:
| Number of errors of this type: | 113 |
| Time taken to correct these errors: | 15 minutes |
| Number of errors remaining: | 63 |
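Had we not been making the edits by hand, a short script could have done most of this work. The following is a minimal Python sketch, not the method we used: the function name lowercase_markup and both regular expressions are our own illustration, and it assumes the page contains no scripts or comments with tag-like text in them. It lowers the case of element and attribute names and leaves attribute values untouched.

```python
import re

# A start or end tag such as <A HREF="x"> or </P>. The doctype and comments
# are skipped because they begin with "<!" rather than a letter.
TAG = re.compile(r"""<(/?)([A-Za-z][A-Za-z0-9]*)((?:"[^"]*"|'[^']*'|[^'">])*)>""")

# One attribute inside a tag: leading whitespace, a name, and an optional
# value (quoted or unquoted). The value is consumed so it is never rescanned.
ATTR = re.compile(
    r"""(\s+)([A-Za-z_:][-A-Za-z0-9_:.]*)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?"""
)


def lowercase_markup(html):
    """Lower-case element and attribute names, keeping values as they are."""

    def fix_attr(match):
        space, name, value = match.groups()
        return space + name.lower() + (value or "")

    def fix_tag(match):
        slash, name, rest = match.groups()
        return "<" + slash + name.lower() + ATTR.sub(fix_attr, rest) + ">"

    return TAG.sub(fix_tag, html)


print(lowercase_markup('<A HREF="next_page.htm"> ... </A>'))
# prints: <a href="next_page.htm"> ... </a>
```

A sketch like this is only a first pass; the validator remains the judge of whether the page is correct.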
Another rule of XHTML is that all tags need to be closed:
Incorrect in XHTML:
<p>This is the first paragraph.
<p>This is the next paragraph.
Correct in XHTML:
<p>This is the first paragraph.</p>
<p>This is the next paragraph.</p>
No problems for our page for those elements that consist of pairs of tags. However, this rule also applies to single-tag elements - the so-called "empty elements" - such as <input> and <img>. In XHTML these tags are closed by inserting a space and a forward slash before the > character:
Incorrect in XHTML:
<img src="images/concorde.jpg" alt="Concorde">
Correct in XHTML:
<img src="images/concorde.jpg" alt="Concorde" />
As expected, the main offenders in the page were <input> and <img>.
| Number of errors of this type: | 54 |
| Time taken to correct these errors: | 10 minutes |
| Number of errors remaining: | 9 |
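Closing the empty elements is another job that lends itself to automation. The sketch below is an illustration under stated assumptions rather than what we actually ran: the function name close_empty_elements is hypothetical, the tuple lists the empty elements defined by HTML 4, and the regular expression assumes there is no tag-like text inside scripts or comments.

```python
import re

# The elements that HTML 4 defines as empty (no closing tag).
EMPTY = ("area", "base", "basefont", "br", "col", "frame", "hr",
         "img", "input", "isindex", "link", "meta", "param")

# An empty element's tag, with or without an existing trailing slash.
# Quoted attribute values are matched as a unit, so a ">" inside quotes
# does not end the tag early.
EMPTY_TAG = re.compile(
    r"""<(%s)\b((?:"[^"]*"|'[^']*'|[^'">])*?)\s*/?>""" % "|".join(EMPTY),
    re.IGNORECASE,
)


def close_empty_elements(html):
    """Rewrite <img ...> as <img ... />; tags already closed are unchanged."""
    return EMPTY_TAG.sub(lambda m: "<" + m.group(1) + m.group(2) + " />", html)


print(close_empty_elements('<img src="images/concorde.jpg" alt="Concorde">'))
# prints: <img src="images/concorde.jpg" alt="Concorde" />
```

Running the function twice gives the same result as running it once, so it is safe to apply to a page that is already partly converted.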
XHTML insists that every attribute value be enclosed in quotes. Single quotes are allowed, but double quotes are the usual convention:
Incorrect in XHTML:
<img src=images/concorde.jpg alt=Concorde />
Correct in XHTML:
<img src="images/concorde.jpg" alt="Concorde" />
Putting the quotes around the attribute values brought the number of errors down to... zero!
| Number of errors of this type: | 9 |
| Time taken to correct these errors: | 5 minutes |
| Number of errors remaining: | Zero! |
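The quoting rule can be sketched in code as well. This is a rough Python illustration, not the way we made the edits: the function name quote_attribute_values is hypothetical, and it assumes that unquoted values contain no spaces and do not end in a slash.

```python
import re

# Any start tag; quoted values are matched as a unit.
TAG = re.compile(r"""<[A-Za-z](?:"[^"]*"|'[^']*'|[^'">])*>""")

# An attribute name with "=", followed by a quoted or an unquoted value.
ATTR = re.compile(r"""(\s+[A-Za-z_:][-\w:.]*\s*=\s*)("[^"]*"|'[^']*'|[^\s"'>]+)""")


def quote_attribute_values(html):
    """Wrap unquoted attribute values in double quotes."""

    def fix_attr(match):
        name_and_equals, value = match.groups()
        if value[0] in "\"'":
            return match.group(0)  # already quoted: leave alone
        return name_and_equals + '"' + value + '"'

    return TAG.sub(lambda tag: ATTR.sub(fix_attr, tag.group(0)), html)


print(quote_attribute_values('<img src=images/concorde.jpg alt=Concorde />'))
# prints: <img src="images/concorde.jpg" alt="Concorde" />
```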
Not quite time for celebration. The validator gave us a few warnings that we need to respond to - we've used bare ampersands within attribute values, and this is another XHTML no-no. Replacing each & with &amp; took about a minute and gave us a page that achieved a clean sheet at the validator.
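The ampersand fix is the easiest of all to script. In the sketch below the function name escape_ampersands and the URL search.php are hypothetical, and we assume the page has no inline scripts in which an ampersand is meant literally. An ampersand that already begins an entity or character reference is left alone, so nothing is escaped twice.

```python
import re

# An ampersand that does NOT start a named entity (&amp;), a decimal
# reference (&#160;) or a hexadecimal reference (&#xA0;).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_ampersands(markup):
    """Replace bare ampersands with the &amp; entity."""
    return BARE_AMP.sub("&amp;", markup)


print(escape_ampersands('<a href="search.php?year=1950&form=5">'))
# prints: <a href="search.php?year=1950&amp;form=5">
```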
So, we fixed 168 errors (elements and attributes to lower case, tags closed, attribute values into double quotes) to get from a valid HTML 4.0 Transitional page to a valid XHTML 1.0 Transitional page in around 30 minutes.
The next step is to change the doctype to the XHTML 1.0 Strict doctype:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
Running the page through the validator gave us a collection of just 7 errors:
Looking down the list, there are two types of error: elements that appear in a context where they are not allowed, and attributes that XHTML 1.0 Strict does not allow.
The first type of error essentially means that certain elements may appear only as children of certain other elements. For example, our element <input> on line 182 is a child of the <body> element; in other words the <body> and </body> tags are the only tags that enclose it. But XHTML 1.0 Strict does not allow <input> in this context.
We can fix this error by making <input> a child of one of a number of specified elements that are allowed in this context - <p> and <div> being the most likely to be used. This is illustrated below:
Incorrect in XHTML Strict:
<body>
<input type="submit" />
</body>
Correct in XHTML Strict:
<body>
<p>
<input type="submit" />
</p>
</body>
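To find this kind of context error before sending a page to the validator, a small checker can walk the page and report any <input> whose parent is <body>. The following is a sketch using Python's standard html.parser module; the class name BodyChildFinder is our own, it tests only this one rule rather than the full Strict DTD, and it assumes the page already closes every element, as a valid XHTML 1.0 Transitional page does.

```python
from html.parser import HTMLParser


class BodyChildFinder(HTMLParser):
    """Record elements of the given names whose direct parent is <body>."""

    def __init__(self, names=("input",)):
        super().__init__()
        self.names = set(names)
        self.stack = []  # currently open elements, innermost last
        self.hits = []   # (line number, element name) for each offender

    def _check(self, tag):
        if tag in self.names and self.stack and self.stack[-1] == "body":
            self.hits.append((self.getpos()[0], tag))

    def handle_starttag(self, tag, attrs):
        self._check(tag)
        self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <input /> never stay open.
        self._check(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


finder = BodyChildFinder()
finder.feed('<html><body>\n<input type="submit" />\n</body></html>')
print(finder.hits)
# prints: [(2, 'input')]
```

Wrapping the <input> in a <p>, as in the corrected markup above, leaves the list of hits empty.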
The remaining errors come from presentational attributes, such as align, that XHTML 1.0 Strict does not allow. Removing these errors is as easy as removing the offending attributes. Although that gives us a valid XHTML 1.0 Strict page, the page now looks a little different: borders have appeared around some of the images, and some right-aligned elements are now left-aligned.
CSS to the rescue.
The border around images is there because the images are hyperlinked. As we never want the border to appear for hyperlinked images, the problem is solved with a few lines in the stylesheet:
a img {
  border: none;
}
To replace the <p align="right"> it's easiest to add a bit of style to the paragraph tag itself:
Incorrect in XHTML 1.0 Strict:
<p align="right">
Correct in XHTML 1.0 Strict:
<p style="text-align: right">
Similarly for the <img ... align="right" /> tags:
Incorrect in XHTML 1.0 Strict:
<img ... align="right" />
Correct in XHTML 1.0 Strict:
<img ... style="float: right" />
So, with a little bit of HTML and a little bit of CSS, we've gone from a valid XHTML 1.0 Transitional page to a valid XHTML 1.0 Strict page.
XHTML 1.1 essentially disallows the Transitional and Frameset versions of XHTML 1.0, so the differences between XHTML 1.0 Strict and XHTML 1.1 are minimal: the lang attribute is replaced by the xml:lang attribute, and the name attribute is replaced by the id attribute for the a and map elements.
So... same process again: change the doctype to XHTML 1.1...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
... and send the page to the validator.
Turns out that we didn't fall foul of any of these changes - the page validates immediately as XHTML 1.1.
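Our starting point was a valid HTML 4.0 Transitional page. Reaching valid XHTML 1.1 meant changing elements and attributes to lower case, closing every tag, quoting attribute values, escaping ampersands, placing <input> in the correct context, and replacing presentational attributes with CSS.
The first thing to say is that 168 errors - the number of validation errors we encountered when changing the doctype from HTML 4.0 Transitional to XHTML 1.0 Transitional - is not a terrible score: we tried the same trick with the FT.com home page and generated 1,020 errors.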
Also, the severity of the errors was low. They could almost be described as "typos" - a change of case here, the addition of quotation marks there.
Not all web pages will make such a smooth transition to XHTML; it is a reflection of the fact that the Friends of Strand School site has followed the spirit - if not always the letter - of XHTML since its inception.
In the move from XHTML 1.0 Transitional to XHTML 1.0 Strict we were forced to make our first non-trivial changes: we fell foul of context rules in a couple of places, and were forced to remove the last of our presentational attributes.
Again, this could have been far worse. The site's already extensive use of CSS made the transition to XHTML Strict a simple one.
This is just the end of the beginning. To make the sums easy, let's say that on the site as a whole, we'll need to correct 200 errors per page to move to XHTML 1.1. There are about 50 pages on the site, so that's (200 x 50 =) 10,000 errors.
Luckily for us, the site uses a content management system to generate each of its pages from a common template. It turns out that 150 of the original 168 errors occurred in the template part of the page. Once the template is corrected, we'll have just 20 or so errors to fix per page, or (20 x 50 =) 1,000 errors in all.
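The sums can be laid out in a few lines of Python. The figures are the rounded estimates used above, not measurements, and the function name total_errors is simply our own label.

```python
def total_errors(errors_per_page, pages):
    """Estimated number of errors to fix across the whole site."""
    return errors_per_page * pages


PAGES = 50  # approximate number of pages on the site

without_template_fix = total_errors(200, PAGES)  # every page fixed in full
with_template_fix = total_errors(20, PAGES)      # template corrected once first

print(without_template_fix, with_template_fix)
# prints: 10000 1000
```

On these estimates, correcting the template first removes nine tenths of the work.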