How to Prepare Legacy Data for XML

ASCII Strings Approach

Introduction

This is a summary of my experience converting legacy data into a format that can be included in an XML or HTML document and that accurately represents the original scripts. (In this context a script is a collection of letters and other written signs used to represent textual information in a writing system; see: http://www.unicode.org/glossary/.)

Today's web provides many examples of how to represent characters from the world's languages, and it also demonstrates that simple English, Latin-based techniques do not work well for many other languages.

The explosion of XML and of UTF-8 encoding has established these two – by consensus – as the standard for representing content, at least for data interchange and possibly also for storage. Now the issue is how to convert the legacy content into a format that can be included in a modern XML document.

ASCII Strings Conversion Process

The ASCII Strings Conversion Process relies on initially handling the data as a simple sequence of characters (as 8-bit bytes) and sending it through four stages. Since they are just bytes, one can employ rather fast lookups and range checks to streamline the processing. This is important because the data will be handled multiple times before the conversion is complete.
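
Since every byte value can be classified ahead of time, a 256-entry table gives constant-time checks. A minimal sketch in Python (the category names are mine, purely for illustration):

    # Classify all 256 byte values once, up front.
    CONTROL, ASCII7, HIGH = 0, 1, 2

    CLASS = [CONTROL] * 256
    for b in (0x09, 0x0A, 0x0D):
        CLASS[b] = ASCII7            # tab, LF, CR are acceptable controls
    for b in range(0x20, 0x7F):
        CLASS[b] = ASCII7            # printable 7-bit ASCII
    for b in range(0x80, 0x100):
        CLASS[b] = HIGH              # "High ASCII" -- needs conversion

    def classify(data: bytes):
        # One table lookup per byte; no per-character decoding needed.
        return [CLASS[b] for b in data]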

For this exercise the input will be assumed to be mixed ... by that I mean it may contain different techniques for representing the scripts of any number of languages in the same data stream. It is these characteristics that make conversion of legacy data so challenging.

An inventory of some techniques used to represent characters includes: single-byte code pages (e.g. CP437, CP1252, ISO8859-1, ANSEL), double-byte and escape-driven encodings (Windows double-byte code pages, ISO-2022, MARC-8), Unicode encoding forms (UTF-8, UTF-16), textual entities, and numeric character references.

The goal is to discover the correct Unicode code point for each character.

Stage I:

Identify bytes outside of standard 7-bit ASCII and substitute the correct character references to the Unicode code points for them.

This is the hardest part. Somehow one must determine the format and encoding used in the original data ... is it: CP437, CP1252, ISO8859-1, ANSEL, some Windows double-byte code page, an OEM code page, an IBM code page, UTF-8, UTF-16 (little- or big-endian), ISO-2022, MARC-8, etc.? Many code page details can be found on the web. It would also be a good idea to search for vendor-specific documentation, particularly if you are dealing with legacy hardware or software.

Often this information is undocumented or ambiguous, so an assumption (or multiple assumptions) must be made in order to classify it.

The goal is to correctly identify the encoding and develop a look-up that returns the correct numeric character reference for the encoded character.

If there is no documentation as to what code page or encoding is used, other methods must be employed — such as having a human read the raw data to infer what the language may be. Often one can find a character whose location in one code page is different from its position in another. By looking at the character in relation to the others around it, someone familiar with the language can determine which code page is in use.

It is often useful to look at the data in a hex dump to see what pattern the high ASCII data presents: are there escape characters before each sequence? (it may be ISO-2022); are there characters hex 1E, 1D and 1F? (it may be MARC); after an initial byte above hex BF, are there immediately one to three bytes in the range hex 80 to BF? (it may be UTF-8).

Sometimes the file's creation date will narrow the possibilities (very early DOS data, from before 1990, would not be CP1252). Some trial and error may be necessary to come to a best guess as to the code page or encoding. The amount and variety of the data being examined will determine how correct the assessment turns out to be when the whole dataset is processed. A fall-back encoding must be decided upon so that ambiguous data can be handled.
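
A rough sketch of these pattern checks (the candidate tests and the CP1252 fall-back are illustrative assumptions, not a complete detector):

    def guess_encoding(data: bytes) -> str:
        # Escape characters before each sequence suggest ISO-2022.
        if b"\x1b" in data:
            return "iso-2022"
        # Hex 1D, 1E and 1F are MARC record/field/subfield separators.
        if any(sep in data for sep in (b"\x1d", b"\x1e", b"\x1f")):
            return "marc-8"
        # If the whole stream decodes cleanly, UTF-8 is a strong bet.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
        return "cp1252"              # the assumed fall-back encoding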

Because at this stage we are handling bytes above hex 7F in a mixed data stream, we need to assume that some data could already be UTF-8 or UTF-16. One of the first tests is to see whether a string's High ASCII sequence is all valid UTF (i.e. it follows the pattern of a UTF sequence; see "Mapping codepoints to Unicode encoding forms"). If it is valid, the converter can substitute character references for UTF-8 or UTF-16 sequences by simply referencing the code point that the UTF-8/16 encodes. If the sequence is invalid UTF, then additional tests for other expected patterns can be tried; eventually, if no pattern is confirmed, the fall-back encoding should be assumed.
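
A minimal sketch of this test-and-substitute step for UTF-8 runs (treating CP1252 as the assumed fall-back):

    import re

    HIGH_RUN = re.compile(rb"[\x80-\xff]+")    # any run of High ASCII bytes

    def refs_for_run(run: bytes) -> str:
        try:
            text = run.decode("utf-8")         # valid UTF? use its code points
        except UnicodeDecodeError:
            text = run.decode("cp1252", errors="replace")   # assumed fall-back
        return "".join("&#x%X;" % ord(ch) for ch in text)

    def convert_high_ascii(data: bytes) -> str:
        out = HIGH_RUN.sub(lambda m: refs_for_run(m.group()).encode("ascii"), data)
        return out.decode("ascii")

Because every byte of a multi-byte UTF-8 sequence is above hex 7F, a run of High ASCII bytes always contains whole sequences, so decoding the run in isolation is safe.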

One needs to develop converter code to classify and handle all characters above hex 7F. The goal is to end up with no High ASCII.

A special point about UTF-8: there is one aspect of this encoding that is often overlooked and needs to be considered -- the processing of surrogates. UTF-8 does not allow encoding of characters between U+D800 and U+DFFF as individual code points. The converter must recognize this and handle these invalid encodings. One possibility is to treat them as CESU-8 (see footnote), which encodes the UTF-16 code values (16-bit quantities) instead of the character number (code point). If the converter finds a CESU-8 style reference to a surrogate pair, it can generate the correct character reference for it. If the sequence is not valid CESU-8, a character reference to the replacement character (U+FFFD) could be used.
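
The surrogate arithmetic itself is simple; a sketch (the helper name is hypothetical):

    def combine_surrogates(hi: int, lo: int) -> int:
        # A CESU-8 pair carries two UTF-16 code values; recombine them
        # into the real code point using the UTF-16 algorithm.
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

    # Example: the pair D834 DD1E recombines to U+1D11E.
    assert combine_surrogates(0xD834, 0xDD1E) == 0x1D11E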

Stage II:

Remove any remaining illegal characters.

Some characters in the 7-bit ASCII set are not allowed in XML. Also, some locations in high ASCII may have no look-up in some encodings, so they must be removed as well. (Although, if they are encountered, it likely means that the choice of encoding was incorrect or the data is corrupt.)

Develop a method to ensure the remaining bytes are within the specification of a valid value for XML:

   ValidChar = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Do not create a character reference to a disallowed character. Some analysis and substitution could be performed if the intention of the characters can be determined. For example, a backspace between characters could indicate that the data was used on a line printer, where overstriking performed character accenting (or other decoration).
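
The ValidChar production above translates directly into a range check; a minimal sketch:

    def is_valid_xml_char(cp: int) -> bool:
        # Mirrors the ValidChar production from Stage II.
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)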

Stage III:

Convert entities to Unicode character references.

For this process it is necessary to develop a dictionary of all of the entities and their Unicode character reference equivalent. There are standard entity files (extension ENT) on the net that can be used for this.

The idea is to use the entity as the key and replace it with its numeric character reference: the numeric form (&# followed by either decimal or hex digits and ending with a semi-colon) is valid in XML, while the textual entity is not valid without a declaration in the document.

Ensure that the entity look-up routine is coded so that it can distinguish a numeric character reference from an entity (leaving the numeric character references untouched).

For any entity that is not in the dictionary, the routine should "un-entitize" it by replacing the ampersand with "&amp;". This is a last-resort behavior, since it means that the entity text itself will be visible when the record is displayed (but at least it won't break the XML).
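
A sketch of the look-up described above (the three dictionary entries are examples only; a real table would be built from the standard ENT files):

    import re

    ENTITY_TO_NCR = {"eacute": "&#xE9;", "nbsp": "&#xA0;", "mdash": "&#x2014;"}

    # Matches textual entities only; numeric character references never
    # match because the name here must start with a letter, not "#".
    ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

    def replace_entities(text: str) -> str:
        def repl(m):
            ncr = ENTITY_TO_NCR.get(m.group(1))
            # Unknown entity: neutralize the ampersand as a last resort.
            return ncr if ncr else "&amp;" + m.group(0)[1:]
        return ENTITY.sub(repl, text)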

Stage IV:

Encode remaining characters used in XML markup to the XML standard entities.

There are five characters that have standard, predefined entities in XML ( & < > " ' ). Only three of these characters matter in our case because the data being converted is assumed not to be XML (i.e. does not contain XML markup); it is purely content.

The double and single quote are only invalid when they are used in XML attributes, and since this is content, not markup, they are allowed as regular characters. The other three should be converted to their standard entities ("<" becomes "&lt;", ">" becomes "&gt;", and "&" becomes "&amp;").

The process is a simple search and replace, except for "&". Because the data being processed contains valid character references, one must be careful to replace only ampersands that are "naked" -- not part of a character reference. The regex pattern for recognizing these ampersands is:

       &(?!#\d|#x|amp;|gt;|lt;|apos;|quot;)
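
Putting Stage IV together (a sketch; the ampersand must be handled with the lookahead so existing references survive):

    import re

    NAKED_AMP = re.compile(r"&(?!#\d|#x|amp;|gt;|lt;|apos;|quot;)")

    def escape_markup(text: str) -> str:
        text = NAKED_AMP.sub("&amp;", text)          # only "naked" ampersands
        return text.replace("<", "&lt;").replace(">", "&gt;")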

Conclusion

The result of this processing is text that should be valid in an XML document. It is plain ASCII. It contains no illegal characters (and in fact no multi-byte encoded characters, such as UTF-8 sequences, either). It contains no entities, except for those allowed in XML without external definition. And it retains the non-ASCII characters as character references to Unicode code points.

In general, the final step in packaging the data as an XML document should be to send it through a SAX XML parser just to see that it does not fail -- then it can be written to the output file. This is the point where one can set UTF-8 as the encoding for the output file, if so desired.
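
A sketch of that sanity check with Python's built-in SAX parser (the throw-away wrapper element is an assumption; adapt it to however the data is actually packaged):

    import xml.sax

    def is_well_formed(fragment: str) -> bool:
        # Wrap the converted content in a dummy root element and let the
        # parser decide; the default error handler raises on any problem.
        doc = "<root>%s</root>" % fragment
        try:
            xml.sax.parseString(doc.encode("ascii"), xml.sax.ContentHandler())
            return True
        except xml.sax.SAXParseException:
            return False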

 

Footnote:

https://tools.ietf.org/html/rfc3629 (page-5):

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.  When encoding in UTF-8 from UTF-16 data, it is necessary to first decode the UTF-16 data to obtain character numbers, which are then encoded in UTF-8 as described above (page-4).  This contrasts with CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for use on the Internet. CESU-8 operates similarly to UTF-8 but encodes the UTF-16 code values (16-bit quantities) instead of the character number (code point). This leads to different results for character numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT valid UTF-8.