Xlint - an Error-Tolerant XML Parser

What is Xlint?

According to the Extensible Markup Language (XML) specification, violations of well-formedness constraints are fatal errors: once a fatal error is detected, the XML processor must not continue normal processing. Because of this requirement, conventional XML parsers stop as soon as the first well-formedness error is detected. Before an XML document can be used for any purpose, all of its well-formedness errors must be removed. With a conventional parser, this means rerunning the parser after every fix to find the next error, which is very inefficient for a huge XML document with many errors (some of which may be systematic errors caused by improper global replacements). A more desirable parser would never stop early and would report all the well-formedness errors during or after a complete parse of the document. Xlint is such an error-tolerant XML parser, intended to make error removal in large XML documents easier. Xlint is implemented in Perl.

How Does Xlint Work?

Based on the XML document structure, we classify the errors in a non-well-formed XML document into the following two types:

(1) Syntax errors, including:

(2) Structural errors, including:

Xlint handles the first type of error in a recursive descent fashion, parsing all the constructs according to the grammar. The second type of error is handled by creating a tag-stack and explicitly manipulating the tags encountered in the XML document. For more detail, please refer to the Xlint document.
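The tag-stack idea can be illustrated with a minimal sketch. This is not Xlint's Perl code: it is a Python approximation under stated assumptions, and the names check_structure and TAG_RE are hypothetical. It recognises only plain start, end, and empty-element tags, and it keeps going after each mismatch so that every structural error is collected in one pass.

```python
import re

# Assumption: a simplified tag pattern. Comments, CDATA sections, and
# processing instructions are not handled in this sketch.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")


def check_structure(text):
    """Return a list of (offset, message) pairs, one per nesting error."""
    errors = []
    stack = []  # (name, offset) for every start tag that is still open
    for m in TAG_RE.finditer(text):
        closing, name, empty = m.group(1), m.group(2), m.group(3)
        if empty and not closing:
            continue  # an empty-element tag such as <br/> needs no end tag
        if not closing:
            stack.append((name, m.start()))
            continue
        if name not in [n for n, _ in stack]:
            # Stray end tag: report it and leave the stack untouched.
            errors.append((m.start(), f"end tag </{name}> has no matching start tag"))
            continue
        # Pop elements left open above the matching start tag, reporting each.
        while stack[-1][0] != name:
            unclosed, pos = stack.pop()
            errors.append((pos, f"start tag <{unclosed}> is never closed"))
        stack.pop()
    # Anything still on the stack at end of input was never closed.
    for unclosed, pos in stack:
        errors.append((pos, f"start tag <{unclosed}> is never closed"))
    return errors
```

For example, check_structure("<a><b><c></b></a></d>") reports both the unclosed <c> and the stray </d> in a single run, instead of stopping at the first problem. The recovery policy shown here (pop back to the nearest matching start tag) is one possible choice; the Xlint document describes the policy Xlint actually uses.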

How to Use Xlint

Xlint is run with the command:

perl xlint.pl <file_name> [-v | -v <number_of_chars>]

You must supply the “file_name” parameter, which is the absolute or relative path of the XML document to be parsed. After “file_name”, you can add the optional parameter “-v” or “-v number_of_chars”:

-v

Verbose mode with the default context length. A context of 30 characters around the error position is displayed.

-v number_of_chars

Verbose mode with a given context length. The length of the error context is set to number_of_chars.
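To make the notion of an error context concrete, here is a small Python sketch of how such an excerpt could be cut from the document text. It is an illustration only, not Xlint's implementation: the function name error_context is hypothetical, and it assumes the context length is the total window size, split evenly before and after the error position. Xlint may place the window differently.

```python
DEFAULT_CONTEXT = 30  # default context length stated for the -v option


def error_context(text, pos, length=DEFAULT_CONTEXT):
    """Return up to `length` characters of `text` centred on offset `pos`."""
    half = length // 2
    # Clamp the window so it never runs past either end of the document.
    start = max(0, pos - half)
    end = min(len(text), pos + half)
    # Flatten newlines so the excerpt prints on a single line.
    return text[start:end].replace("\n", " ")
```

Near the start or end of the document the excerpt is simply shorter, since the window is clamped rather than shifted.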

Download

 

For questions or comments, please contact Yuhui Jin (yhjin@db.stanford.edu) or Juan Fernando Arguello (jarguell@db.stanford.edu). Last modified May 27th, 2015.