Xlint - an Error-Tolerant XML Parser
According to the Extensible Markup Language (XML) specification, violations of well-formedness constraints are fatal errors: once a fatal error is detected, the XML processor must not continue normal processing. Because of this requirement, conventional XML parsers stop as soon as the first well-formedness error is detected. Before an XML document can be used for any purpose, all of its well-formedness errors must be removed. Doing this with a conventional parser means rerunning the parser after every fix in order to find the next error, which is very inefficient for a huge XML document with many errors (some of which may be systematic errors caused by improper global replacements). A desirable parser should never stop early; it should report all the well-formedness errors during or after a complete parse of the XML document. Xlint is such an error-tolerant XML parser, designed to facilitate error removal in large XML documents. Xlint is implemented in Perl.
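To illustrate the stop-at-first-error behaviour, here is a minimal sketch in Python. It is not part of Xlint (which is written in Perl); it uses the standard library's expat-based `xml.etree.ElementTree`, and the sample document and the helper name `first_error` are made up for illustration. The sample contains two errors, yet the parser reports only one before giving up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two well-formedness errors:
# </a> appears while <b> is still open, and </d> has no start tag.
doc = "<root><a><b></a><c></d></root>"

def first_error(text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        # The exception describes a single error; parsing has already stopped.
        return str(exc)
    return None

print(first_error(doc))
```

The second error only becomes visible after the first one is fixed and the parser is run again, which is the repeated-run cost described above.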
Based on the XML document structure, we classify the errors in a non-well-formed XML document into the following two types:
(1) Syntax errors, including:
Syntax error for the xml declaration;
Expect the attribute for version info;
Invalid version number assignment;
Invalid encoding name assignment;
Invalid assignment for standalone document declaration;
The specified attribute was not expected at this location;
Syntax error for the tag (such as missing the end bracket);
Duplicate DocType declaration;
Syntax error for the comment;
Syntax error for the processing instruction;
Expect white space.
(2) Structural errors, including:
Missing the start tag;
Missing the end tag.
Xlint handles the first type of error in a recursive-descent fashion, parsing all the constructs according to the grammar. The second type of error is handled by creating a tag stack and explicitly manipulating the tags encountered in the XML document. For more detail, please refer to the Xlint document.
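The tag-stack idea can be sketched as follows. This is a simplified illustration in Python, not Xlint's actual Perl code: the function name `check_structure`, the regular expression, and the recovery rule (when an end tag matches an element deeper in the stack, the elements above it are reported as missing their end tags) are all assumptions made for the example. Comments, CDATA sections, and processing instructions are not handled.

```python
import re

# Simplified tag pattern: start tags, end tags, and self-closing tags only.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")

def check_structure(text):
    """Scan the whole text and return a sorted list of (offset, message) errors."""
    errors = []
    stack = []  # (tag name, offset) for each start tag that is still open
    for m in TAG.finditer(text):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if not closing:
            if not self_closing:
                stack.append((name, m.start()))
            continue
        if name not in [n for n, _ in stack]:
            # No open element with this name: report it and keep scanning.
            errors.append((m.start(), f"Missing the start tag for </{name}>"))
            continue
        # Elements opened after the matching start tag were never closed.
        while stack[-1][0] != name:
            unclosed, pos = stack.pop()
            errors.append((pos, f"Missing the end tag for <{unclosed}>"))
        stack.pop()
    # Anything left on the stack at end of input was never closed.
    for unclosed, pos in stack:
        errors.append((pos, f"Missing the end tag for <{unclosed}>"))
    return sorted(errors)

for offset, message in check_structure("<root><a><b></a><c></d></root>"):
    print(offset, message)
```

Because each mismatch is recorded rather than raised, a single pass reports every structural error in the input.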
Xlint is run with the command:
perl xlint.pl <file_name> [-v | -v <number_of_chars>]
You must supply the “file_name” parameter, which is the absolute or relative path of the XML document to be parsed. After “file_name” you can add the optional parameter “-v” or “-v number_of_chars”:
-v | The verbose mode with default context length. A context of 30 characters around the error position is displayed.
-v number_of_chars | The verbose mode with given context length. The length of error context is set to number_of_chars.
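As an illustration of what an error context might look like, here is a small Python sketch. It is not taken from Xlint: the function name `error_context` is hypothetical, and it assumes the context length is the total number of characters shown, centred on the error position. The text above does not say whether the 30 characters are counted in total or on each side.

```python
def error_context(text, position, length=30):
    """Return up to `length` characters of `text` around `position`.

    Assumed convention: the window is centred on the error position and
    clipped at the start and end of the document.
    """
    half = length // 2
    start = max(0, position - half)
    end = min(len(text), start + length)
    return text[start:end]

sample = "0123456789" * 10  # 100 characters of dummy content
print(error_context(sample, 50))      # default context length of 30
print(error_context(sample, 50, 10))  # context length given explicitly
```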
For questions or comments, please contact Yuhui Jin (yhjin@db.stanford.edu) or Juan Fernando Arguello (jarguell@db.stanford.edu). Last modified May 27th, 2015.