About: This is text data from the Wall Street Journal, parsed for sentence breaks, punctuation, abbreviations, capitalization, etc. In wsj.xml, each S element represents a sentence, and the words and punctuation of that sentence are listed in order as T subelements. The file wsj2.xml is a reformatting of the same data: each S element still represents a sentence, but consecutive words are denoted by the elements T and W. For example, the sentence "Roses are red." would be represented by a single S element whose consecutive words are held in T and W subelements. Please don't ask why it's all weird like that.
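
A rough sketch of how the two layouts described above might be read, using Python's standard xml.etree.ElementTree module. The sample strings are assumptions made for illustration, not excerpts from the real files: the root element name FILE is hypothetical, and the strict T/W alternation in the wsj2.xml sample (with punctuation treated like a word) is only a guess at what "consecutive words are denoted by the elements T and W" means.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: the FILE root and the exact T/W alternation are
# assumptions for illustration, not copied from wsj.xml or wsj2.xml.
WSJ_SAMPLE = "<FILE><S><T>Roses</T><T>are</T><T>red</T><T>.</T></S></FILE>"
WSJ2_SAMPLE = "<FILE><S><T>Roses</T><W>are</W><T>red</T><W>.</W></S></FILE>"


def sentences(xml_text):
    """Return one list of tokens per S element, in document order."""
    root = ET.fromstring(xml_text)
    result = []
    for s in root.iter("S"):
        # Accept both T and W children so the same reader handles either layout.
        tokens = [(child.text or "").strip() for child in s if child.tag in ("T", "W")]
        result.append(tokens)
    return result


print(sentences(WSJ_SAMPLE))   # [['Roses', 'are', 'red', '.']]
print(sentences(WSJ2_SAMPLE))  # [['Roses', 'are', 'red', '.']]
```

Because the reader treats T and W alike, both samples yield the same token list, which matches the statement that wsj2.xml is only a reformatting of the same data. For the full files, ET.parse(path).getroot() could stand in for ET.fromstring.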