Problem/Question/Abstract: Three Classes for Grabbing HTML/XML Information Answer: I recently bought one of the new digital satellite dishes and ran across an interesting challenge - figuring out just what was on, and when. DirectTV provided a Web site with a basic search engine, but the Web-based search was slow and very narrowly focused. As a typical programmer, I knew I could provide a better, more powerful UI, if I could just figure out how their Web search engine worked. A quick scan of the HTML source pointed out a relatively simple search form that I could easily duplicate, but the HTML results came back in a mildly complicated HTML table. Brute force code would have been simple enough to construct to parse through the table, but I'd been looking for a reason to build a more general parser, so off I went. If I'd known just how lax the HTML rules are, and just how many hacks there are, I'd have just stuck with the brute force method and saved myself a lot of agony, but since I'm here now ... The Basics To put together a parser for HTML, an understanding of the rules is required. HTML originated as an SGML-like syntax, and over time has grown to fit more closely within the confines of SGML. These days the syntax is described within an SGML Data Type Definition (DTD) bringing it into a reasonably well-understood and managed domain. Given that SGML now establishes the underpinnings of HTML, the parser should apply the SGML syntax rules as a starting point. This also allows for the consideration of some simple extensions that allow parsing of XML. Therefore, the parser is built to work on SGML in general, with specific handlers for exceptions and extensions that occur in HTML and XML. The rules for SGML are straightforward, and provide five basic constructs that we care about: elements, attributes, comments, SGML directives, and "everything else" Elements are the facet of the SGML content with which we are most concerned, and around which the parser is established. Elements have a start tag, content, and an end tag, for example: HTML Parsing where TITLE is considered to be the "name" of the tag. Element names are case-insensitive. So we start with the following parsing rules: Element start and end tags are surrounded by < and > characters. Element end tags are denoted by the / character immediately following a <. Element content is surrounded by start and end tags. HTML Extensions In HTML, we immediately note that there are exceptions to these rules. For some elements - most notably
- the end tags may be omitted, even though the element may have contents. This offers perhaps the most annoying challenge of the HTML parsing rule set, because there are several methods by which the element may be terminated.
To start with, we note another syntax rule from SGML: elements may not span. That is, if an element's start tag is contained within another element's start and end tags, its end tag must also appear there. Put simply, if we encounter an end tag, all omitted end tags are considered "closed" back up to the matching start tag. Also, by observation (I couldn't find a formal rule for this in the HTML specification), virtually all elements with optional end tags close when they encounter another of themselves, and being fine examples. Further, the HTML reference material does indicate that
elements that omit their end tags are terminated by "block elements." The HTML DTD must be consulted to determine which elements are considered block elements. Unfortunately, all of this prevents us from using a general rule, and requires that we become concerned with the HTML DTD.
A quick consideration of the DTD is therefore in order. The DTD calls out which elements must have end tags, and which may omit (or are forbidden to have) end tags. For example, the DTD fragment for
The important thing to note here is - O. The - indicates the start tag is required, and O means the end tag may be omitted. Compare this to the fragment for : where - O EMPTY indicates that the start tag is required. Since the element is EMPTY, however, the optional end tag is now expressly forbidden. In the fragment for
, the (%inline;)* shows the legal contents of the P element. In this case, %inline; refers to a list of elements defined earlier in the DTD. Notably missing from the %inline; list is
itself. A perusal of the other ambiguous elements reinforces the observation that in general, an element may not immediately contain itself (although this ambitiously general rule is certainly not guaranteed to remain valid for future releases of the HTML DTD). In the same way that %inline; is defined, so there exists a list name %block; which contains the list of block elements.
This leads to another set of parsing rules:
with an omitted end tag is terminated when a block element is encountered.
Elements are terminated when another element of the same name is encountered. Elements are terminated if a parent's end tag is encountered; no spans are allowed. Attributes Attributes represent the various properties of elements. By definition, attributes appear in name/value pairs within the start tag of the element. For example, in: the BODY element has an attribute name bgcolor, and the attribute has a value of #FFFFFF. Double quotation marks or single quotation marks are required to delimit the value, unless the value contains only letters, digits, hyphens, and periods. If the value contains single quotation marks, it should be delimited with double quotation marks, and vice versa. Attribute names are case-insensitive. Also worth noting is that not all attributes have a value. For instance, the NOWRAP attribute of the element. Attributes appear within an element's start tag Attributes are delimited by a space character (ASCII 32) Attribute values are delimited by " or ' Comments SGML also provides that its content may include comments. Comments are of the form: The ). Further, a comment may contain < and > characters. Comments may not include other comments. terminates a comment. < and > are ignored while parsing a comment. The remaining SGML directives are denoted by the beginning markup declaration open delimiter Comments within the directives are delimited by -- Lastly, we consider what remains. Content of elements not contained within a start tag, end tag, or comment is considered by the parser to be PCData (parsed character data). Store text not included in element start/end tags or comments as PCData. XML Extensions As mentioned before, XML is also derived from SGML. While HTML is basically a DTD described within and using SGML, XML is a subset of SGML capable both of representing data and containing other DTDs of its own. XML also demonstrates that those working on the standards in the programming community actually do learn from the mistakes of those that went before them. For instance, the rules of containment are much more formal in XML than they are in HTML, making parsing a great deal simpler. This means that while a DTD may be included in an XML document for syntax checking purposes, it isn't necessarily required for the actual parsing of the XML content, as it is for HTML. Knowing this, we can add two more rules and provide DTD-less XML parsing as well. For one, empty elements in HTML are simply called out as such in the DTD with their end tags forbidden ( for example). If an element in XML is to be empty (that is, it will have no content) its start tag may have a / just before the closing > indicating that no content and no end tag will follow. Additionally, XML directives may appear with the ? character rather than !. Empty elements in XML may be terminated by a / just before the > in the start tag, e.g. . Additional directives appear using ?, instead of the ! character. Everything Else Any items encountered in the content that are not contained in element start or end tags, comments, or DTD items are considered by the parser to be PCData. The content of elements fits this bill, as do carriage returns and line feeds encountered by the parser. This leads to the final parsing rule: Any content located outside of start/end tags, comments, or DTD items is PCData. Additional Considerations It is also worth noting that syntax errors and occurrences of "browser-tolerated" HTML inconsistencies are frequently encountered, and as such, should not raise exceptions except in extreme cases. Instead, a warning should be noted and parsing should continue if possible. The Parser The goal of parsing the SGML content is to place the data it represents into a form more readily accessible to other components. Since SGML calls out a hierarchical structure a hierarchy is probably the most accurate way to store the parsed content. With that in mind, the parser is built from two primary classes, and a third supporting class. First and foremost is the THTMLParser class. Its Parse method accepts the content to be parsed, and places the processed results in the Tree property. Next is the TTagNode class in which the parsed results are contained. This class is a hierarchical storage container with Parent pointing to the TTagNode that contains the current node, and Children containing a list of children immediately owned by the current node. TTagNodeList is provided as a list container for a collection of TTagNode objects, typically produced by a call to the GetTags method of the TTagNode class. A Simple Example Consider the sample HTML shown in Figure 1. The parser would produce from the HTML a hierarchy that can be visualized as in Figure 2. Each of the boxes in the tree represents a TTagNode instance. Example HTML Example HTML Plain old text right out there in the middle of the document.
Text contained within in paragraph
Unordered List Figure 1: Sample HTML. Figure 2: Hierarchical representation of parsed HTML. Each node has a NodeType property that indicates what type of node it is. All node types except ntePCData also have text in the Caption property that provides more information about the node's contents. See the table in Figure 3 for details. Node Contents NodeType Caption HTML elements nteElement Element name Text ntePCData Comments nteComment ! SGML/XML/DTD directives nteDTDItem ! or ? and directive Figure 3: TTagNode node types. For example, the HTML node in Figure 2 has a NodeType of nteElement, while the comment tag is of type nteComment, and the PCData nodes are of type ntePCData. The content or text for each HTML element is contained in a node in its Children list. For example, the TITLE node in Figure 2 has a PCData node whose Text property contains "Example HTML". The GetPCData method of a TTagNode returns the PCData text for all children of the node for which it's called. Note, this method is recursive and will return the PCData text for all nodes in the tree beneath the node upon which it's called. Retrieving Elements The GetTags method of a TTagNode will return a list of all children that match an element name. If '*' or '' is specified as the element name, then all children will be returned. Note that this method is recursive. The result list is a TTagNodeList. The code in Figure 4 illustrates how the GetTags method is used to collect a list of all elements in the HTML from Figure 1 and insert their contents into a list box. The process is as follows: Create a container for the results, i.e. the Elements list. Call the GetTags method, passing the desired element name and the container. Iterate through the container, placing the text for each element in a list box. Destroy the container. procedure Button1OnClick(Sender: TObject); var Elements: TTagNodeList; Counter: Integer; begin Elements := TTagNodeList.Create; HTMLParser1.Tree.GetTags('LI', Elements); for Counter := 0 to Elements.Count - 1 do ListBox1.Items.Add(Elements[Counter].GetPCData); Elements.Free; end; Figure 4: Using GetTags. TTagNodeList has two methods that offer another approach (assuming that a result set has already been acquired): GetTagCount and FindTagByIndex. GetTagCount returns a count of the occurrences of an element name. FindTagByIndex returns the index within the list of the specified occurrence of an element name. For instance, the statement: ShowMessage(Elements.FindTagByIndex('li', 1).GetPCData); were it included in Figure 4, would display the text for the second occurrence of the element in the Elements container. This can prove exceptionally useful for locating a specific tag from target HTML content. For example, if the third
in the HTML contained the desired data, the following code would make quick work of locating the root
HTMLParser1.GetTags('*', Elements); Node := Elements.FindTagByIndex('table', 2); if Assigned(Node) then begin // Perform processing on
end; Working with Results as a Hierarchy The procedure in Figure 5 provides yet another method for accessing the contents of the Tree (albeit a more brute force approach). In this example, the HTML content is assumed to be fairly straightforward:
elements, which contain and elements. Given this reasonably accurate assumption, the code will walk the children of a
node, and for each
node found will walk its children looking for occurrences of either or , and add their text (contained in PCData nodes) to a TStrings container, e.g. the Lines property of a TMemo control. // Return TableNode's contents in a TStrings container. procedure GetTable(TableNode: TTagNode; Lines: TStrings); var RowCtr, DataCtr: Integer; Node, RowNode: TTagNode; TempStr: string; begin Lines.Clear; if CompareText(TableNode.Caption, 'table') = 0 then begin for RowCtr := 0 to TableNode.ChildCount - 1 do begin RowNode := TableNode.Children[RowCtr]; if CompareText(RowNode.Caption, 'tr') = 0 then begin TempStr := ''; for DataCtr := 0 to RowNode.ChildCount - 1 do begin Node := RowNode.Children[DataCtr]; if CompareText(Node.Caption, 'td') = 0 then TempStr := TempStr + Node.GetPCData + #9 else if CompareText(Node.Caption, 'th') = 0 then TempStr := TempStr + Node.GetPCData + #9; end; TempStr := Trim(TempStr); if TempStr <> '' then Lines.Add(TempStr); end; end; end; end; Figure 5: Working with results as a hierarchy. As an illustration of just how difficult HTML processing can be, the following caveats apply to the code provided in Figure 5 and would have to be handled to provide a robust solution: Subtables are not handled. That is,
elements encountered within a
<< Back to main page