Problem/Question/Abstract: Three Classes for Grabbing HTML/XML Information

Answer:

I recently bought one of the new digital satellite dishes and ran across an interesting challenge: figuring out just what was on, and when. DIRECTV provided a Web site with a basic search engine, but the Web-based search was slow and very narrowly focused. As a typical programmer, I knew I could provide a better, more powerful UI if I could just figure out how their Web search engine worked. A quick scan of the HTML source revealed a relatively simple search form that I could easily duplicate, but the results came back in a mildly complicated HTML table. Brute-force code to parse through the table would have been simple enough to construct, but I'd been looking for a reason to build a more general parser, so off I went. If I'd known just how lax the HTML rules are, and just how many hacks there are, I'd have stuck with the brute-force method and saved myself a lot of agony, but since I'm here now...

The Basics

To put together a parser for HTML, an understanding of the rules is required. HTML originated as an SGML-like syntax, and over time has grown to fit more closely within the confines of SGML. These days the syntax is described within an SGML Document Type Definition (DTD), bringing it into a reasonably well-understood and managed domain. Given that SGML now establishes the underpinnings of HTML, the parser should apply the SGML syntax rules as a starting point. This also allows for the consideration of some simple extensions that allow parsing of XML. Therefore, the parser is built to work on SGML in general, with specific handlers for exceptions and extensions that occur in HTML and XML.
The rules for SGML are straightforward, and provide five basic constructs that we care about: elements, attributes, comments, SGML directives, and "everything else." Elements are the facet of the SGML content with which we are most concerned, and around which the parser is established. Elements have a start tag, content, and an end tag, for example:

  <TITLE>HTML Parsing</TITLE>

where TITLE is considered to be the "name" of the tag. Element names are case-insensitive. So we start with the following parsing rules:

- Element start and end tags are surrounded by < and > characters.
- Element end tags are denoted by the / character immediately following a <.
- Element content is surrounded by start and end tags.

HTML Extensions

In HTML, we immediately note that there are exceptions to these rules. For some elements - most notably <P> - the end tags may be omitted, even though the element may have contents. This offers perhaps the most annoying challenge of the HTML parsing rule set, because there are several methods by which the element may be terminated. To start with, we note another syntax rule from SGML: elements may not span. That is, if an element's start tag is contained within another element's start and end tags, its end tag must also appear there. Put simply, if we encounter an end tag, all omitted end tags are considered "closed" back up to the matching start tag. Also, by observation (I couldn't find a formal rule for this in the HTML specification), virtually all elements with optional end tags close when they encounter another of themselves,
<P> and <LI> being fine examples. Further, the HTML reference material does indicate that elements that omit their end tags are terminated by "block elements." The HTML DTD must be consulted to determine which elements are considered block elements. Unfortunately, all of this prevents us from using a general rule, and requires that we become concerned with the HTML DTD. A quick consideration of the DTD is therefore in order. The DTD calls out which elements must have end tags, and which may omit (or are forbidden to have) end tags. For example, the DTD fragment for <P>
is:

  <!ELEMENT P - O (%inline;)* -- paragraph -->

The important thing to note here is "- O". The - indicates the start tag is required, and the O means the end tag may be omitted. Compare this to the fragment for <BR>:

  <!ELEMENT BR - O EMPTY -- forced line break -->

where - O EMPTY indicates that the start tag is required. Since the element is EMPTY, however, the optional end tag is now expressly forbidden. In the fragment for <P>
, the (%inline;)* shows the legal contents of the P element. In this case, %inline; refers to a list of elements defined earlier in the DTD. Notably missing from the %inline; list is <P>
itself. A perusal of the other ambiguous elements reinforces the observation that, in general, an element may not immediately contain itself (although this ambitiously general rule is certainly not guaranteed to remain valid for future releases of the HTML DTD). In the same way that %inline; is defined, so there exists a list named %block; which contains the list of block elements. This leads to another set of parsing rules:
- An element with an omitted end tag is terminated when a block element is encountered.
- Elements are terminated when another element of the same name is encountered.
- Elements are terminated if a parent's end tag is encountered; no spans are allowed.

Attributes

Attributes represent the various properties of elements. By definition, attributes appear in name/value pairs within the start tag of the element. For example, in:

  <BODY bgcolor="#FFFFFF">
the BODY element has an attribute named bgcolor, and the attribute has a value of #FFFFFF. Double quotation marks or single quotation marks are required to delimit the value, unless the value contains only letters, digits, hyphens, and periods. If the value contains single quotation marks, it should be delimited with double quotation marks, and vice versa. Attribute names are case-insensitive. Also worth noting is that not all attributes have a value; for instance, the NOWRAP attribute of the <TD> element.

- Attributes appear within an element's start tag.
- Attributes are delimited by a space character (ASCII 32).
- Attribute values are delimited by " or '.

Comments

SGML also provides that its content may include comments. Comments are of the form:

  <!-- this is a comment -->

The comment is opened by <!-- and closed by --> (not simply by the < and > characters). Further, a comment may contain < and > characters. Comments may not include other comments.

- --> terminates a comment.
- < and > are ignored while parsing a comment.

The remaining SGML directives are denoted by the beginning markup declaration open delimiter <!.

- Comments within the directives are delimited by --.

Lastly, we consider what remains. Content not contained within a start tag, end tag, or comment is considered by the parser to be PCData (parsed character data).

- Store text not included in element start/end tags or comments as PCData.

XML Extensions

As mentioned before, XML is also derived from SGML. While HTML is basically a DTD described within and using SGML, XML is a subset of SGML capable both of representing data and containing other DTDs of its own. XML also demonstrates that those working on the standards in the programming community actually do learn from the mistakes of those who went before them. For instance, the rules of containment are much more formal in XML than they are in HTML, making parsing a great deal simpler. This means that while a DTD may be included in an XML document for syntax-checking purposes, it isn't necessarily required for the actual parsing of the XML content, as it is for HTML.
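The lexical rules gathered so far lend themselves to a simple dispatcher that classifies the next construct by its leading characters. The following is a minimal sketch of that idea only; the names (TConstruct, ClassifyAt) are hypothetical and not part of the article's actual THTMLParser implementation:

```pascal
type
  { Hypothetical: one value per construct the rules distinguish. }
  TConstruct = (csPCData, csComment, csDirective, csEndTag, csStartTag);

{ Classify the construct beginning at position P of S (1-based),
  per the parsing rules collected above. }
function ClassifyAt(const S: string; P: Integer): TConstruct;
begin
  if S[P] <> '<' then
    Result := csPCData        // text outside markup is PCData
  else if Copy(S, P, 4) = '<!--' then
    Result := csComment       // comments open with <!--
  else if Copy(S, P, 2) = '<!' then
    Result := csDirective     // markup declaration open delimiter <!
  else if Copy(S, P, 2) = '</' then
    Result := csEndTag        // / immediately following < marks an end tag
  else
    Result := csStartTag;     // otherwise an element start tag
end;
```

Note that the comment test must run before the directive test, since <!-- would otherwise match the <! delimiter first; the same ordering concern applies inside any real tokenizer built on these rules.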
Knowing this, we can add two more rules and provide DTD-less XML parsing as well. For one, empty elements in HTML are simply called out as such in the DTD with their end tags forbidden (<BR>, for example). If an element in XML is to be empty (that is, it will have no content), its start tag may have a / just before the closing > indicating that no content and no end tag will follow. Additionally, XML directives may appear with the ? character rather than !.

- Empty elements in XML may be terminated by a / just before the > in the start tag, e.g. <BR/>.
- Additional directives appear using ?, instead of the ! character.

Everything Else

Any items encountered in the content that are not contained in element start or end tags, comments, or DTD items are considered by the parser to be PCData. The content of elements fits this bill, as do carriage returns and line feeds encountered by the parser. This leads to the final parsing rule:

- Any content located outside of start/end tags, comments, or DTD items is PCData.

Additional Considerations

It is also worth noting that syntax errors and occurrences of "browser-tolerated" HTML inconsistencies are frequently encountered, and as such, should not raise exceptions except in extreme cases. Instead, a warning should be noted and parsing should continue if possible.

The Parser

The goal of parsing the SGML content is to place the data it represents into a form more readily accessible to other components. Since SGML calls out a hierarchical structure, a hierarchy is probably the most accurate way to store the parsed content. With that in mind, the parser is built from two primary classes, and a third supporting class. First and foremost is the THTMLParser class. Its Parse method accepts the content to be parsed, and places the processed results in the Tree property. Next is the TTagNode class in which the parsed results are contained.
This class is a hierarchical storage container, with Parent pointing to the TTagNode that contains the current node, and Children containing a list of children immediately owned by the current node. TTagNodeList is provided as a list container for a collection of TTagNode objects, typically produced by a call to the GetTags method of the TTagNode class.

A Simple Example

Consider the sample HTML shown in Figure 1. The parser would produce from the HTML a hierarchy that can be visualized as in Figure 2. Each of the boxes in the tree represents a TTagNode instance.

  <HTML>
  <HEAD>
  <TITLE>Example HTML</TITLE>
  </HEAD>
  <!-- Example HTML -->
  <BODY>
  Plain old text right out there in the middle of the document.
  <P>Text contained within a paragraph
  <UL>
  <LI>Unordered List
  </UL>
  </BODY>
  </HTML>

Figure 1: Sample HTML.

Figure 2: Hierarchical representation of parsed HTML.

Each node has a NodeType property that indicates what type of node it is. All node types except ntePCData also have text in the Caption property that provides more information about the node's contents. See the table in Figure 3 for details.

  Node Contents             NodeType     Caption
  HTML elements             nteElement   Element name
  Text                      ntePCData
  Comments                  nteComment   !
  SGML/XML/DTD directives   nteDTDItem   ! or ? and directive

Figure 3: TTagNode node types.

For example, the HTML node in Figure 2 has a NodeType of nteElement, while the comment tag is of type nteComment, and the PCData nodes are of type ntePCData. The content or text for each HTML element is contained in a node in its Children list. For example, the TITLE node in Figure 2 has a PCData node whose Text property contains "Example HTML". The GetPCData method of a TTagNode returns the PCData text for all children of the node for which it's called. Note that this method is recursive, and will return the PCData text for all nodes in the tree beneath the node upon which it's called.

Retrieving Elements

The GetTags method of a TTagNode will return a list of all children that match an element name. If '*' or '' (an empty string) is specified as the element name, all children will be returned. Note that this method is recursive. The result list is a TTagNodeList. The code in Figure 4 illustrates how the GetTags method is used to collect a list of all <LI> elements in the HTML from Figure 1 and insert their contents into a list box. The process is as follows:

1. Create a container for the results, i.e. the Elements list.
2. Call the GetTags method, passing the desired element name and the container.
3. Iterate through the container, placing the text for each element in a list box.
4. Destroy the container.
procedure Button1OnClick(Sender: TObject);
var
  Elements: TTagNodeList;
  Counter: Integer;
begin
  Elements := TTagNodeList.Create;
  HTMLParser1.Tree.GetTags('LI', Elements);
  for Counter := 0 to Elements.Count - 1 do
    ListBox1.Items.Add(Elements[Counter].GetPCData);
  Elements.Free;
end;

Figure 4: Using GetTags.

TTagNodeList has two methods that offer another approach (assuming that a result set has already been acquired): GetTagCount and FindTagByIndex. GetTagCount returns a count of the occurrences of an element name. FindTagByIndex returns the node for the specified occurrence of an element name (the occurrence index is zero-based). For instance, the statement:

  ShowMessage(Elements.FindTagByIndex('li', 1).GetPCData);

were it included in Figure 4, would display the text for the second occurrence of the <LI> element in the Elements container. This can prove exceptionally useful for locating a specific tag from target HTML content. For example, if the third <TABLE> in the HTML contained the desired data, the following code would make quick work of locating the root node:

  HTMLParser1.Tree.GetTags('*', Elements);
  Node := Elements.FindTagByIndex('table', 2);
  if Assigned(Node) then
  begin
    // Perform processing on node.
  end;

Working with Results as a Hierarchy

The procedure in Figure 5 provides yet another method for accessing the contents of the Tree (albeit a more brute-force approach). In this example, the HTML content is assumed to be fairly straightforward: <TABLE> elements contain <TR> elements, which contain <TD> and <TH> elements. Given this reasonably accurate assumption, the code will walk the children of a <TABLE> node, and for each <TR> node found will walk its children looking for occurrences of either <TD> or <TH>, and add their text (contained in PCData nodes) to a TStrings container, e.g. the Lines property of a TMemo control.

// Return TableNode's contents in a TStrings container.
procedure GetTable(TableNode: TTagNode; Lines: TStrings);
var
  RowCtr, DataCtr: Integer;
  Node, RowNode: TTagNode;
  TempStr: string;
begin
  Lines.Clear;
  if CompareText(TableNode.Caption, 'table') = 0 then
  begin
    for RowCtr := 0 to TableNode.ChildCount - 1 do
    begin
      RowNode := TableNode.Children[RowCtr];
      if CompareText(RowNode.Caption, 'tr') = 0 then
      begin
        TempStr := '';
        for DataCtr := 0 to RowNode.ChildCount - 1 do
        begin
          Node := RowNode.Children[DataCtr];
          if CompareText(Node.Caption, 'td') = 0 then
            TempStr := TempStr + Node.GetPCData + #9
          else if CompareText(Node.Caption, 'th') = 0 then
            TempStr := TempStr + Node.GetPCData + #9;
        end;
        TempStr := Trim(TempStr);
        if TempStr <> '' then
          Lines.Add(TempStr);
      end;
    end;
  end;
end;

Figure 5: Working with results as a hierarchy.

As an illustration of just how difficult HTML processing can be, the following caveats apply to the code provided in Figure 5, and would have to be handled to provide a robust solution:

- Subtables are not handled. That is, <TABLE> elements encountered within a <TD> element are ignored.
- Row and column spanning is not handled.
- A host of other table elements are not considered (although admittedly they are rare).

Working with Results as a List

The procedure in Figure 6 demonstrates working with the parsed results as a list. The goal here is to retrieve a list of all comments, links, meta tags, and images from the document, with the <TITLE> thrown in for good measure. The code does this in several simple steps:

1. Parse the HTML.
2. Create a container for the list of matching nodes from the tree.
3. Call the GetTags method, passing '*' and the container ('*' indicates that all items in the tree should be returned).
4. Iterate through the container collecting matches, and place their contents in the StringList.
5. Destroy the container.

procedure TForm1.Parse(HTML: string; Lines: TStrings);
const
  cKnownTags = '|title|img  |a    |meta |!
';
  cTITLE   = 0;
  cIMG     = 1;
  cA       = 2;
  cMETA    = 3;
  cComment = 4;
var
  Index, Counter: Integer;
  TempStr: string;
  Nodes: TTagNodeList;
begin
  HTMLParser1.Parse(HTML);
  Nodes := TTagNodeList.Create;
  // Retrieve all nodes.
  HTMLParser1.Tree.GetTags('*', Nodes);
  for Counter := 0 to Nodes.Count - 1 do
  begin
    TempStr := '|' + LowerCase(Nodes[Counter].Caption);
    // Index of element name.
    Index := Pos(TempStr, cKnownTags);
    if Index > 0 then
    begin
      Index := Index div 6;
      case Index of
        cTITLE:
          Lines.Add('Title=' + HTMLDecode(Nodes[Counter].GetPCData));
        cIMG:
          begin
            TempStr := Nodes[Counter].Params.Values['src'];
            if TempStr <> '' then
              Lines.Add(TempStr);
          end;
        cA:
          begin
            TempStr := Nodes[Counter].Params.Values['href'];
            if TempStr <> '' then
              Lines.Add(TempStr + '=' +
                HTMLDecode(Nodes[Counter].GetPCData));
          end;
        cMETA:
          with Nodes[Counter].Params do
            Lines.Add(Values['name'] + '=' + Values['content']);
        cComment:
          Lines.Add('[Comment] ' + HTMLDecode(Nodes[Counter].Text));
      end; { case Index }
    end; { if Index > 0 }
  end; { for Counter }
  Nodes.Free;
end;

Figure 6: Working with results as a list.

The important thing to understand here is that the TTagNodeList class is just a list of pointers to nodes from the Tree. This is quite beneficial in that once a desired node is located in the list, it may be used as if it had been acquired by traversing the tree. For example, when the case statement in Figure 6 encounters a TITLE element, its contents are retrieved by making a call to the TITLE node's GetPCData method (which depends on the parsed tree structure behaving as it appears to in Figure 2). Note that the PCData often contains encoded items such as &gt; and &lt; (> and < respectively). HTMLDecode is provided for handling most simple cases, but doesn't handle all cases (notably non-US character encodings). This example also demonstrates the use of the attributes from an element. When an <A> element is encountered, the HREF attribute is examined.
If it exists, the element is treated as a link to some other resource. If the HREF attribute were not specified, this might be an instance of the <A> element serving as an anchor instead of a link. For more details on how the Params property of the TTagNode behaves, see the Delphi help for the Names and Values properties of the TStrings class.

Searching the www.directv.com Program Guide

Applying the HTML parser to the original need turns out to be another simple (albeit involved) exercise. First, an understanding of the CGI scripts that allow searching of the program guide is required. As it turns out, the search script is rather crude and accepts only three parameters: timezone, category, and search text. timezone is simply a number representing Eastern, Central, Mountain, or Pacific. category allows the search to span all programs, or to be narrowed to certain types of programs, such as movies or sports. The search text should be all, or a portion of, the desired program name. The search is specific to program names, and doesn't consider program descriptions. The results of the search are returned as an HTML table including channel, date, time, duration, and program name. The program name is contained within a link to the program description, which will need to be retrieved as well. This is slightly complicated by the fact that the link is written using JavaScript, which we cannot simply call. However, the URL produced by the JavaScript function is easy to replicate, as it contains a program ID number that can be passed to another CGI script that returns the desired description. The next step is to parse the HTML and process the results into a more useful format. TStringTable is provided as a simple container for just this purpose. TStringTable offers a non-visual equivalent to TStringGrid, with a few additional methods to make manipulating the data a bit easier. Once the HTML table has been processed and placed in the string table, a bit of house cleaning is required.
For one, there are rows at the end of the HTML table that need to be ignored, as they contain images, not program content. Also, the channel appears only in the first row of a set of programs that occur on that channel. The contents can now be added to a ListView. Once that task is complete, the descriptions can be fetched using the program IDs to call the description CGI, and then added to the ListView.

Further Study

While these examples are not terribly glamorous, the parser can be applied to more meaningful problems. For instance, a friend of mine has put together an extremely handy application using the Pricewatch site (http://www.pricewatch.com/) to monitor prices on PC hardware. Pricewatch offers a current snapshot of the pricing of a particular piece of hardware from various vendors, usually sorted from least expensive to most. However, it doesn't allow for viewing of several different pieces of hardware at once, and it doesn't track the history of price changes for the hardware. So, the application provides a way to build a list of hardware to be tracked, and then offers a simple trend analysis by gathering and saving off the price information on a regular basis. This provides a useful picture for the consumer of just how quickly prices are moving downward on a particular item. If the pricing is in a steep downward curve, waiting to purchase might be wise. If the curve is flat, the time to purchase might be at hand. The parser is used not only to retrieve the pricing data, but also to help deal with one of the more significant issues facing those attempting to interface with sites they do not control: unexpected changes in the target site's contents. In this case, a review of the