Wednesday, May 10, 2006

Scripting: "Print this page"

The "Print this page" functionality I developed for the Web site recently broke with the introduction of a new template. The main reason was my reliance on a regular expression to parse the document for the sections that needed to be output on the printer-friendly version of the page. The new template was missing the text I had been using to delineate these sections.

Unfortunately, accommodating the new templates (and any future additions) would have required writing a different regular expression for each template, because there were no other text delimiters unique enough to work for all of them. I like to keep my programming as generic as possible, so I decided to use the HTML::Parser module to pull the various sections from the document. This approach will work fine for all pages using the AAAS template, since I can rely on tags common to that template rather than text specific to any sub-templates being used.

It took a bit of trial and error to get things to work but I think I'm finally starting to get the hang of how the module works. The basic procedure is to find a tag of interest such as a div tag with the attribute id="contentBox" (which contains the document content) and then add handlers to the parser to grab all text contained within that element. You can add and remove handlers at will as the parser runs, making it easy to catch blocks of text in one pass.
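Here is a minimal sketch of that procedure, not the actual script from the site. The sample page, the variable names, and the on_start/on_end subroutines are made up for illustration; only the div with id="contentBox" comes from the real template. It also assumes the content div may contain nested div tags, so it counts depth to find the matching close tag.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical sample page; only id="contentBox" is taken from the real template.
my $html = <<'HTML';
<html><body>
<div id="nav">Home | About</div>
<div id="contentBox"><h1>Title</h1><div class="note">Nested note.</div><p>Body text.</p></div>
<div id="footer">Copyright</div>
</body></html>
HTML

my $content = '';   # text collected from inside the target div
my $depth   = 0;    # how many <div> levels deep we are inside the target

my $parser = HTML::Parser->new(api_version => 3);

# Permanent handler: inspect every start tag, looking for the target div.
$parser->handler(start => \&on_start, 'self, tagname, attr');

sub on_start {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'div';

    # Already inside the target: just track nested divs.
    if ($depth) { $depth++; return; }

    return unless defined $attr->{id} && $attr->{id} eq 'contentBox';

    # Found the target: add handlers that collect text and watch for the close tag.
    $depth = 1;
    $self->handler(text => sub { $content .= shift }, 'dtext');
    $self->handler(end  => \&on_end, 'self, tagname');
}

sub on_end {
    my ($self, $tag) = @_;
    return unless $tag eq 'div';
    return if --$depth;    # still inside the target div

    # Matching close tag reached: remove the temporary handlers again.
    $self->handler(text => undef);
    $self->handler(end  => undef);
}

$parser->parse($html);
$parser->eof;

print $content, "\n";
```

Because the text handler uses the dtext argument, the result is entity-decoded text with the markup stripped, and adjacent text nodes run together with no separator. A printer-friendly page that needs to keep the markup would have to collect the raw source instead.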

The only thing I'm not sure of at present is whether it's possible to capture sub-blocks easily. Thinking about it briefly, though, leads me to think that assigning the desired output to a variable and then parsing that variable for the sub-blocks would be a feasible method.
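A rough sketch of that two-pass idea follows; I haven't tested it against the real templates. The extract_block helper, the sample markup, and the relatedLinks id are all hypothetical. The helper keeps the raw source (the text argument) rather than decoded text, so the captured block can be fed back through the parser.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the raw markup inside the first <$tag id="$id"> element.
sub extract_block {
    my ($html, $tag, $id) = @_;
    my ($out, $depth) = ('', 0);

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr, $raw) = @_;
            if ($depth) {
                $depth++ if $tagname eq $tag;   # nested element of the same kind
                $out .= $raw;
            }
            elsif ($tagname eq $tag && defined $attr->{id} && $attr->{id} eq $id) {
                $depth = 1;                     # entered the target; skip its own tag
            }
        }, 'tagname, attr, text' ],
        end_h => [ sub {
            my ($tagname, $raw) = @_;
            return unless $depth;
            $depth-- if $tagname eq $tag;
            $out .= $raw if $depth;             # skip the target's own close tag
        }, 'tagname, text' ],
        # Everything else (text, comments, ...) is copied through while inside the target.
        default_h => [ sub {
            my $raw = shift;
            $out .= $raw if $depth && defined $raw;
        }, 'text' ],
    );

    $p->parse($html);
    $p->eof;
    return $out;
}

# Hypothetical page: contentBox is from the template, relatedLinks is invented.
my $page = '<div id="contentBox"><p>Intro.</p>'
         . '<div id="relatedLinks"><a href="/a">A</a></div>'
         . '<p>More.</p></div>';

my $content = extract_block($page,    'div', 'contentBox');    # first pass
my $links   = extract_block($content, 'div', 'relatedLinks');  # second pass on the variable

print "$content\n$links\n";
```

The second pass costs an extra parse of the captured block, but it keeps each pass simple: one target element, one depth counter.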