Programming Documents: PHP and SAX

PHP and SAX

PHP 4.0 comes with a very capable SAX parser based on the expat library. Created by James Clark, the expat library is a fast, robust SAX implementation that provides XML parsing capabilities to a number of open-source projects, including the Mozilla browser (http://www.mozilla.org/).

If you're using a stock PHP binary, it's quite likely that you'll need to recompile PHP to add support for this library to your PHP build. Detailed instructions for accomplishing this are available in Appendix A, "Recompiling PHP to Add XML Support."

A Simple Example

You can do a number of complex things with a SAX parser; however, I'll begin with something simple to illustrate just how it all fits together. Let's go back to the previous XML document (see Listing 2.1), and write some PHP code to process this document and do something with the data inside it (see Listing 2.2).

Listing 2.2 Generic PHP-Based XML Parser


```
<html>
<head>
<basefont face="Arial">
</head>
<body>

<?php
// XML data file
$xml_file = "fox.xml";

// initialize parser
$xml_parser = xml_parser_create();

// set callback functions
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

// read XML file
if (!($fp = fopen($xml_file, "r")))
{
      die("File I/O error: $xml_file");
}

// parse XML
while ($data = fread($fp, 4096))
{
      // error handler
      if (!xml_parse($xml_parser, $data, feof($fp)))
      {
            die("XML parser error: " .
                  xml_error_string(xml_get_error_code($xml_parser)));
      }
}

// all done, clean up!
xml_parser_free($xml_parser);
?>
</body>
</html>
```

I'll explain Listing 2.2 in detail:

The first order of business is to initialize the SAX parser. This is accomplished via PHP's aptly named xml_parser_create() function, which returns a handle for use in successive operations involving the parser.
```
$xml_parser = xml_parser_create(); 
```

With the parser created, it's time to let it know which events you would like it to monitor, and which user-defined functions (or callback functions) it should call when these events occur. For the moment, I'm going to restrict my activities to monitoring start tags, end tags, and the data embedded within them:


```
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");
```

Speaking Different Tongues

It's possible to initialize the parser with a specific encoding. For example:


```
$xml_parser = xml_parser_create("UTF-8");
```

PHP's SAX parser currently supports the following encodings:

ISO-8859-1

US-ASCII

UTF-8

An attempt to use an unsupported encoding will result in a slew of ugly error messages. Try it yourself to see what I mean.
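As a rough sketch of how the encoding setting can be inspected, the snippet below creates a parser with an explicit source encoding and then reads back the encoding in which handlers will receive their data. It relies on xml_parser_set_option(), xml_parser_get_option(), and the XML_OPTION_TARGET_ENCODING option from PHP's XML extension, none of which have been introduced in the text so far; the variable names are my own.

```php
<?php
// Create a parser that treats its input as ISO-8859-1.
$latin_parser = xml_parser_create("ISO-8859-1");

// Ask for handler data to be delivered as UTF-8 instead.
xml_parser_set_option($latin_parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

// Read the setting back to confirm what the handlers will receive.
$target_encoding = xml_parser_get_option($latin_parser, XML_OPTION_TARGET_ENCODING);
echo "Handlers will receive data encoded as: $target_encoding\n";
```

The source encoding (given to xml_parser_create()) and the target encoding are separate settings, so a document can be read in one of the supported encodings and handed to your callbacks in another.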

What have I done with those two calls to xml_set_element_handler() and xml_set_character_data_handler()? Very simple. I've told the parser to call the function startElementHandler() when it finds an opening tag, the function endElementHandler() when it finds a closing tag, and the function characterDataHandler() whenever it encounters character data within the document.

When the parser calls these functions, it will automatically pass them all relevant information as function arguments. Depending on the type of callback registered, this information could include the element name, element attributes, character data, processing instructions, or notation identifiers.

From Listing 2.2, you can see that I haven't defined these functions yet; I'll do that a little later, and you'll see how this works in practice. Until these functions have been defined, any attempt to run the code from Listing 2.2 as it is right now will fail.

Now that the callback functions have been registered, all that remains is to actually parse the XML document. This is a simple exercise. First, create a file handle for the document:
```
if (!($fp = fopen($xml_file, "r"))) 
{
      die("File I/O error: $xml_file"); 
} 
```
Then, read in chunks of data with fread(), and parse each chunk using the xml_parse() function:
```
while ($data = fread($fp, 4096)) 
{
    // error handler 
      if (!xml_parse($xml_parser, $data, feof($fp))) 
      {
            die("XML parser error: " . 
xml_error_string(xml_get_error_code($xml_parser))); 
      } 
} 
```
In the event that errors are encountered while parsing the document, the script will automatically terminate via PHP's die() function. Detailed error information can be obtained via the xml_error_string() and xml_get_error_code() functions (for more information on how these work, see the "Handling Errors" section).
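To see the error path in action, here is a minimal sketch that feeds the parser a deliberately malformed string and builds a message instead of calling die(). The sample XML and the variable names are made up for illustration, and xml_get_current_line_number() is an additional function from PHP's XML extension that the listing above does not use.

```php
<?php
// A deliberately malformed document: <animal> is closed by the wrong tag.
$bad_xml = "<sentence>The <animal>fox</vegetable></sentence>";

$error_parser = xml_parser_create();

$error_message = "";
if (!xml_parse($error_parser, $bad_xml, true)) {
    // Translate the numeric error code into text and note where it occurred.
    $code = xml_get_error_code($error_parser);
    $line = xml_get_current_line_number($error_parser);
    $error_message = "XML parser error: " . xml_error_string($code) . " at line $line";
}

echo $error_message, "\n";
```

The exact wording returned by xml_error_string() depends on the underlying XML library, so treat the message as something to show a human rather than something to match against.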

After the complete file has been processed, it's good programming practice to clean up after yourself by destroying the XML parser you created:
```
xml_parser_free($xml_parser); 
```
That said, in the event that you forget, PHP will automatically destroy the parser for you when the script ends.

Endgame

You already know that SAX can process XML data in chunks, making it possible to parse XML documents larger than available memory. Ever wondered how it knows when to stop?

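That's where the optional third parameter to xml_parse() comes in. As each chunk of data is read from the XML file, it is passed to the xml_parse() function for processing. While more data remains, feof() returns false and the parser waits for the next chunk. When the end of the file is reached, feof() returns true, which tells the parser that this is the final chunk, so it can finish processing, confirm that the document is complete, and take a well-deserved break.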
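The effect of that third parameter can be sketched without a file at all. In the snippet below, the XML fragments and variable names are invented for illustration; the same incomplete data is accepted when more input is promised, and rejected when it is flagged as the final chunk.

```php
<?php
// Feed a document to the parser in two pieces, as the fread() loop does.
$chunk_parser = xml_parser_create();

// First piece: more data is still to come, so the final flag is false.
$first_ok = xml_parse($chunk_parser, "<sentence>The <animal>fox</animal>", false);

// Last piece: the final flag is true, so the parser checks that the document is complete.
$last_ok = xml_parse($chunk_parser, " vanished.</sentence>", true);

// The same opening data, but flagged as final straight away: the parser
// reports an error because <sentence> is never closed.
$truncated_parser = xml_parser_create();
$truncated_ok = xml_parse($truncated_parser, "<sentence>The <animal>fox</animal>", true);

echo "first chunk: $first_ok, last chunk: $last_ok, truncated: $truncated_ok\n";
```

This is why the listing passes feof($fp) rather than a constant: only the last call should claim that the document has ended.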

The preceding four steps make up a pretty standard process, and you'll find yourself using them over and over again when processing XML data with PHP's SAX parser. For this reason, you might find it more convenient to package them as a separate function, and call this function wherever required, a technique demonstrated in Listing 2.23.
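One way such a function might look is sketched here. This is not the book's Listing 2.23; the function name parse_xml_file(), its parameters, and the choice to return an error string instead of calling die() are all my own assumptions. The usage part writes a small temporary file and uses anonymous functions (available from PHP 5.3) as handlers purely to keep the example self-contained; string function names work just as well.

```php
<?php
// Hypothetical helper: parse $xml_file with the given callbacks.
// Returns an empty string on success, or an error message on failure.
function parse_xml_file($xml_file, $start_handler, $end_handler, $cdata_handler)
{
    // step 1: initialize parser
    $xml_parser = xml_parser_create();

    // step 2: set callback functions
    xml_set_element_handler($xml_parser, $start_handler, $end_handler);
    xml_set_character_data_handler($xml_parser, $cdata_handler);

    // step 3: read and parse the file in chunks
    if (!($fp = fopen($xml_file, "r"))) {
        return "File I/O error: $xml_file";
    }
    $error = "";
    while ($data = fread($fp, 4096)) {
        if (!xml_parse($xml_parser, $data, feof($fp))) {
            $error = "XML parser error: " .
                xml_error_string(xml_get_error_code($xml_parser));
            break;
        }
    }
    fclose($fp);

    // step 4: clean up
    xml_parser_free($xml_parser);
    return $error;
}

// Usage sketch: count the opening tags in a small temporary file.
$element_count = 0;
$tmp_file = tempnam(sys_get_temp_dir(), "xml");
file_put_contents($tmp_file, '<sentence>The <animal color="blue">fox</animal> leaped.</sentence>');

$result = parse_xml_file(
    $tmp_file,
    function ($parser, $name, $attributes) use (&$element_count) {
        $element_count++;   // one more opening tag seen
    },
    function ($parser, $name) {
        // closing tags are ignored here
    },
    function ($parser, $cdata) {
        // character data is ignored here
    }
);
unlink($tmp_file);

echo $result === "" ? "Parsed $element_count elements\n" : "$result\n";
```

Returning the error lets the caller decide whether to stop the script, which tends to be more convenient than die() once the code is shared between pages.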

With the generic XML processing code out of the way, let's move on to the callback functions registered near the top of the script. You'll remember that I registered the following three functions:

startElementHandler(): Executed when an opening tag is encountered

endElementHandler(): Executed when a closing tag is encountered

characterDataHandler(): Executed when character data is encountered

Listing 2.3 is the revised script with these handlers included.

Listing 2.3 Defining SAX Callback Functions


```
<html>
<head>
<basefont face="Arial">
</head>
<body>
<?php

// run when start tag is found
function startElementHandler($parser, $name, $attributes)
{
      echo "Found opening tag of element: <b>$name</b> <br>";

      // process attributes
      // (foreach replaces the older list()/each() idiom, which newer PHP versions no longer support)
      foreach ($attributes as $key => $value)
      {
            echo "Found attribute: <b>$key = $value</b> <br>";
      }
}

// run when end tag is found
function endElementHandler($parser, $name)
{
      echo "Found closing tag of element: <b>$name</b> <br>";
}

// run when character data is found
function characterDataHandler($parser, $cdata)
{
      echo "Found CDATA: <i>$cdata</i> <br>";
}

// XML data file
$xml_file = "fox.xml";

// initialize parser
$xml_parser = xml_parser_create();

// set callback functions
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

// read XML file
if (!($fp = fopen($xml_file, "r")))
{
      die("File I/O error: $xml_file");
}

// parse XML
while ($data = fread($fp, 4096))
{
      // error handler
      if (!xml_parse($xml_parser, $data, feof($fp)))
      {
            die("XML parser error: " .
                  xml_error_string(xml_get_error_code($xml_parser)));
      }
}

// all done, clean up!
xml_parser_free($xml_parser);

?>
</body>
</html>
```

Nothing too complex here. The tag handlers print the names of the tags they encounter, whereas the character data handler prints the data enclosed within the tags. Notice that the startElementHandler() function automatically receives the tag name and attributes as function arguments, endElementHandler() receives just the tag name, and characterDataHandler() gets the character data (labeled CDATA in the output).
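If you'd like to look at those arguments more closely, the following sketch records them in an array instead of echoing them. The $events array and the inline XML string are made up for illustration, and the handlers are written as anonymous functions (PHP 5.3 or later) only so the example fits in one piece.

```php
<?php
// Record each callback's arguments in an array rather than echoing them.
$events = array();

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    // start handler: receives the element name and an array of attributes
    function ($parser, $name, $attributes) use (&$events) {
        $events[] = array("start", $name, $attributes);
    },
    // end handler: receives only the element name
    function ($parser, $name) use (&$events) {
        $events[] = array("end", $name);
    }
);
xml_set_character_data_handler(
    $parser,
    // character data handler: receives a string of text
    function ($parser, $cdata) use (&$events) {
        $events[] = array("text", $cdata);
    }
);

xml_parse($parser, '<animal color="blue">fox</animal>', true);
print_r($events);
```

Every handler also receives the parser handle as its first argument, which becomes useful once a single set of functions serves more than one parser.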

And when you execute the script through a browser, here's what the end product looks like (and if you're wondering why all the element names are in uppercase, take a look at the "Controlling Parser Behavior" section):


```
Found opening tag of element: SENTENCE
Found CDATA: The
Found opening tag of element: ANIMAL
Found attribute: COLOR = blue
Found CDATA: fox
Found closing tag of element: ANIMAL
Found CDATA: leaped over the
Found opening tag of element: VEGETABLE
Found attribute: COLOR = green
Found CDATA: cabbage
Found closing tag of element: VEGETABLE
Found CDATA: patch and vanished into the darkness.
Found closing tag of element: SENTENCE
```
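As a brief preview of where those uppercase names come from: the parser applies what PHP calls case folding by default. The sketch below, which uses an invented inline document and anonymous-function handlers (PHP 5.3 or later), switches it off through xml_parser_set_option() and the XML_OPTION_CASE_FOLDING option so that element names arrive as written.

```php
<?php
// Collect element names as the parser reports them.
$names = array();

$parser = xml_parser_create();

// Turn off case folding so element names are reported as written.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attributes) use (&$names) {
        $names[] = $name;
    },
    function ($parser, $name) {
        // end tags are ignored in this sketch
    }
);

xml_parse($parser, "<sentence><animal>fox</animal></sentence>", true);
echo implode(", ", $names), "\n";
```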

Not all that impressive, certainly, but then again, we're just getting started!

Monday, November 2, 2009
