Monday, November 2, 2009

PHP and SAX
















PHP 4.0 comes with a very capable SAX parser based on the expat library. Created by James Clark, the expat library is a fast, robust SAX implementation that provides XML parsing capabilities to a number of open-source projects, including the Mozilla browser (http://www.mozilla.org/).


If you're using a stock PHP binary, it's quite likely that you'll need to recompile PHP to add support for this library to your PHP build. Detailed instructions for accomplishing this are available in Appendix A, "Recompiling PHP to Add XML Support."



A Simple Example


You can do a number of complex things with a SAX parser; however, I'll begin with something simple to illustrate just how it all fits together. Let's go back to the previous XML document (see Listing 2.1), and write some PHP code to process this document and do something with the data inside it (see Listing 2.2).



Listing 2.2 Generic PHP-Based XML Parser


<html>
<head>
<basefont face="Arial">
</head>
<body>

<?php
// XML data file
$xml_file = "fox.xml";

// initialize parser
$xml_parser = xml_parser_create();

// set callback functions
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

// read XML file
if (!($fp = fopen($xml_file, "r")))
{
    die("File I/O error: $xml_file");
}

// parse XML
while ($data = fread($fp, 4096))
{
    // error handler
    if (!xml_parse($xml_parser, $data, feof($fp)))
    {
        die("XML parser error: " .
            xml_error_string(xml_get_error_code($xml_parser)));
    }
}

// all done, clean up!
xml_parser_free($xml_parser);
?>
</body>
</html>

I'll explain Listing 2.2 in detail:



  1. The first order of business is to initialize the SAX parser. This is accomplished via PHP's aptly named xml_parser_create() function, which returns a handle for use in successive operations involving the parser.


    $xml_parser = xml_parser_create();

  2. With the parser created, it's time to let it know which events you would like it to monitor, and which user-defined functions (or callback functions) it should call when these events occur. For the moment, I'm going to restrict my activities to monitoring start tags, end tags, and the data embedded within them:


    xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
    xml_set_character_data_handler($xml_parser, "characterDataHandler");


    Speaking Different Tongues


    It's possible to initialize the parser with a specific encoding. For example:



    $xml_parser = xml_parser_create("UTF-8");

    PHP's SAX parser currently supports the following encodings:



    • ISO-8859-1


    • US-ASCII


    • UTF-8



    An attempt to use an unsupported encoding will result in a slew of ugly error messages. Try it yourself to see what I mean.



    What have I done here? Very simple. I've told the parser to call the function startElementHandler() when it finds an opening tag, the function endElementHandler() when it finds a closing tag, and the function characterDataHandler() whenever it encounters character data within the document.


    When the parser calls these functions, it will automatically pass them all relevant information as function arguments. Depending on the type of callback registered, this information could include the element name, element attributes, character data, processing instructions, or notation identifiers.


    From Listing 2.2, you can see that I haven't defined these functions yet; I'll do that a little later, and you'll see how this works in practice. Until these functions have been defined, any attempt to run the code from Listing 2.2 as it is right now will fail.


  3. Now that the callback functions have been registered, all that remains is to actually parse the XML document. This is a simple exercise. First, create a file handle for the document:


    if (!($fp = fopen($xml_file, "r")))
    {
        die("File I/O error: $xml_file");
    }

    Then, read in chunks of data with fread(), and parse each chunk using the xml_parse() function:



    while ($data = fread($fp, 4096))
    {
        // error handler
        if (!xml_parse($xml_parser, $data, feof($fp)))
        {
            die("XML parser error: " .
                xml_error_string(xml_get_error_code($xml_parser)));
        }
    }

    In the event that errors are encountered while parsing the document, the script will automatically terminate via PHP's die() function. Detailed error information can be obtained via the xml_error_string() and xml_get_error_code() functions (for more information on how these work, see the "Handling Errors" section).


  4. After the complete file has been processed, it's good programming practice to clean up after yourself by destroying the XML parser you created:


    xml_parser_free($xml_parser);

    That said, in the event that you forget, PHP will automatically destroy the parser for you when the script ends.
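
Step 2 noted that each type of callback receives its own set of arguments. As a rough illustration of what that looks like, the following sketch records every call the parser makes instead of printing it. The $events array, the collect* function names, and the sample document (including its made-up "render" processing instruction) are my own inventions for this example; the xml_* calls are PHP's standard XML functions.

```php
<?php
// Illustrative only: $events and the collect* names are hypothetical.
$events = array();

// element handlers: start tags get the name and an attribute array
function collectStart($parser, $name, $attributes)
{
    global $events;
    $events[] = array('start', $name, $attributes);
}

// end tags get the name only
function collectEnd($parser, $name)
{
    global $events;
    $events[] = array('end', $name);
}

// character data handlers get a piece of text
function collectText($parser, $data)
{
    global $events;
    $events[] = array('text', $data);
}

// processing instruction handlers get the PI target and its data
function collectInstruction($parser, $target, $data)
{
    global $events;
    $events[] = array('pi', $target, $data);
}

// sample document, inlined so the sketch is self-contained
$xml = '<sentence><?render mode="fast"?>The <animal color="blue">fox</animal></sentence>';

$parser = xml_parser_create();
xml_set_element_handler($parser, 'collectStart', 'collectEnd');
xml_set_character_data_handler($parser, 'collectText');
xml_set_processing_instruction_handler($parser, 'collectInstruction');

// the whole document is passed in one call, so it is also the final chunk
if (!xml_parse($parser, $xml, true)) {
    die('XML parser error: ' . xml_error_string(xml_get_error_code($parser)));
}

// show each recorded call and the arguments it received
foreach ($events as $event) {
    echo $event[0] . ': ' . json_encode(array_slice($event, 1)) . "\n";
}
```

Collecting the calls in an array, rather than echoing from inside each handler, makes it easier to see exactly which arguments arrived with which event.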




Endgame


You already know that SAX can process XML data in chunks, making it possible to parse XML documents larger than available memory. Ever wondered how it knows when to stop?


That's where the optional third parameter to xml_parse() comes in. As each chunk of data is read from the XML file, it is passed to the xml_parse() function for processing. The third parameter is a flag that marks a chunk as the last one: it stays false while more data remains and becomes true once feof() reports the end of the file, which tells the parser to finish up and take a well-deserved break.
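
To see the effect of that third parameter in isolation, here is a small sketch that feeds the parser string chunks by hand instead of reading a file. The parseChunks() helper and the sample chunks are hypothetical; only the last chunk in each list is flagged as final.

```php
<?php
// Hypothetical helper: feeds a list of string chunks to a fresh parser and
// flags only the last chunk as final, much as feof() does in the listing.
// Returns true on success, or the parser's error message on failure.
function parseChunks($chunks)
{
    $parser = xml_parser_create();
    $last = count($chunks) - 1;

    foreach ($chunks as $i => $chunk) {
        if (!xml_parse($parser, $chunk, $i === $last)) {
            return xml_error_string(xml_get_error_code($parser));
        }
    }
    return true;
}

// a complete document, deliberately split in the middle of a tag
$complete = array('<sentence>The <ani', 'mal color="blue">fox</animal>', '</sentence>');
var_dump(parseChunks($complete));

// the same document with its last chunk missing; the final flag is what
// lets the parser report that the root element was never closed
$truncated = array('<sentence>The <ani', 'mal color="blue">fox</animal>');
var_dump(parseChunks($truncated));
```

The first call should succeed even though a tag straddles two chunks, because the parser keeps its state between calls. The second should come back with an error message rather than true, since the document ends before its root element does.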



The preceding four steps make up a pretty standard process, and you'll find yourself using them over and over again when processing XML data with PHP's SAX parser. For this reason, you might find it more convenient to package them as a separate function, and call this function wherever required, a technique demonstrated in Listing 2.23.
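
As a minimal sketch of what such packaging might look like (this is not Listing 2.23, and the parseXmlFile() name, its parameters, and its return convention are assumptions of mine), the four steps can be wrapped like this. The sample file is written to a temporary location so that the example runs on its own.

```php
<?php
// Hypothetical wrapper around the four steps. Returns true on success,
// or an error message string on failure.
function parseXmlFile($file, $startHandler, $endHandler, $dataHandler)
{
    // step 1: initialize the parser
    $parser = xml_parser_create();

    // step 2: register the callback functions
    xml_set_element_handler($parser, $startHandler, $endHandler);
    xml_set_character_data_handler($parser, $dataHandler);

    // step 3: read the file and parse it chunk by chunk
    if (!($fp = fopen($file, 'r'))) {
        return "File I/O error: $file";
    }
    $result = true;
    while (!feof($fp)) {
        $data = fread($fp, 4096);
        if (!xml_parse($parser, (string) $data, feof($fp))) {
            $result = 'XML parser error: ' .
                xml_error_string(xml_get_error_code($parser));
            break;
        }
    }
    fclose($fp);

    // step 4: clean up; from PHP 8.0 on the parser is an object that is
    // released automatically, so the explicit call only matters before that
    if (PHP_VERSION_ID < 80000) {
        xml_parser_free($parser);
    }
    return $result;
}

// hypothetical handlers that simply note what they saw
$seen = array();
function noteStart($parser, $name, $attributes) { global $seen; $seen[] = "start $name"; }
function noteEnd($parser, $name) { global $seen; $seen[] = "end $name"; }
function noteData($parser, $data) { /* character data ignored in this sketch */ }

// write a small sample document to a temporary file, then parse it
$file = tempnam(sys_get_temp_dir(), 'fox');
file_put_contents($file, '<sentence>The <animal color="blue">fox</animal> leaped.</sentence>');
$result = parseXmlFile($file, 'noteStart', 'noteEnd', 'noteData');
unlink($file);

var_dump($result);
print_r($seen);
```

Returning an error message instead of calling die() is a design choice for this sketch: it leaves the caller free to decide whether a bad document should stop the whole script.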


With the generic XML processing code out of the way, let's move on to the callback functions defined near the top of the script. You'll remember that I registered the following three functions:




  • startElementHandler()
    Executed when an opening tag is encountered




  • endElementHandler()
    Executed when a closing tag is encountered




  • characterDataHandler()
    Executed when character data is encountered




Listing 2.3 is the revised script with these handlers included.



Listing 2.3 Defining SAX Callback Functions


<html>
<head>
<basefont face="Arial">
</head>
<body>
<?php

// run when start tag is found
function startElementHandler($parser, $name, $attributes)
{
    echo "Found opening tag of element: <b>$name</b> <br>";

    // process attributes
    foreach ($attributes as $key => $value)
    {
        echo "Found attribute: <b>$key = $value</b> <br>";
    }
}

// run when end tag is found
function endElementHandler($parser, $name)
{
    echo "Found closing tag of element: <b>$name</b> <br>";
}

// run when cdata is found
function characterDataHandler($parser, $cdata)
{
    echo "Found CDATA: <i>$cdata</i> <br>";
}

// XML data file
$xml_file = "fox.xml";

// initialize parser
$xml_parser = xml_parser_create();

// set callback functions
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

// read XML file
if (!($fp = fopen($xml_file, "r")))
{
    die("File I/O error: $xml_file");
}

// parse XML
while ($data = fread($fp, 4096))
{
    // error handler
    if (!xml_parse($xml_parser, $data, feof($fp)))
    {
        die("XML parser error: " .
            xml_error_string(xml_get_error_code($xml_parser)));
    }
}

// all done, clean up!
xml_parser_free($xml_parser);

?>
</body>
</html>

Nothing too complex here. The tag handlers print the names of the tags they encounter, whereas the character data handler prints the data enclosed within the tags. Notice that the startElementHandler() function automatically receives the tag name and attributes as function arguments, whereas the characterDataHandler() gets the CDATA text.
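
One thing to keep in mind about the character data handler: the parser is free to deliver a single run of text in several pieces, so the handler may be called more than once between two tags. The sketch below is one way to cope with that, assuming you want each run of text as a single string; the buffer variables, the function names, and the sample document are hypothetical.

```php
<?php
// Hypothetical buffering scheme: pieces of text are appended to $buffer and
// the buffer is stored as one string whenever a tag boundary is reached.
$buffer = '';
$texts = array();
$calls = 0;

function flushBuffer()
{
    global $buffer, $texts;
    if ($buffer !== '') {
        $texts[] = $buffer;
        $buffer = '';
    }
}

function bufferStart($parser, $name, $attributes)
{
    flushBuffer();
}

function bufferEnd($parser, $name)
{
    flushBuffer();
}

function bufferData($parser, $data)
{
    global $buffer, $calls;
    $buffer .= $data;   // append this piece to whatever came before
    $calls++;           // count how many pieces the parser delivered
}

// the entity reference is a typical spot where text arrives in pieces
$xml = '<menu><dish>fish &amp; chips</dish></menu>';

$parser = xml_parser_create();
xml_set_element_handler($parser, 'bufferStart', 'bufferEnd');
xml_set_character_data_handler($parser, 'bufferData');

if (!xml_parse($parser, $xml, true)) {
    die('XML parser error: ' . xml_error_string(xml_get_error_code($parser)));
}

echo "character data handler calls: $calls\n";
print_r($texts);
```

However many times the handler is called, the buffered result is a single entry holding the complete text of the dish element.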


And when you execute the script through a browser, here's what the end product looks like (and if you're wondering why all the element names are in uppercase, take a look at the "Controlling Parser Behavior" section):



Found opening tag of element: SENTENCE
Found CDATA: The
Found opening tag of element: ANIMAL
Found attribute: COLOR = blue
Found CDATA: fox
Found closing tag of element: ANIMAL
Found CDATA: leaped over the
Found opening tag of element: VEGETABLE
Found attribute: COLOR = green
Found CDATA: cabbage
Found closing tag of element: VEGETABLE
Found CDATA: patch and vanished into the darkness.
Found closing tag of element: SENTENCE

Not all that impressive, certainly, but then again, we're just getting started!
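
As a quick preview of why the element names above appear in uppercase: PHP's parser enables an option called case folding (XML_OPTION_CASE_FOLDING) by default. The following sketch, in which the elementNames() helper and its supporting functions are hypothetical, parses the same fragment with the option switched on and then off.

```php
<?php
// Hypothetical helpers: rememberName() records each element name it is
// given, and elementNames() returns the names seen in $xml with case
// folding switched on (1) or off (0).
$names = array();

function rememberName($parser, $name, $attributes)
{
    global $names;
    $names[] = $name;
}

function ignoreEnd($parser, $name)
{
    // end tags are not needed for this sketch
}

function elementNames($xml, $caseFolding)
{
    global $names;
    $names = array();

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $caseFolding);
    xml_set_element_handler($parser, 'rememberName', 'ignoreEnd');
    xml_parse($parser, $xml, true);

    return $names;
}

$xml = '<sentence>The <animal color="blue">fox</animal></sentence>';

// case folding on (the default): names arrive in uppercase
print_r(elementNames($xml, 1));

// case folding off: names arrive exactly as written in the document
print_r(elementNames($xml, 0));
```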






