PHP and SAX
PHP 4.0 comes with a very capable SAX parser based on the expat library. Created by James Clark, the expat library is a fast, robust SAX implementation that provides XML parsing capabilities to a number of open-source projects, including the Mozilla browser (http://www.mozilla.org/).
If you're using a stock PHP binary, it's quite likely that you'll need to recompile PHP to add support for this library to your PHP build. Detailed instructions for accomplishing this are available in Appendix A, "Recompiling PHP to Add XML Support."
A Simple Example
You can do a number of complex things with a SAX parser; however, I'll begin with something simple to illustrate just how it all fits together. Let's go back to the previous XML document (see Listing 2.1), and write some PHP code to process this document and do something with the data inside it (see Listing 2.2).
Listing 2.2 Generic PHP-Based XML Parser
<html>
<head>
<basefont face="Arial">
</head>
<body>

<?php
// XML data file
$xml_file = "fox.xml";

// initialize parser
$xml_parser = xml_parser_create();

// set callback functions
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

// read XML file
if (!($fp = fopen($xml_file, "r")))
{
    die("File I/O error: $xml_file");
}

// parse XML
while ($data = fread($fp, 4096))
{
    // error handler
    if (!xml_parse($xml_parser, $data, feof($fp)))
    {
        die("XML parser error: " . xml_error_string(xml_get_error_code($xml_parser)));
    }
}

// all done, clean up!
xml_parser_free($xml_parser);
?>

</body>
</html>
I'll explain Listing 2.2 in detail:
The first order of business is to initialize the SAX parser. This is accomplished via PHP's aptly named xml_parser_create() function, which returns a handle for use in successive operations involving the parser.
$xml_parser = xml_parser_create();
With the parser created, it's time to let it know which events you would like it to monitor, and which user-defined functions (or callback functions) it should call when these events occur. For the moment, I'm going to restrict my activities to monitoring start tags, end tags, and the data embedded within them:
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");
It's possible to initialize the parser with a specific encoding. For example:
$xml_parser = xml_parser_create("UTF-8");
PHP's SAX parser currently supports the following encodings:
ISO-8859-1
US-ASCII
UTF-8
An attempt to use an unsupported encoding will result in a slew of ugly error messages. Try it yourself to see what I mean.
What have I done here? Very simple. I've told the parser to call the function startElementHandler() when it finds an opening tag, the function endElementHandler() when it finds a closing tag, and the function characterDataHandler() whenever it encounters character data within the document.
When the parser calls these functions, it will automatically pass them all relevant information as function arguments. Depending on the type of callback registered, this information could include the element name, element attributes, character data, processing instructions, or notation identifiers.
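To make the argument passing concrete, here is a minimal sketch, separate from Listing 2.2, in which each handler simply records what it receives. The inline document and the $events array are made up for illustration, and the closures assume PHP 5.3 or later; named functions like the ones registered above would work the same way.

```php
<?php
// Hypothetical inline document, used only for illustration.
$xml = '<note priority="high">Buy milk</note>';

// Everything the handlers receive is recorded here.
$events = array();

$parser = xml_parser_create();

// Start-tag handler: receives the parser handle, the element name,
// and an associative array of the element's attributes.
$on_start = function ($parser, $name, $attributes) use (&$events) {
    $events[] = "start:$name";
    foreach ($attributes as $key => $value) {
        $events[] = "attr:$key=$value";
    }
};

// End-tag handler: receives the parser handle and the element name.
$on_end = function ($parser, $name) use (&$events) {
    $events[] = "end:$name";
};

// Character data handler: receives the parser handle and a piece of text.
$on_text = function ($parser, $text) use (&$events) {
    $events[] = "text:$text";
};

xml_set_element_handler($parser, $on_start, $on_end);
xml_set_character_data_handler($parser, $on_text);

// The whole document is passed in one call, so the third argument is true.
if (!xml_parse($parser, $xml, true)) {
    die("XML parser error: " . xml_error_string(xml_get_error_code($parser)));
}

print_r($events);
```

With the parser's default settings, the element and attribute names should arrive in uppercase, which is the same behavior you'll see in the output of Listing 2.3.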
From Listing 2.2, you can see that I haven't defined these functions yet; I'll do that a little later, and you'll see how this works in practice. Until these functions have been defined, any attempt to run the code from Listing 2.2 as it is right now will fail.
Now that the callback functions have been registered, all that remains is to actually parse the XML document. This is a simple exercise. First, create a file handle for the document:
if (!($fp = fopen($xml_file, "r")))
{
    die("File I/O error: $xml_file");
}
Then, read in chunks of data with fread(), and parse each chunk using the xml_parse() function:
while ($data = fread($fp, 4096))
{
    // error handler
    if (!xml_parse($xml_parser, $data, feof($fp)))
    {
        die("XML parser error: " . xml_error_string(xml_get_error_code($xml_parser)));
    }
}
In the event that errors are encountered while parsing the document, the script will automatically terminate via PHP's die() function. Detailed error information can be obtained via the xml_error_string() and xml_get_error_code() functions (for more information on how these work, see the "Handling Errors" section).
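As a rough illustration of what that error information looks like, the following sketch feeds the parser a deliberately malformed document and builds a message from the error code. The inline XML and the $message format are assumptions made for this example; xml_get_current_line_number() is a standard function of PHP's XML extension that does not appear in the listing above.

```php
<?php
// Deliberately malformed: the <b> element is closed by </a>.
$bad_xml = "<a>\n<b>text</a>";

$parser = xml_parser_create();

// xml_parse() returns a false-like value when the document is not well-formed.
$ok = xml_parse($parser, $bad_xml, true);

if (!$ok) {
    // Numeric error code for the most recent parser error.
    $code = xml_get_error_code($parser);

    // Human-readable description plus the line the parser had reached.
    $message = sprintf(
        "XML parser error %d: %s (line %d)",
        $code,
        xml_error_string($code),
        xml_get_current_line_number($parser)
    );
    echo $message, "\n";
}
```

The exact code and wording depend on the XML library PHP was built against, so treat the message as diagnostic text rather than something to match against.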
After the complete file has been processed, it's good programming practice to clean up after yourself by destroying the XML parser you created:
xml_parser_free($xml_parser);
That said, in the event that you forget, PHP will automatically destroy the parser for you when the script ends.
You already know that SAX can process XML data in chunks, making it possible to parse XML documents larger than available memory. Ever wondered how it knows when to stop?
That's where the optional third parameter to xml_parse() comes in. As each chunk of data is read from the XML file, it is passed to the xml_parse() function for processing. When the end of the file is reached, the feof() function returns true, which tells the parser to stop and take a well-deserved break.
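Here is a small sketch of the same idea without a file: a string is cut into fixed-size pieces, and only the last piece is passed with the final flag set to true. The inline document, the 16-byte chunk size, and the $elements counter are assumptions for illustration; str_split() stands in for repeated fread() calls.

```php
<?php
// Hypothetical document; in the listing this data would come from fread().
$xml = '<sentence>The <animal color="blue">fox</animal> leaped.</sentence>';

// Cut the document into 16-byte chunks, even in the middle of a tag.
$chunks = str_split($xml, 16);

$parser = xml_parser_create();

// Count opening tags to confirm the parser reassembles the chunks correctly.
$elements = 0;
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attributes) use (&$elements) {
        $elements++;
    },
    function ($parser, $name) {
        // nothing to do for closing tags in this sketch
    }
);

$last = count($chunks) - 1;
foreach ($chunks as $i => $chunk) {
    // Only the last chunk tells the parser that no more data will follow.
    $is_final = ($i === $last);
    if (!xml_parse($parser, $chunk, $is_final)) {
        die("XML parser error: " . xml_error_string(xml_get_error_code($parser)));
    }
}

echo "Opening tags seen: $elements\n";
```

Because the parser keeps its state between calls, a chunk boundary that falls inside a tag or an attribute value is not a problem.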
The preceding four steps make up a pretty standard process, and you'll find yourself using them over and over again when processing XML data with PHP's SAX parser. For this reason, you might find it more convenient to package them as a separate function, and call this function wherever required, a technique demonstrated in Listing 2.23.
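One possible shape for such a function is sketched here. It is not the book's own listing: the name parse_xml_file(), its parameters, and its return convention (null on success, an error string on failure) are all hypothetical, and the temporary file exists only to keep the example self-contained.

```php
<?php
// Hypothetical helper wrapping the four steps described above.
// Returns null on success, or an error message string on failure.
function parse_xml_file($file, $start_handler, $end_handler, $cdata_handler)
{
    // Step 1: initialize the parser.
    $parser = xml_parser_create();

    // Step 2: register the callback functions.
    xml_set_element_handler($parser, $start_handler, $end_handler);
    xml_set_character_data_handler($parser, $cdata_handler);

    // Step 3: open the file and parse it chunk by chunk.
    if (!($fp = fopen($file, "r"))) {
        return "File I/O error: $file";
    }
    $error = null;
    while ($data = fread($fp, 4096)) {
        if (!xml_parse($parser, $data, feof($fp))) {
            $error = "XML parser error: " . xml_error_string(xml_get_error_code($parser));
            break;
        }
    }
    fclose($fp);

    // Step 4: clean up. Listing 2.2 calls xml_parser_free() here; this sketch
    // leaves it out and relies on PHP destroying the parser automatically.
    return $error;
}

// Usage: write a tiny document to a temporary file and collect element names.
$file = tempnam(sys_get_temp_dir(), "xml");
file_put_contents($file, '<animal color="blue">fox</animal>');

$names = array();
$error = parse_xml_file(
    $file,
    function ($parser, $name, $attributes) use (&$names) {
        $names[] = $name;
    },
    function ($parser, $name) {
    },
    function ($parser, $cdata) {
    }
);
unlink($file);

echo ($error === null) ? "Parsed: " . implode(", ", $names) : $error, "\n";
```

Returning the error instead of calling die() inside the function is a design choice of this sketch; it leaves the decision about how to react to the calling script.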
With the generic XML processing code out of the way, let's move on to the callback functions registered near the top of the script. You'll remember that I registered the following three functions:
startElementHandler(): Executed when an opening tag is encountered
endElementHandler(): Executed when a closing tag is encountered
characterDataHandler(): Executed when character data is encountered
Listing 2.3 is the revised script with these handlers included.
Listing 2.3 Defining SAX Callback Functions
<html>
<head>
<basefont face="Arial">
</head>
<body>

<?php
// run when start tag is found
function startElementHandler($parser, $name, $attributes)
{
    echo "Found opening tag of element: <b>$name</b> <br>";

    // process attributes
    // (foreach replaces the each() loop, which current PHP versions no longer support)
    foreach ($attributes as $key => $value)
    {
        echo "Found attribute: <b>$key = $value</b> <br>";
    }
}

// run when end tag is found
function endElementHandler($parser, $name)
{
    echo "Found closing tag of element: <b>$name</b> <br>";
}

// run when cdata is found
function characterDataHandler($parser, $cdata)
{
    echo "Found CDATA: <i>$cdata</i> <br>";
}

// XML data file
$xml_file = "fox.xml";

// initialize parser
$xml_parser = xml_parser_create();

// set callback functions
xml_set_element_handler($xml_parser, "startElementHandler", "endElementHandler");
xml_set_character_data_handler($xml_parser, "characterDataHandler");

// read XML file
if (!($fp = fopen($xml_file, "r")))
{
    die("File I/O error: $xml_file");
}

// parse XML
while ($data = fread($fp, 4096))
{
    // error handler
    if (!xml_parse($xml_parser, $data, feof($fp)))
    {
        die("XML parser error: " . xml_error_string(xml_get_error_code($xml_parser)));
    }
}

// all done, clean up!
xml_parser_free($xml_parser);
?>

</body>
</html>
Nothing too complex here. The tag handlers print the names of the tags they encounter, whereas the character data handler prints the data enclosed within the tags. Notice that the startElementHandler() function automatically receives the tag name and attributes as function arguments, whereas the characterDataHandler() gets the CDATA text.
And when you execute the script through a browser, here's what the end product looks like (and if you're wondering why all the element names are in uppercase, take a look at the "Controlling Parser Behavior" section):
Found opening tag of element: SENTENCE
Found CDATA: The
Found opening tag of element: ANIMAL
Found attribute: COLOR = blue
Found CDATA: fox
Found closing tag of element: ANIMAL
Found CDATA: leaped over the
Found opening tag of element: VEGETABLE
Found attribute: COLOR = green
Found CDATA: cabbage
Found closing tag of element: VEGETABLE
Found CDATA: patch and vanished into the darkness.
Found closing tag of element: SENTENCE
Not all that impressive, certainly, but then again, we're just getting started!