How to write custom element handlers for PHP's xml_parse to process complex XML data structures and improve parsing efficiency

M66 2025-05-13

PHP's built-in XML parser extension provides an event-based API modeled on the Expat library and is well suited to complex XML data. With xml_parser_create() and its accompanying handler functions, we can process an XML structure flexibly. For documents with deep nesting and many elements, custom element handlers let you process the data as it is read, without first building the whole document tree in memory, and keep the extraction logic for each element in one place.

This article explains how to register custom element handlers with xml_set_element_handler() and walks through sample code that parses a nested XML data structure.

1. What is an XML element handler?

When parsing an XML stream with xml_parse(), we can register two callback functions on the parser via xml_set_element_handler():

startElementHandler: the callback invoked for each opening tag
endElementHandler: the callback invoked for each closing tag

The signatures of these two functions are usually as follows:

function startElement($parser, $name, $attrs)
function endElement($parser, $name)

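where $name is the name of the current element and $attrs is an associative array of that element's attributes. Note that case folding is enabled by default, so element and attribute names are reported in upper case unless XML_OPTION_CASE_FOLDING is turned off.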
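To make the handler arguments concrete, here is a minimal sketch that logs every start and end event for a one-line document. The $events array and the inline sample XML are assumptions made for illustration; they are not part of the article's example.

```php
<?php
// Minimal sketch: record each start/end event so the handler arguments are visible.
$events = [];

$parser = xml_parser_create("UTF-8");

// Case folding is on by default and upper-cases names;
// turn it off to receive names exactly as written in the document.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

xml_set_element_handler(
    $parser,
    // Start handler: receives the element name and its attributes.
    function ($parser, $name, $attrs) use (&$events) {
        $id = isset($attrs['id']) ? "[id={$attrs['id']}]" : "";
        $events[] = "start:{$name}{$id}";
    },
    // End handler: receives only the element name.
    function ($parser, $name) use (&$events) {
        $events[] = "end:{$name}";
    }
);

xml_parse($parser, '<book id="001"><title>PHP</title></book>', true);

echo implode("\n", $events), "\n";
```

The events arrive in document order (start of book, start of title, end of title, end of book), which is why the handlers must keep their own state to know where in the tree they are.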

2. Example: parsing nested XML data

Suppose we get XML data in the following format from an API:

<catalog>
    <book id="001">
        <title>PHP Development practice</title>
        <author>Zhang San</author>
        <price currency="CNY">89.00</price>
    </book>
    <book id="002">
        <title>In-depth understanding XML</title>
        <author>Li Si</author>
        <price currency="CNY">75.50</price>
    </book>
</catalog>

We will write a parser that extracts the title, author, and price information for each book and outputs it.

3. Custom handler implementation

<?php

$xmlData = file_get_contents('https://m66.net/api/books.xml');
if ($xmlData === false) {
    die("Failed to load the XML data");
}

// Used to store parsed results
$books = [];
$currentBook = [];
$currentTag = "";

// Create the XML parser
$parser = xml_parser_create("UTF-8");

// Set the handler functions for the start and end tags
xml_set_element_handler($parser, "startElement", "endElement");

// Set the character data handler function
xml_set_character_data_handler($parser, "characterData");

// Set parser options
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false); // Keep element names in their original case

// Define processing functions
function startElement($parser, $name, $attrs) {
    global $currentBook, $currentTag;

    $currentTag = $name;

    if ($name == "book") {
        $currentBook = [
            "id" => $attrs['id'] ?? null,
            "title" => "",
            "author" => "",
            "price" => "",
            "currency" => ""
        ];
    }

    if ($name == "price" && isset($attrs['currency'])) {
        $currentBook['currency'] = $attrs['currency'];
    }
}

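function endElement($parser, $name) {
    global $books, $currentBook, $currentTag;

    if ($name == "book") {
        // Trim only after the full text is assembled, so that spaces
        // at segment boundaries inside the text are preserved
        foreach (["title", "author", "price"] as $field) {
            $currentBook[$field] = trim($currentBook[$field]);
        }
        $books[] = $currentBook;
        $currentBook = [];
    }

    $currentTag = "";
}

function characterData($parser, $data) {
    global $currentBook, $currentTag;

    // Text may arrive in several segments, so append each one untrimmed.
    // Whitespace between tags is skipped because $currentTag does not match.
    switch ($currentTag) {
        case "title":
        case "author":
        case "price":
            $currentBook[$currentTag] .= $data;
            break;
    }
}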

// Execute parsing
if (!xml_parse($parser, $xmlData, true)) {
    die(sprintf("XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)));
}

// Release the parser (has no effect as of PHP 8.0, where parsers are objects)
xml_parser_free($parser);

// Output the parsed results
foreach ($books as $book) {
    echo "Book title: {$book['title']}\n";
    echo "Author: {$book['author']}\n";
    echo "Price: {$book['price']} {$book['currency']}\n";
    echo "------------------------\n";
}
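The listing above keeps its state in global variables. As one possible alternative, the sketch below wraps the same logic in a class, tracks the open elements on a stack, and accepts the document in chunks. The class name BookCatalogParser, its method names, and the 16-byte chunk size are hypothetical choices for illustration, not part of PHP or of the original example.

```php
<?php
// Hypothetical wrapper class: keeps parser state in properties instead of globals.
final class BookCatalogParser
{
    private $parser;
    private array $stack = [];  // names of the currently open elements
    private array $book = [];   // the book being assembled
    private string $text = '';  // text collected for the current element
    public array $books = [];

    public function __construct()
    {
        $this->parser = xml_parser_create("UTF-8");
        xml_parser_set_option($this->parser, XML_OPTION_CASE_FOLDING, false);
        xml_set_element_handler($this->parser, [$this, 'onStart'], [$this, 'onEnd']);
        xml_set_character_data_handler($this->parser, [$this, 'onText']);
    }

    public function onStart($parser, string $name, array $attrs): void
    {
        $this->stack[] = $name;
        $this->text = '';
        if ($name === 'book') {
            $this->book = ['id' => $attrs['id'] ?? null];
        } elseif ($name === 'price') {
            $this->book['currency'] = $attrs['currency'] ?? '';
        }
    }

    public function onText($parser, string $data): void
    {
        // One text node may be delivered in several calls, so append.
        $this->text .= $data;
    }

    public function onEnd($parser, string $name): void
    {
        array_pop($this->stack);
        $parent = end($this->stack); // false when the root element closes
        if ($name === 'book') {
            $this->books[] = $this->book;
        } elseif ($parent === 'book') {
            // Direct child of <book>: store its trimmed text under the tag name.
            $this->book[$name] = trim($this->text);
        }
        $this->text = '';
    }

    // Feed one chunk; pass true for $isFinal with the last chunk.
    public function feed(string $chunk, bool $isFinal = false): void
    {
        if (!xml_parse($this->parser, $chunk, $isFinal)) {
            throw new RuntimeException(sprintf(
                'XML error: %s at line %d',
                xml_error_string(xml_get_error_code($this->parser)),
                xml_get_current_line_number($this->parser)
            ));
        }
    }
}

$xml = '<catalog><book id="001"><title>PHP Development practice</title>'
     . '<author>Zhang San</author><price currency="CNY">89.00</price></book></catalog>';

$catalog = new BookCatalogParser();
$chunks = str_split($xml, 16); // simulate data arriving in small pieces
foreach ($chunks as $i => $chunk) {
    $catalog->feed($chunk, $i === count($chunks) - 1);
}

print_r($catalog->books);
```

Feeding chunks this way means a large file could be read with fread() in a loop rather than loaded whole, and the stack makes it possible to tell a title that sits directly under book apart from a same-named element elsewhere in the tree.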

4. Optimization techniques and suggestions

Track context with state variables
State variables such as $currentTag and $currentBook tell the handlers which node is currently being processed, which becomes essential as nesting gets deeper.
Filter whitespace characters
The character data handler may receive many segments that contain only newlines and spaces (the indentation between tags); apply trim() and compare the result with the empty string to discard them. Avoid empty() for this check, because it also treats the string "0" as empty.
Append segmented text
The content of a single tag may be delivered in several segments (especially long text); appending with .= instead of assigning with = prevents earlier segments from being overwritten.
Use namespace-aware parsing for complex XML
If the XML uses namespaces, create the parser with xml_parser_create_ns() and register handlers such as xml_set_start_namespace_decl_handler() to track the declarations.
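As a rough illustration of the namespace suggestion, the sketch below parses a small namespaced document. The c: prefix, the http://example.com/catalog URI, and the "|" separator are assumptions chosen for the example; with a namespace-aware parser, each element name is reported as the namespace URI, the separator, and the local name.

```php
<?php
// Sketch of namespace-aware parsing; the prefix and URI below are made up.
$xml = '<c:catalog xmlns:c="http://example.com/catalog">'
     . '<c:book><c:title>PHP</c:title></c:book></c:catalog>';

$names = [];     // qualified element names as reported by the parser
$prefixes = [];  // prefix => namespace URI, collected from declarations

// The second argument is the separator placed between the URI and the local name.
$parser = xml_parser_create_ns("UTF-8", "|");
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

// Called once for each namespace declaration (xmlns:prefix="uri").
xml_set_start_namespace_decl_handler(
    $parser,
    function ($parser, $prefix, $uri) use (&$prefixes) {
        $prefixes[$prefix] = $uri;
    }
);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$names) {
        $names[] = $name;
    },
    function ($parser, $name) {
        // Nothing to do at the end of an element in this sketch.
    }
);

if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

foreach ($names as $name) {
    // Split off the local name after the last separator.
    echo substr($name, strrpos($name, '|') + 1), " <- ", $name, "\n";
}
```

Matching on the full URI-qualified name rather than on the prefix keeps the handlers correct even when a document binds the same namespace to a different prefix.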
