Combining xml_parse and regular expressions to clean up noise information in XML data

M66 2025-04-26

When processing XML data, we often encounter "noise": illegal characters, useless tags, comments, or dirty data nested in CDATA sections. To keep parsing accurate, we can first clean the XML content with regular expressions and then parse it with PHP's xml_parse function. This removes the interfering items before they reach the parser and improves data reliability.

1. Understand the xml_parse function

xml_parse is a low-level, event-driven XML parser provided by PHP's XML extension, which exposes an Expat-style API. It can read an XML string segment by segment and processes nodes through callback functions. However, xml_parse requires well-formed input: if the XML contains illegal characters or format errors, parsing stops and the function returns 0 (failure).
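As a brief aside on the segment-by-segment behaviour, the following minimal sketch feeds a document to the parser in fixed-size pieces. The helper name parseInChunks, the chunk size, and the sample document are illustrative assumptions and are not part of the original article.

```php
<?php
// Hypothetical helper: feeds $xml to the parser in $chunkSize-byte pieces and
// returns the element names seen, or null if the input is empty or parsing fails.
function parseInChunks(string $xml, int $chunkSize = 16): ?array
{
    $length = strlen($xml);
    if ($length === 0) {
        return null;
    }

    $names = [];
    $parser = xml_parser_create();
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) {
            // Names arrive upper-cased because case folding is on by default
            $names[] = $name;
        },
        function ($parser, $name) {
            // End tags are ignored in this sketch
        }
    );

    $ok = true;
    for ($offset = 0; $offset < $length && $ok; $offset += $chunkSize) {
        $chunk = substr($xml, $offset, $chunkSize);
        // The third argument tells the parser whether this is the last piece
        $isFinal = ($offset + $chunkSize >= $length);
        $ok = xml_parse($parser, $chunk, $isFinal) === 1;
    }

    return $ok ? $names : null;
}

print_r(parseInChunks('<feed><item id="1">first</item><item id="2">second</item></feed>'));
```

Chunk boundaries may fall in the middle of a tag. The parser keeps its state between calls, so only the final call needs the third argument set to true.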

Basic usage with callback handlers looks like this:

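$xml_parser = xml_parser_create();

xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");

$xml_data = file_get_contents("https://m66.net/sample.xml");
if ($xml_data === false) {
    die("Failed to download the XML document");
}

// Third argument true: this is the final (and only) chunk of data
if (!xml_parse($xml_parser, $xml_data, true)) {
    die("XML parsing failed: " . xml_error_string(xml_get_error_code($xml_parser)));
}

// Has no effect since PHP 8.0; only needed on earlier versions
xml_parser_free($xml_parser);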

function startElement($parser, $name, $attrs) {
    echo "Start Element: $name\n";
}

function endElement($parser, $name) {
    echo "Ending Element: $name\n";
}

function characterData($parser, $data) {
    echo "Data content: $data\n";
}

This code reads a remote XML document and uses callback functions to process each tag and text node in turn. However, if the XML contains illegal characters (such as control characters) or an unterminated CDATA section, parsing fails.
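To see what such a failure looks like in practice, here is a small sketch that reports the first error together with its line number. The helper name describeXmlError and the sample strings are hypothetical. The exact wording of the error message depends on the underlying XML library, so it is not shown here.

```php
<?php
// Hypothetical helper: returns null when $xml parses cleanly,
// otherwise a short description of the first error.
function describeXmlError(string $xml): ?string
{
    $parser = xml_parser_create();
    if (xml_parse($parser, $xml, true) === 1) {
        return null;
    }

    return sprintf(
        '%s at line %d',
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

// A control character (0x01) inside text content is not allowed in XML 1.0
$dirty = "<note>\n<body>bad \x01 byte</body>\n</note>";
var_dump(describeXmlError($dirty));            // expected: an error description string

var_dump(describeXmlError('<note>ok</note>')); // NULL
```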

2. Use regular expressions to clean up noise information

For xml_parse to work properly, noise in the XML must be removed before parsing. Regular expressions handle many of these cases efficiently. Common kinds of noise include:

  • Control characters that XML 1.0 forbids (ASCII 0-31, except tab, line feed, and carriage return)

  • Malformed or unwanted comments (such as comments containing "--" or embedded scripts)

  • Incorrectly nested tags

  • Extra whitespace or line breaks

Here are some processing examples:

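function cleanXmlData($xml) {
    // Remove control characters that XML 1.0 forbids (tab, LF, and CR are kept).
    // No /u flag: matching bytes keeps multibyte UTF-8 text intact and
    // does not fail on invalid UTF-8 input.
    $xml = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xml);

    // Remove comments (this also strips comment-like text inside CDATA)
    $xml = preg_replace('/<!--.*?-->/s', '', $xml);

    // Remove empty elements such as <tag></tag>, including any attributes they carry
    $xml = preg_replace('/<(\w+)\b[^>]*>\s*<\/\1>/', '', $xml);

    // Clean up scripts or injected content hidden in CDATA sections.
    // A callback replacement requires preg_replace_callback, not preg_replace.
    $xml = preg_replace_callback('/<!\[CDATA\[(.*?)\]\]>/s', function ($matches) {
        $content = $matches[1];
        // Filter the content as needed, for example by removing <script> blocks
        $content = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $content);
        return "<![CDATA[$content]]>";
    }, $xml);

    return $xml;
}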
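The list above also mentions extra whitespace and line breaks. One possible way to handle them is sketched below. The function name collapseInterTagWhitespace is hypothetical. The pattern assumes that whitespace-only runs between tags carry no meaning in your data, and it will also touch such runs inside CDATA sections.

```php
<?php
// Hypothetical helper: removes whitespace-only runs between tags so the
// character data handler is not called for indentation and blank lines.
function collapseInterTagWhitespace(string $xml): string
{
    // Drop whitespace that sits between a closing '>' and the next '<'.
    // Fall back to the original string if the regex engine reports an error.
    $collapsed = preg_replace('/>\s+</', '><', $xml) ?? $xml;

    return trim($collapsed);
}

$raw = "<feed>\n    <item>  keep  inner  spacing  </item>\n\n    <item/>\n</feed>\n";
echo collapseInterTagWhitespace($raw), "\n";
// <feed><item>  keep  inner  spacing  </item><item/></feed>
```

Text that contains non-whitespace characters is left untouched, so spacing inside element content survives.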

3. Combining cleaning and parsing

Integrate the cleanup steps and the XML parser:

$raw_xml = file_get_contents("https://m66.net/raw-feed.xml");
if ($raw_xml === false) {
    die("Failed to download the XML document");
}

$clean_xml = cleanXmlData($raw_xml);

$parser = xml_parser_create();
xml_set_element_handler($parser, "startElement", "endElement");
xml_set_character_data_handler($parser, "characterData");

if (!xml_parse($parser, $clean_xml, true)) {
    die("XML parsing failed after cleanup: " . xml_error_string(xml_get_error_code($parser)));
}

xml_parser_free($parser);

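In this way, XML that would otherwise be rejected because of noise such as stray control characters can often be parsed after cleaning, which improves system stability. Structural problems, such as incorrectly nested or unclosed tags, are not repaired by these patterns and will still cause xml_parse to fail.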
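A quick before-and-after check can make the effect of cleaning visible. The sketch below is self-contained and uses two hypothetical helpers, isWellFormed and stripControlChars. The second is a reduced stand-in for a fuller cleaning function and only removes forbidden control characters.

```php
<?php
// Hypothetical helper: true when xml_parse accepts the whole document.
function isWellFormed(string $xml): bool
{
    $parser = xml_parser_create();
    return xml_parse($parser, $xml, true) === 1;
}

// Hypothetical helper: strips control characters that XML 1.0 forbids.
function stripControlChars(string $xml): string
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $xml) ?? $xml;
}

// A bell character (0x07) makes the raw document unparseable
$raw = "<log><entry>disk \x07full</entry></log>";
var_dump(isWellFormed($raw));                       // expected: bool(false)
var_dump(isWellFormed(stripControlChars($raw)));    // expected: bool(true)

// Cleaning characters cannot repair broken structure
$broken = "<log><entry>unclosed</log>";
var_dump(isWellFormed(stripControlChars($broken))); // expected: bool(false)
```

Logging the result of such a check before and after cleaning is one way to measure how much of a feed the cleanup step actually rescues.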

4. Summary

Combining regular-expression cleaning with xml_parse can greatly improve fault tolerance when processing XML data. Regular expressions handle weakly structured "dirty" data, while xml_parse efficiently processes well-formed XML documents. The combination suits systems that rely heavily on XML, such as log analysis, data collection, and API gateways.

Always remember: data preprocessing is the first step to successful parsing.