Parsing XML with the xml_parse function is common in PHP, but when a document contains special characters (such as &, <, >, or invalid UTF-8 byte sequences), parsing can easily fail or cause script errors. This article looks at several problems that come up in real projects and gives a fix for each, so you can avoid the usual traps when parsing XML that contains special characters.
XML allows only a restricted set of characters, and the parser expects them in a valid encoding such as UTF-8. If the input contains illegal characters (control characters other than tab, line feed, and carriage return, or invalid byte sequences), xml_parse stops and reports an error.
Use iconv or mb_convert_encoding to preprocess the content:
$rawXml = file_get_contents('https://m66.net/data.xml');
$cleanXml = mb_convert_encoding($rawXml, 'UTF-8', 'UTF-8');
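Or use a regular expression to remove illegal control characters:
// Strip C0 control characters except tab, LF, and CR. Only these bytes are matched,
// so multi-byte UTF-8 sequences (whose bytes are all 0x80 or above) stay intact.
$cleanXml = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $rawXml);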
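The two cleaning steps above can be combined into one helper. The following is a minimal sketch, assuming the mbstring extension is loaded; the function name sanitizeXmlString is hypothetical, and the choice to drop invalid bytes (rather than replace them with "?") is an assumption, not something the original prescribes.

```php
<?php
// Hypothetical helper: normalize to valid UTF-8, then strip control characters.
function sanitizeXmlString(string $raw): string
{
    // Temporarily tell mbstring to drop invalid byte sequences
    // instead of substituting a placeholder character for them.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);

    // Remove C0 control characters other than tab, LF, and CR.
    // These byte values never occur inside a multi-byte UTF-8 sequence,
    // so a byte-wise pattern is safe here.
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $utf8);

    // preg_replace returns null only on an internal PCRE error; fall back to the input.
    return $clean ?? $utf8;
}
```

Calling sanitizeXmlString($rawXml) before xml_parse keeps legitimate non-ASCII text untouched while removing the bytes most likely to stop the parser.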
In XML, a literal & must be written as &amp;. If the document contains an unescaped &, xml_parse reports an error.
You can preprocess the text with str_replace (htmlspecialchars is suitable only for individual text values, because applied to a whole document it would also escape the markup), but take care not to escape twice:
$cleanXml = str_replace('&', '&amp;', $rawXml);
// Note: this is only an example. Before using it, check whether the content is already escaped, to avoid double escaping.
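One way to avoid double escaping is to replace only those ampersands that do not already start an entity or character reference. This is a sketch under that assumption; the function name escapeBareAmpersands and the exact pattern are illustrative, not part of any PHP API.

```php
<?php
// Hypothetical helper: escape "&" only when it is not already the start of
// a named entity (&name;), a decimal reference (&#123;), or a hex reference (&#x1F;).
function escapeBareAmpersands(string $xml): string
{
    $pattern = '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/';

    // preg_replace returns null only on an internal PCRE error; fall back to the input.
    return preg_replace($pattern, '&amp;', $xml) ?? $xml;
}
```

The pattern has limits worth knowing: it leaves undefined names such as &nbsp; alone even though an XML parser will still reject them, and it also rewrites ampersands inside CDATA sections, where they are legal as written.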
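A safer approach is to check that the XML is well-formed first:
libxml_use_internal_errors(true);
$xml = simplexml_load_string($rawXml);
// Compare with false explicitly: a valid but empty element can also evaluate to false.
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        echo "XML Error: " . trim($error->message) . "\n";
    }
    libxml_clear_errors();
}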
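The same check can be wrapped in a function that returns the messages instead of printing them, which makes it easier to reuse before calling xml_parse. This is a minimal sketch; the name collectXmlErrors and the message format are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: returns an empty array when the document is well-formed,
// otherwise a list of human-readable libxml error messages.
function collectXmlErrors(string $xml): array
{
    // Collect errors internally instead of emitting PHP warnings,
    // and remember the previous setting so it can be restored.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $messages = [];
    if (simplexml_load_string($xml) === false) {
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf(
                'line %d, column %d: %s',
                $error->line,
                $error->column,
                trim($error->message)
            );
        }
        if ($messages === []) {
            // Some inputs (for example an empty string) fail without a libxml error entry.
            $messages[] = 'document could not be parsed';
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}
```

An empty result means the input can be handed to the parser as is; a non-empty one tells you which cleaning step is still needed.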
If the XML declaration does not specify an encoding, or the declared encoding does not match the actual bytes, xml_parse may fail.
Prepend a standard XML declaration when one is missing, so the encoding is stated explicitly. This only helps when the content really is UTF-8; otherwise convert it first:
if (strpos($rawXml, '<?xml') === false) {
    $rawXml = '<?xml version="1.0" encoding="UTF-8"?>' . $rawXml;
}
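A slightly stricter variant of the check above also handles a leading byte order mark or whitespace, since the XML declaration is only valid at the very start of the document. This is a sketch under stated assumptions: the function name ensureXmlDeclaration and its $encoding parameter are hypothetical, and it does not verify that the bytes actually match the declared encoding.

```php
<?php
// Hypothetical helper: make sure the document starts with an XML declaration.
function ensureXmlDeclaration(string $xml, string $encoding = 'UTF-8'): string
{
    // Remove a UTF-8 byte order mark, if present.
    if (strncmp($xml, "\xEF\xBB\xBF", 3) === 0) {
        $xml = substr($xml, 3);
    }

    // Nothing may precede the declaration, not even whitespace.
    $xml = ltrim($xml);

    // Require whitespace after "<?xml" so that other processing instructions
    // whose names merely begin with "xml" are not mistaken for a declaration.
    if (preg_match('/^<\?xml\s/', $xml) === 1) {
        return $xml; // Already declared; returned without changes to the declaration.
    }

    return '<?xml version="1.0" encoding="' . $encoding . '"?>' . $xml;
}
```

An existing declaration is left as it is, so a document that declares one encoding but contains another still needs to be converted separately.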
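Developers sometimes forget to call xml_parser_free, which can leak the parser resource or cause unexpected behavior. (As of PHP 8.0 the parser is an object and xml_parser_free has no effect, but the call is still needed on older versions.)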
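// startElement, endElement, and characterData are handler functions you define elsewhere.
$parser = xml_parser_create('UTF-8');
xml_set_element_handler($parser, 'startElement', 'endElement');
xml_set_character_data_handler($parser, 'characterData');
if (!xml_parse($parser, $cleanXml, true)) {
    // Build the message first, then free the parser before exiting.
    $message = sprintf("XML Error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser));
    xml_parser_free($parser);
    die($message);
}
xml_parser_free($parser);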
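The snippet above refers to handler functions without defining them. The following self-contained sketch fills that gap with closures and uses try/finally so the cleanup runs on the error path too. The function name collectElementText and the "element name to text" result shape are assumptions made for illustration only.

```php
<?php
// Hypothetical example: collect the text content of each element, keyed by element name.
function collectElementText(string $xml): array
{
    $texts = [];
    $current = null;

    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them (the default).
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Start tag: remember which element is open and make sure it has an entry.
        function ($parser, string $name, array $attributes) use (&$texts, &$current) {
            $current = $name;
            $texts[$name] ??= '';
        },
        // End tag: text that follows belongs to no tracked element.
        function ($parser, string $name) use (&$current) {
            $current = null;
        }
    );
    xml_set_character_data_handler(
        $parser,
        // Character data may arrive in several chunks, so append rather than assign.
        function ($parser, string $data) use (&$texts, &$current) {
            if ($current !== null) {
                $texts[$current] .= $data;
            }
        }
    );

    try {
        if (!xml_parse($parser, $xml, true)) {
            throw new RuntimeException(sprintf(
                'XML Error: %s at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            ));
        }
    } finally {
        // Only needed before PHP 8.0, where the parser is a resource.
        if (PHP_VERSION_ID < 80000) {
            xml_parser_free($parser);
        }
    }

    return $texts;
}
```

Throwing an exception instead of calling die lets the caller decide how to react, and the finally block guarantees the parser is released either way.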
By default, xml_parse does not throw an exception when it fails. It returns 0, and the error details must be checked manually.
Use error-reporting functions such as xml_get_error_code() and xml_error_string() to make debugging easier.
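Beyond the error code and message, the parser can also report where it stopped. The sketch below gathers those details into one array; the function name parseWithDiagnostics and the array keys are hypothetical, while the xml_get_* functions it calls are part of PHP's XML extension.

```php
<?php
// Hypothetical helper: returns null when parsing succeeds,
// otherwise an array describing the first error the parser hit.
function parseWithDiagnostics(string $xml): ?array
{
    $parser = xml_parser_create('UTF-8');
    $error = null;

    // No handlers are registered: this call only checks whether the input parses.
    if (!xml_parse($parser, $xml, true)) {
        $code = xml_get_error_code($parser);
        $error = [
            'code'    => $code,
            'message' => xml_error_string($code),
            'line'    => xml_get_current_line_number($parser),
            'column'  => xml_get_current_column_number($parser),
            'byte'    => xml_get_current_byte_index($parser),
        ];
    }

    // Only needed before PHP 8.0, where the parser is a resource.
    if (PHP_VERSION_ID < 80000) {
        xml_parser_free($parser);
    }

    return $error;
}
```

Logging the line, column, and byte offset alongside the message usually makes it much quicker to find the offending character in a large feed.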
Although xml_parse is a low-level, fast way to parse XML, it is strict about well-formedness and character encoding, so use it with care. When handling untrusted or third-party XML, we recommend cleaning the input, checking its encoding, and adding error detection before parsing, to minimize the risk of parse failures.
For more complex XML structures and requirements, you can also consider more modern tools such as DOMDocument or SimpleXML, which are generally more forgiving when handling special characters and more concise to use.