When you parse XML with PHP's xml_parse() function, UTF-8 encoded input that is handled improperly can show up as garbled text in the parsed output. The problem is common when the XML declaration says UTF-8 but the encoding is not correctly recognized or converted while the data is read or processed. This article explains the causes and provides specific solutions.
xml_parse() belongs to PHP's XML Parser extension, which provides an Expat-style, event-based parser. The parser is strict about character encoding: the bytes it receives must actually be in the encoding the document claims, and this applies to UTF-8 in particular. If the XML data is declared as UTF-8 but is really in another encoding, or if PHP code converts the data incorrectly before parsing, the result is garbled text or a parse error.
Another common cause is reading XML from an external source (for example, fetching it by URL) without checking its encoding or converting it to UTF-8, which leaves the declared and actual encodings inconsistent.
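<?php
$xml = file_get_contents("https://m66.net/data/sample.xml");
// No encoding argument: PHP 5.0.2+ auto-detects the input encoding and outputs UTF-8;
// PHP 5.0.0 and 5.0.1 output ISO-8859-1 by default
$parser = xml_parser_create();
if (xml_parse($parser, $xml, true)) {
    echo "Parsed successfully";
} else {
    // xml_parse() returns 0 on failure, so report the error instead of claiming success
    echo "Parse error: " . xml_error_string(xml_get_error_code($parser));
}
xml_parser_free($parser);
?>
The code above runs, but it leaves the encoding to the parser's defaults. If the bytes in sample.xml do not match what the parser assumes (for example, on an old PHP version that defaults to ISO-8859-1, or with a file whose declaration and actual encoding disagree), the output may be garbled or parsing may fail outright.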
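To see how such a failure surfaces, the sketch below parses two in-memory documents: one that is valid UTF-8, and one that declares UTF-8 but contains a lone ISO-8859-1 byte. The helper name try_parse() and the sample strings are assumptions made for illustration, and the exact error message depends on the XML library PHP was built against.

```php
<?php
// Hypothetical helper: parses $xml and returns null on success,
// or a human-readable error description on failure.
function try_parse(string $xml): ?string
{
    $parser = xml_parser_create();
    $error = null;
    if (xml_parse($parser, $xml, true) !== 1) {
        $error = sprintf(
            '%s at line %d',
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)
        );
    }
    // The parser is released when $parser goes out of scope.
    return $error;
}

// Valid UTF-8: "é" encoded as the two bytes C3 A9.
$good = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\xC3\xA9</word>";
// Declared as UTF-8, but "é" is the single ISO-8859-1 byte E9.
$bad  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\xE9</word>";

foreach (['valid UTF-8' => $good, 'mislabelled bytes' => $bad] as $label => $xml) {
    $error = try_parse($xml);
    echo $label, ': ', $error ?? 'parsed successfully', PHP_EOL;
}
```

Checking the return value of xml_parse() this way makes an encoding mismatch visible as an explicit error rather than as silently corrupted text.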
To make the encoding explicit rather than relying on defaults, pass 'UTF-8' as the argument to xml_parser_create():
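<?php
$xml = file_get_contents("https://m66.net/data/sample.xml");
$parser = xml_parser_create('UTF-8'); // Explicitly specify UTF-8
// Deliver data to handlers in UTF-8 (already implied by the line above; set here for clarity)
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
if (xml_parse($parser, $xml, true)) {
    echo "Parsed successfully";
} else {
    echo "Parse error: " . xml_error_string(xml_get_error_code($parser));
}
xml_parser_free($parser);
?>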
With this setup the parser delivers data in UTF-8, so XML content containing Chinese or other multibyte characters is processed correctly, provided the input really is UTF-8 as declared.
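As a minimal sketch of what correct processing looks like, the example below registers a character data handler and collects the text of a small in-memory document. The sample document and variable names are assumptions made for illustration; the \u{...} escapes (PHP 7 and later) keep the example independent of the encoding of the script file itself.

```php
<?php
// Sample document containing the Chinese text U+4F60 U+597D, written with
// Unicode escapes so the bytes in the string are guaranteed to be UTF-8.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><greeting>\u{4F60}\u{597D}</greeting>";

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');

// The handler may be called more than once for a single text node,
// so the pieces are appended rather than assigned.
$text = '';
xml_set_character_data_handler($parser, function ($parser, string $data) use (&$text) {
    $text .= $data;
});

if (xml_parse($parser, $xml, true) !== 1) {
    exit('Parse error: ' . xml_error_string(xml_get_error_code($parser)));
}

// The collected text is valid UTF-8: two characters stored in six bytes.
var_dump(mb_check_encoding($text, 'UTF-8'));
var_dump(mb_strlen($text, 'UTF-8'), strlen($text));
```

Comparing the character count with the byte count is a quick way to confirm that multibyte text survived parsing intact.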
If you are not sure whether data from an external source (such as an API or a remote XML file) is really UTF-8, use mb_detect_encoding() to check it and iconv() to convert it:
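<?php
$xml = file_get_contents("https://m66.net/data/sample.xml");
// Strict check: returns false when the bytes are not valid UTF-8
if (mb_detect_encoding($xml, 'UTF-8', true) === false) {
    // 'GBK' is only an example; replace it with the actual source encoding
    $converted = iconv('GBK', 'UTF-8', $xml);
    if ($converted === false) {
        exit("Could not convert the XML to UTF-8");
    }
    // Note: if the XML declaration itself names the old encoding, change it to UTF-8 as well
    $xml = $converted;
}
$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
if (xml_parse($parser, $xml, true)) {
    echo "Parsed successfully";
} else {
    echo "Parse error: " . xml_error_string(xml_get_error_code($parser));
}
xml_parser_free($parser);
?>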
This avoids encoding mismatches, which matters most when the data comes from third-party platforms or other systems.
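When the source encoding is not known in advance, one option is to try a short list of candidate encodings in order. The following is a sketch under that assumption: the function name to_utf8() and the candidate list are hypothetical and should be adapted to the data sources you actually deal with. It uses mb_check_encoding() and mb_convert_encoding() from the mbstring extension rather than iconv().

```php
<?php
// Hypothetical helper: returns $data as UTF-8, trying each candidate
// encoding in order when the input is not already valid UTF-8.
function to_utf8(string $data, array $candidates): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already valid UTF-8, leave untouched
    }
    foreach ($candidates as $encoding) {
        // Single-byte encodings such as ISO-8859-1 accept almost any byte
        // sequence, so list them last as a fallback.
        if (mb_check_encoding($data, $encoding)) {
            return mb_convert_encoding($data, 'UTF-8', $encoding);
        }
    }
    throw new RuntimeException('Input does not match any candidate encoding');
}

// "café" with "é" stored as the single ISO-8859-1 byte E9.
$latin1 = "caf\xE9";
$utf8 = to_utf8($latin1, ['ISO-8859-1']);
echo bin2hex($utf8), PHP_EOL; // 636166c3a9
```

For the GBK case used earlier, the call would look like to_utf8($xml, ['GBK']), assuming your mbstring build recognizes that encoding name. Because detection by trial is a heuristic, a known source encoding should always take precedence over guessing.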
Check whether the XML declaration at the top of the file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
If UTF-8 is declared but the actual encoding is something else, forcing UTF-8 in PHP will not help: garbled text or a parse failure can still occur. In that case, either fix the encoding of the source file or convert the data to UTF-8 in PHP before parsing.
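A mismatch between the declaration and the actual bytes can be checked before parsing. The sketch below is one way to do it, assuming the declaration sits at the very start of the data (optionally after a UTF-8 byte order mark); the function names declared_encoding() and is_mislabelled_utf8() are hypothetical.

```php
<?php
// Hypothetical helper: returns the encoding named in the XML declaration,
// or null when there is no declaration or no encoding attribute.
function declared_encoding(string $xml): ?string
{
    $pattern = '/^(?:\xEF\xBB\xBF)?<\?xml\b[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';
    return preg_match($pattern, $xml, $m) === 1 ? $m[2] : null;
}

// Hypothetical helper: true when the document claims UTF-8
// but its bytes are not valid UTF-8.
function is_mislabelled_utf8(string $xml): bool
{
    $declared = declared_encoding($xml);
    return $declared !== null
        && strcasecmp($declared, 'UTF-8') === 0
        && !mb_check_encoding($xml, 'UTF-8');
}

// Declared as UTF-8, but "é" is the single ISO-8859-1 byte E9.
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><word>caf\xE9</word>";
var_dump(declared_encoding($xml));   // string(5) "UTF-8"
var_dump(is_mislabelled_utf8($xml)); // bool(true)
```

When is_mislabelled_utf8() returns true, the data needs to be converted from its real encoding before it is handed to xml_parse().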