In PHP, xml_parse() is an event-driven XML parsing function. It is built on the Expat library or, in many builds, on libxml2 through an Expat compatibility layer. The parser works like a SAX (Simple API for XML) parser: as it reads the document, it calls the callback registered for each kind of markup it encounters.
Note that xml_parse() does not validate a document against its DTD (document type definition) or expose the DTD's structure. What it can do is pass the DOCTYPE declaration to a callback, which lets us detect the declaration and decide how to handle it.
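To make the event-driven model concrete, here is a minimal sketch that records the order in which callbacks fire for a tiny document. The `$events` array and the `start:`/`end:`/`text:` labels are illustrative names only; they are not part of the XML extension.

```php
<?php
// Collect parser events in order so the callback sequence is visible.
$events = [];

$parser = xml_parser_create();
// Case folding is on by default and would upper-case element names.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);

// Called for every opening and closing tag.
xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use (&$events) {
        $events[] = "start:$name";
    },
    function ($parser, string $name) use (&$events) {
        $events[] = "end:$name";
    }
);

// Called for text between tags; whitespace-only chunks are skipped here.
xml_set_character_data_handler(
    $parser,
    function ($parser, string $data) use (&$events) {
        if (trim($data) !== '') {
            $events[] = "text:$data";
        }
    }
);

xml_parse($parser, '<note><to>Tove</to></note>', true);
unset($parser);

echo implode("\n", $events), "\n";
```

Each handler only sees its own kind of event, which is why detecting a DOCTYPE declaration requires a handler that actually receives it.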
A DTD defines the structure and element types allowed in an XML document, which makes it relevant to both data validation and security. In some scenarios we want to detect whether an incoming document carries a DTD, or reject any document that does, as a defense against XXE attacks.
The following example uses xml_parser_create() and xml_parse() and tries to capture the DOCTYPE declaration.
<?php
$xmlString = <<<XML
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "http://m66.net/dtd/note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
XML;
// Create the XML parser
$parser = xml_parser_create();
// Handler for processing instructions (for example, an xml-stylesheet instruction).
// It is not called for the DOCTYPE declaration or for the XML declaration.
function handle_processing_instruction($parser, $target, $data) {
    echo "Processing instruction target: $target\n";
    echo "Processing instruction data: $data\n";
}
// Default handler: receives data that no other handler has claimed.
// The declaration may arrive in several pieces, so only its start is tested.
function handle_default($parser, $data) {
    if (preg_match('/^<!DOCTYPE/i', trim($data))) {
        echo "Detected DOCTYPE declaration: $data\n";
    }
}
// Register the callbacks
xml_set_processing_instruction_handler($parser, "handle_processing_instruction");
xml_set_default_handler($parser, "handle_default");
// Parse the whole document in one call (third argument marks the final chunk)
if (!xml_parse($parser, $xmlString, true)) {
    die(sprintf(
        "XML error: %s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    ));
}
// Free the parser (this call has no effect as of PHP 8.0)
xml_parser_free($parser);
?>
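handle_processing_instruction receives processing instructions of the form <?target data?>, such as <?xml-stylesheet ...?>. The XML declaration <?xml ...?> is not a processing instruction and is not passed to this handler, and neither is the DOCTYPE declaration. The sample document contains no processing instructions, so this handler is never called for it.
handle_default is a lower-level handler that receives data not claimed by any other handler. Here we use it to check for a <!DOCTYPE declaration. Whether the declaration reaches this handler, and in how many pieces, may depend on the library the XML extension was built against (Expat or libxml2), so verify the behavior on your target build.
preg_match('/^<!DOCTYPE/i', trim($data)) tests whether the chunk begins with a DOCTYPE declaration. The i flag is not strictly needed, because the DOCTYPE keyword is case-sensitive in XML.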
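Because what the default handler sees may vary between builds, one alternative is to inspect the raw string before it reaches xml_parse() at all. The sketch below assumes the input uses an ASCII-compatible encoding such as UTF-8; `containsDoctype()` and `parseWithoutDtd()` are hypothetical helper names, not functions from the XML extension.

```php
<?php
/**
 * Hypothetical helper: true if the raw XML text contains a DOCTYPE keyword.
 * Deliberately conservative: a "<!DOCTYPE" inside a comment or CDATA
 * section also counts as a match.
 */
function containsDoctype(string $xml): bool
{
    return strpos($xml, '<!DOCTYPE') !== false;
}

/**
 * Hypothetical helper: runs xml_parse() only on documents without a DOCTYPE.
 * Returns true on a successful parse, false if the document was rejected
 * or is not well-formed.
 */
function parseWithoutDtd(string $xml): bool
{
    if (containsDoctype($xml)) {
        return false; // rejected before any parsing happens
    }
    $parser = xml_parser_create();
    $ok = xml_parse($parser, $xml, true) === 1; // xml_parse() returns 1 on success
    unset($parser);
    return $ok;
}

var_dump(parseWithoutDtd('<note><to>Tove</to></note>'));
var_dump(parseWithoutDtd('<!DOCTYPE note SYSTEM "note.dtd"><note/>'));
```

The trade-off is a possible false positive when the text "<!DOCTYPE" appears inside a comment or CDATA section, which is usually acceptable when the goal is to reject untrusted input.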
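When using XML parsers, take care to prevent XXE (XML External Entity injection) attacks. xml_parse() does not fetch external entities unless you register a handler with xml_set_external_entity_ref_handler(), which limits its exposure to XXE. Internal entities declared in a DTD may still be expanded, though, so entity-expansion attacks remain a separate concern. If you use libxml2-based parsers such as DOM or SimpleXML, make sure external entity loading is disabled: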
libxml_disable_entity_loader(true);
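As of PHP 8.0, libxml_disable_entity_loader() is deprecated. With libxml2 2.9.0 or later, external entity loading is disabled by default, so the call is no longer necessary; entities are only substituted if you explicitly pass the LIBXML_NOENT flag.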
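For DOM-based code, a common defensive pattern is to load the document without entity substitution and then refuse it if a DOCTYPE node is present. The following is a minimal sketch under that assumption; `loadXmlRejectingDoctype()` is a hypothetical helper name, and the exception messages are illustrative.

```php
<?php
/**
 * Hypothetical helper: loads $xml with DOMDocument and throws if the
 * document is not well-formed or carries a DOCTYPE node.
 */
function loadXmlRejectingDoctype(string $xml): DOMDocument
{
    $doc = new DOMDocument();

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    // LIBXML_NONET forbids network access while loading.
    // LIBXML_NOENT and LIBXML_DTDLOAD are deliberately not passed.
    $loaded = $doc->loadXML($xml, LIBXML_NONET);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        throw new InvalidArgumentException('XML is not well-formed');
    }

    // A DOCTYPE declaration appears as a child node of the document.
    foreach ($doc->childNodes as $child) {
        if ($child->nodeType === XML_DOCUMENT_TYPE_NODE) {
            throw new InvalidArgumentException('DOCTYPE is not allowed');
        }
    }

    return $doc;
}

try {
    loadXmlRejectingDoctype('<!DOCTYPE note SYSTEM "note.dtd"><note/>');
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Checking the node list after loading keeps the decision inside the parser's own view of the document, so text that merely looks like a DOCTYPE inside a comment does not trigger the rejection.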
xml_parse() does not interpret the content of a DTD, but the DOCTYPE declaration can be detected through the default handler. The processing instruction handler is not called for it.
When dealing with XML from untrusted sources, treat DTDs and entity expansion with caution to avoid security vulnerabilities.
The remote DTD URL in the example (m66.net) is a placeholder; substitute your own domain when testing.
Through the above method, you can use xml_parse() to detect and process DTD declarations in XML with more flexibility.