When working with large XML files, single-threaded parsing can use too much memory or take too long. PHP does not natively support true multithreading (unless you use extensions such as pthreads or Swoole), but we can simulate it by starting multiple child processes (for example with proc_open or shell_exec) and parsing parts of a large XML file in parallel.
This article demonstrates how to combine the xml_parse function with child processes to implement pseudo-multithreaded parsing of large XML files.
xml_parse belongs to PHP's XML Parser extension, which exposes an Expat-style, event-based (SAX) interface. Because it fires callbacks as data arrives, it is well suited to streaming large XML inputs and uses far less memory than DOM, which loads the entire document into memory.
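As a minimal sketch of the event-based model, the snippet below feeds a small document to the parser in pieces and counts the item elements it sees. The sample string and the 16-byte piece size are assumptions chosen only for illustration.

```php
<?php
// Minimal sketch: count <item> elements while feeding the parser small pieces.
// The sample string and the 16-byte piece size are illustrative assumptions.
$xml = '<items><item><id>1</id></item><item><id>2</id></item></items>';

$itemCount = 0;
$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them (the default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use (&$itemCount): void {
        if ($name === 'item') {
            $itemCount++;
        }
    },
    function ($parser, string $name): void {
        // Nothing to do on closing tags in this sketch.
    }
);

$pieces = str_split($xml, 16);
$last = count($pieces) - 1;
foreach ($pieces as $i => $piece) {
    // The third argument marks the final piece so the parser can finish.
    if (xml_parse($parser, $piece, $i === $last) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
}

echo "items seen: {$itemCount}\n";
```

The parser never holds the whole document, only the piece it was last given, which is what keeps memory usage flat for large inputs.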
Multiple workers cannot share a single xml_parser object, but we can:
Split the large XML file into chunks (on node boundaries);
Use proc_open() or shell_exec() to start multiple PHP child processes;
Have each child process parse its own XML chunk;
Have the main process collect the results and merge them.
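The steps above can be sketched with proc_open as follows. This is only an illustration under stated assumptions: runWorkers() is a hypothetical helper name, the worker script is assumed to take a chunk path as its first argument, and no limit is placed on how many children run at once.

```php
<?php
// Sketch: start one child per chunk with proc_open(), then wait for all of them.
// runWorkers() and its parameters are hypothetical names used for illustration.
function runWorkers(string $workerScript, array $chunkFiles): array
{
    $children = [];
    foreach ($chunkFiles as $chunkFile) {
        // Array-form commands (PHP 7.4+) bypass the shell, so no quoting is needed.
        $command = [PHP_BINARY, $workerScript, $chunkFile];
        // An empty descriptor spec lets the child inherit the parent's stdio.
        $process = proc_open($command, [], $pipes);
        if (!is_resource($process)) {
            throw new RuntimeException("Could not start worker for {$chunkFile}");
        }
        $children[$chunkFile] = $process;
    }

    $exitCodes = [];
    foreach ($children as $chunkFile => $process) {
        // proc_close() blocks until that child exits and returns its exit code.
        $exitCodes[$chunkFile] = proc_close($process);
    }
    return $exitCodes;
}
```

Because every child is started before the first proc_close() call, the workers run concurrently, and the returned array maps each chunk to its exit code so that failed chunks can be retried. For a very large number of chunks, starting them in batches would likely be safer than launching all at once.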
Suppose we have a large XML file /data/huge.xml with the following structure:
<items>
<item><id>1</id><name>Item 1</name></item>
<item><id>2</id><name>Item 2</name></item>
...
</items>
<?php
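// main.php: split the source file into chunks and start one worker per chunk.
$sourceFile = '/data/huge.xml';
$tempDir = '/tmp/xml_chunks/';
$chunkSize = 1000; // Each child process parses 1000 <item> elements
$chunkFiles = [];

// Make sure the temporary directory exists
if (!is_dir($tempDir)) {
    mkdir($tempDir, 0777, true);
}

// Write the buffered items to a new chunk file wrapped in its own root element
function writeChunk(string $tempDir, int $index, string $buffer): string {
    $chunkFile = $tempDir . "chunk_{$index}.xml";
    file_put_contents($chunkFile, "<items>\n" . $buffer . "</items>\n");
    return $chunkFile;
}

// Split the XML document (assumes each <item> occupies exactly one line, as in the sample)
$handle = fopen($sourceFile, 'r');
if ($handle === false) {
    fwrite(STDERR, "Cannot open $sourceFile\n");
    exit(1);
}
$buffer = '';
$itemCount = 0;
while (($line = fgets($handle)) !== false) {
    // Skip the root tags and any other line that does not hold an item
    if (strpos($line, '<item>') === false) {
        continue;
    }
    $buffer .= $line;
    $itemCount++;
    if ($itemCount >= $chunkSize) {
        $chunkFiles[] = writeChunk($tempDir, count($chunkFiles), $buffer);
        $buffer = '';
        $itemCount = 0;
    }
}
fclose($handle);

// Flush the final, partially filled chunk
if ($itemCount > 0) {
    $chunkFiles[] = writeChunk($tempDir, count($chunkFiles), $buffer);
}

// Start one background worker per chunk (proc_open gives finer control over the children)
foreach ($chunkFiles as $chunkFile) {
    shell_exec('php worker.php ' . escapeshellarg($chunkFile) . ' > /dev/null &');
}
echo "Started " . count($chunkFiles) . " parsing tasks.\n";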
<?php
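// worker.php: parse one chunk file and append each item as a JSON line.
if ($argc < 2) {
    fwrite(STDERR, "Please pass the path of the XML file as an argument\n");
    exit(1);
}
$xmlFile = $argv[1];
if (!is_file($xmlFile)) {
    fwrite(STDERR, "File does not exist: $xmlFile\n");
    exit(1);
}
$parser = xml_parser_create();
xml_set_element_handler($parser, "startElement", "endElement");
xml_set_character_data_handler($parser, "characterData");
$currentTag = '';
$currentItem = [];
function startElement($parser, $name, $attrs) {
    global $currentTag;
    // Element names arrive upper-cased because case folding is on by default
    $currentTag = strtolower($name);
}
function endElement($parser, $name) {
    global $currentTag, $currentItem;
    if (strtolower($name) == 'item') {
        // Example: save the parsed item to a file or a database.
        // LOCK_EX keeps lines written by concurrent workers from interleaving.
        file_put_contents('/tmp/parsed_result.txt', json_encode($currentItem) . PHP_EOL, FILE_APPEND | LOCK_EX);
        $currentItem = [];
    }
    $currentTag = '';
}
function characterData($parser, $data) {
    global $currentTag, $currentItem;
    $text = trim($data);
    // Compare with '' so that a value such as "0" is not discarded
    if ($text !== '' && $currentTag !== '') {
        // Text may arrive in several pieces, so append instead of overwriting
        $currentItem[$currentTag] = ($currentItem[$currentTag] ?? '') . $text;
    }
}
$fp = fopen($xmlFile, 'r');
while (!feof($fp)) {
    $data = fread($fp, 4096);
    // The third argument tells the parser when the last block has been passed
    if (!xml_parse($parser, $data, feof($fp))) {
        fwrite(STDERR, sprintf("XML error: %s at line %d\n",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)));
        exit(1);
    }
}
fclose($fp);
xml_parser_free($parser);
echo "Parsing completed: $xmlFile\n";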
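The worker writes one JSON object per line to /tmp/parsed_result.txt, so the main process can collect the results by streaming that file back. The sketch below assumes that output format; readResults() is a hypothetical helper name, not part of the scripts above.

```php
<?php
// Sketch: stream a JSON-lines result file back into PHP arrays, one item at a time.
// readResults() is a hypothetical helper name used for illustration.
function readResults(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open result file: {$path}");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // Ignore blank lines
            }
            $item = json_decode($line, true);
            if (is_array($item)) {
                yield $item; // Lines that are not valid JSON objects are skipped
            }
        }
    } finally {
        fclose($handle);
    }
}
```

A caller could loop with `foreach (readResults('/tmp/parsed_result.txt') as $item) { ... }`. Since the workers append concurrently, the order of the lines does not follow the order of the source file; if order matters, sorting on the id field afterwards is one option.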
Performance improvement: On multi-core CPUs each child process runs independently, so parsing the chunks in parallel can shorten the total run time.
Memory control: Each child process handles a bounded amount of data, which keeps any single process from exhausting memory.
Security: In production, do not accept the file path directly from a URL or any other user-supplied parameter; validate it against a whitelist of allowed directories first.
Process management: You can replace shell_exec with proc_open, pcntl_fork, or Swoole for more robust child-process management.
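One possible form of the whitelist check mentioned above is to resolve the path and confirm that it lies inside an allowed directory. This is a sketch only; isAllowedChunk() and the directory passed to it are hypothetical names chosen for illustration.

```php
<?php
// Sketch: accept a chunk path only if it resolves inside an allowed directory.
// isAllowedChunk() and its parameters are illustrative assumptions.
function isAllowedChunk(string $path, string $allowedDir): bool
{
    $realDir = realpath($allowedDir);
    $realPath = realpath($path); // false when the path does not exist
    if ($realDir === false || $realPath === false) {
        return false;
    }
    // Compare against the directory plus a trailing separator so that
    // "/tmp/xml_chunks_evil/x.xml" does not match "/tmp/xml_chunks".
    $prefix = rtrim($realDir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    return strncmp($realPath, $prefix, strlen($prefix)) === 0 && is_file($realPath);
}
```

Because realpath() resolves `..` segments and symbolic links before the comparison, a path such as `/tmp/xml_chunks/../secret.xml` is rejected even though it starts with the allowed directory name.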
Although PHP is not an ideal language for concurrent processing, combining xml_parse with process control lets us parse large XML files efficiently. The approach is particularly suitable for batch jobs such as log processing and data import.
If further improvement is needed, consider rewriting the parsing module in a more concurrency-friendly language such as Go or Python and scheduling it from PHP.