In PHP, array_diff() is a commonly used function that compares the values of two or more arrays and returns the values of the first array that are not present in any of the others. This is convenient in everyday development, for example when filtering data or finding differences between two lists. But how does array_diff() perform when the amount of data becomes large?
First, let's quickly understand how array_diff() works.
$result = array_diff($array1, $array2);
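Conceptually, this function takes each value of $array1 and checks whether it appears in $array2. Values are compared by their string representation: two elements are considered equal only if (string) $elem1 === (string) $elem2, so this is neither a == nor a === comparison on the original values. The keys of $array1 are preserved in the result. Implemented naively as nested loops, such a comparison costs O(n * m), where n is the length of $array1 and m is the length of $array2. The actual cost depends on how your PHP version implements the function internally, and converting every value to a string adds overhead of its own, so it is worth measuring rather than assuming.
Let's measure it with a simple test: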
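To make the comparison rule concrete, here is a small sketch with made-up sample values (they are not from any real data set). It shows that integers, floats and numeric strings are matched by their string form, and that the surviving elements keep their original keys.

```php
<?php
// Sample values chosen only for illustration.
// 1   matches '1' because (string) 1 === '1'
// '2' matches 2   because (string) 2 === '2'
// 3.0 becomes '3' and 'a' stays 'a'; neither appears in the second array.
$result = array_diff([1, '2', 3.0, 'a'], ['1', 2]);

var_dump($result);
// The remaining elements keep their original keys 2 and 3.

// Reindex from 0 if the original keys are not needed.
$reindexed = array_values($result);
var_dump($reindexed);
```

If you rely on the positions in the result, remember the array_values() step, because the keys are not renumbered automatically.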
<?php
// Two overlapping arrays of about 100,000 integers each
$array1 = range(1, 100000);
$array2 = range(50000, 150000);

// Time only the array_diff() call
$start = microtime(true);
$result = array_diff($array1, $array2);
$end = microtime(true);

echo "Number of differences: " . count($result) . PHP_EOL;
echo "Execution time: " . ($end - $start) . " seconds" . PHP_EOL;
?>
In this code, we compare two arrays of about 100,000 elements each. The execution time you see when running this script depends on your PHP version and on server performance.
Although array_diff() performs well for small arrays, its cost can become noticeable with millions of elements or more. If you really need to process large arrays, one common optimization is to replace it with a hash lookup on array keys:
<?php
$array1 = range(1, 1000000);
$exclude = range(500000, 1500000);

// Start timing before building the lookup table so that its cost is included
$start = microtime(true);

// Flip values into keys so that each membership test is a hash lookup
$array2 = array_flip($exclude);

$result = [];
foreach ($array1 as $value) {
    if (!isset($array2[$value])) {
        $result[] = $value;
    }
}
$end = microtime(true);

echo "Number of differences: " . count($result) . PHP_EOL;
echo "Execution time: " . ($end - $start) . " seconds" . PHP_EOL;
?>
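This approach runs in O(n + m): array_flip() builds the lookup table in O(m), and each isset() check is O(1) on average, so no nested loop is needed. Two caveats apply: array_flip() only works when the values are integers or strings, and the result here is reindexed from 0 instead of keeping the original keys as array_diff() does.
As a practical example, suppose you need to find the email addresses in a user upload that are not yet registered: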
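The loop above can be wrapped in a reusable function. The following is a minimal sketch, not a drop-in replacement: the name hash_diff() is hypothetical, it handles only two arrays, and it assumes every value is an integer or a string. Unlike the inline loop, it keeps the original keys so that its output can be compared directly with array_diff().

```php
<?php
/**
 * Hypothetical helper: set difference via a key lookup table.
 * Assumption: all values in both arrays are integers or strings.
 */
function hash_diff(array $source, array $exclude): array
{
    // Build the lookup table: every excluded value becomes a key.
    $lookup = [];
    foreach ($exclude as $value) {
        $lookup[$value] = true;
    }

    // Keep each source element whose value is not a key in the table.
    $result = [];
    foreach ($source as $key => $value) {
        if (!isset($lookup[$value])) {
            $result[$key] = $value; // preserve keys, as array_diff() does
        }
    }
    return $result;
}

$a = range(1, 10);
$b = range(6, 15);

// Both calls should yield the values 1 to 5 under the keys 0 to 4.
$same = hash_diff($a, $b) === array_diff($a, $b);
var_dump($same);
```

For integer and string values, PHP's array key normalization ('1' and 1 share a key) lines up with the string comparison used by array_diff(); for other types, such as floats, null or nested arrays, this sketch does not apply.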
<?php
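// Read one address per line; file() returns false if the file cannot be read
$uploadedEmails = file('https://m66.net/uploads/email_list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
if ($uploadedEmails === false) {
    exit('Could not read the uploaded email list' . PHP_EOL);
}
$registeredEmails = getRegisteredEmailsFromDatabase(); // Returns an array
$unregistered = array_diff($uploadedEmails, $registeredEmails);
foreach ($unregistered as $email) {
    echo "Not registered: $email" . PHP_EOL;
}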
?>
In this example, if the uploaded file contains hundreds of thousands or even millions of email addresses, using array_diff() directly may become a performance bottleneck.
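One way to apply the hash lookup idea to this scenario is sketched below. The function name findUnregisteredEmails() is hypothetical, the inline sample arrays stand in for the uploaded file and the database query, and the decision to trim addresses and compare them case-insensitively is an assumption that may not match your registration rules. The arrow function requires PHP 7.4 or later.

```php
<?php
// Hypothetical helper, not part of the original example.
function findUnregisteredEmails(array $uploaded, array $registered): array
{
    // Assumption: addresses are compared case-insensitively after trimming.
    $normalize = static fn(string $email): string => strtolower(trim($email));

    // Registered addresses become keys, so each check is a hash lookup.
    $known = array_fill_keys(array_map($normalize, $registered), true);

    $unregistered = [];
    foreach ($uploaded as $email) {
        $key = $normalize($email);
        if ($key !== '' && !isset($known[$key])) {
            // Keyed by the normalized form, so duplicates collapse into one entry.
            $unregistered[$key] = $email;
        }
    }
    return array_values($unregistered);
}

// Inline sample data in place of the uploaded file and the database.
$uploaded = ['alice@example.com', 'Bob@Example.com ', 'carol@example.com', 'carol@example.com'];
$registered = ['bob@example.com'];

$missing = findUnregisteredEmails($uploaded, $registered);
foreach ($missing as $email) {
    echo "Not registered: $email" . PHP_EOL;
}
```

With the sample data, only the addresses for alice and carol are reported, and carol appears once even though she was uploaded twice.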
Although array_diff() is simple to use and has clear semantics, its performance may not be ideal for large arrays. When the data volume is huge, consider implementing the set difference manually with a lower-level technique, such as a hash table built from array keys, and measure whether it is actually faster in your environment.
In short: array_diff() is convenient for small data sets, while a manual, measured optimization is the safer choice for large ones.
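Because the outcome depends on the PHP version and the data, it helps to time both approaches side by side before committing to one. The sketch below uses a hypothetical timeIt() helper and arbitrary array sizes; hrtime() requires PHP 7.3 or later. No timings are shown here because they vary from machine to machine.

```php
<?php
// Hypothetical timing helper: returns the callback's result and elapsed milliseconds.
function timeIt(callable $fn): array
{
    $start = hrtime(true);                      // monotonic clock, in nanoseconds
    $result = $fn();
    $elapsedMs = (hrtime(true) - $start) / 1e6; // convert to milliseconds
    return [$result, $elapsedMs];
}

// Arbitrary sizes; adjust them to resemble your real workload.
$array1 = range(1, 200000);
$array2 = range(100000, 300000);

[$builtin, $builtinMs] = timeIt(fn() => array_diff($array1, $array2));

[$manual, $manualMs] = timeIt(function () use ($array1, $array2) {
    $lookup = array_flip($array2);              // lookup table cost is included
    $result = [];
    foreach ($array1 as $key => $value) {
        if (!isset($lookup[$value])) {
            $result[$key] = $value;             // keep keys to match array_diff()
        }
    }
    return $result;
});

printf("array_diff:  %d items, %.2f ms\n", count($builtin), $builtinMs);
printf("hash lookup: %d items, %.2f ms\n", count($manual), $manualMs);

// Sanity check: both approaches must agree before the timings mean anything.
var_dump($builtin === $manual);
```

Run it several times and with realistic data, since a single measurement can be skewed by memory allocation and other activity on the machine.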