In practice, file deduplication is a common requirement, especially when storing a large number of files. Avoiding duplicate files saves disk space and can also improve system efficiency. PHP provides a convenient built-in function, md5_file(), which computes a hash of a file's contents so that you can easily determine whether a file is a duplicate.
md5_file() is a built-in PHP function that calculates the MD5 hash of the contents of a given file. The basic syntax is as follows:
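md5_file(string $filename, bool $binary = false): string|false
$filename: The path of the file to hash.
$binary: Whether to return the digest in raw binary form (16 bytes). The default is false, which returns a 32-character hexadecimal string. Older versions of the PHP manual call this parameter $raw_output.
If the file cannot be read, the function returns false.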
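The function returns a digest of the file contents. Files with identical contents always produce the same digest, and an accidental collision between two different files is extremely unlikely, so the digest is well suited to checking whether two files have the same content.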
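As a minimal sketch of that comparison, the snippet below checks whether two files produce the same MD5 digest. The helper name haveSameMd5() and the temporary files are illustrative assumptions, not part of PHP or of the original example.

```php
<?php
// Hypothetical helper (name chosen for illustration): returns true when both
// files are readable and produce the same MD5 digest.
function haveSameMd5(string $pathA, string $pathB): bool
{
    // Check readability first so md5_file() does not emit a warning.
    if (!is_readable($pathA) || !is_readable($pathB)) {
        return false;
    }

    $hashA = md5_file($pathA);
    $hashB = md5_file($pathB);

    // md5_file() returns false on failure; never treat two failures as a match.
    if ($hashA === false || $hashB === false) {
        return false;
    }

    return $hashA === $hashB;
}

// Demo with throwaway files in the system temp directory.
$first  = tempnam(sys_get_temp_dir(), 'dup');
$second = tempnam(sys_get_temp_dir(), 'dup');
$third  = tempnam(sys_get_temp_dir(), 'dup');
file_put_contents($first, 'hello');
file_put_contents($second, 'hello');
file_put_contents($third, 'something else');

var_dump(haveSameMd5($first, $second)); // bool(true)
var_dump(haveSameMd5($first, $third));  // bool(false)

unlink($first);
unlink($second);
unlink($third);
```

The explicit false checks matter: without them, two unreadable files would compare as equal and could be mistaken for duplicates.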
The idea is simple:
Iterate through all files in the target folder.
Use md5_file() to calculate the hash value of each file.
Use an array to record the hash values that have already appeared.
If a file's hash value already exists in the array, treat the file as a duplicate and either delete it or skip it.
Here is the sample code:
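<?php
$directory = '/path/to/your/files'; // Directory to scan
$hashes = []; // Maps each hash to the first file path seen with that content

// Recursively traverse all files in the directory, skipping "." and ".."
$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
);

foreach ($files as $file) {
    // Skip directories, and skip symbolic links so that a link is never
    // reported (or deleted) as a duplicate of the file it points to
    if (!$file->isFile() || $file->isLink()) {
        continue;
    }

    $filePath = $file->getRealPath();
    $fileHash = ($filePath === false) ? false : md5_file($filePath); // MD5 of the file contents
    if ($fileHash === false) {
        // Unreadable file: skip it instead of recording a bogus array key
        continue;
    }

    if (isset($hashes[$fileHash])) {
        // Duplicate found; handle it here, for example by deleting it
        echo "Duplicate file: {$filePath} has the same content as {$hashes[$fileHash]}\n";
        // unlink($filePath); // Uncomment this line to delete the duplicate
    } else {
        // First time this content is seen: record its hash
        $hashes[$fileHash] = $filePath;
    }
}
?>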
Batch processing: When a directory contains many large files, scan it in batches so that a single run does not consume too much memory.
Cache hashes: For frequently scanned directories, cache the hash results in a database or a file and read them back on the next run to improve efficiency.
Alternative algorithms: MD5 is fast, but it is not collision-resistant. If security requirements are high, consider hash_file() with a stronger algorithm such as sha256. sha1_file() is also available, although SHA-1 is no longer considered collision-resistant either.
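The caching and alternative-algorithm suggestions can be combined. The sketch below is one possible approach, not a definitive implementation: the helper name cachedFileHash(), the cache layout (path mapped to size, modification time, algorithm, and hash), and the JSON cache file are all assumptions made for illustration. It uses hash_file() with sha256 and recomputes a hash only when the file's size or modification time has changed.

```php
<?php
// Hypothetical helper: returns the hash of $path, reusing the cached value
// when size, modification time, and algorithm all match the cached entry.
// Returns false if the file cannot be read.
function cachedFileHash(string $path, array &$cache, string $algo = 'sha256')
{
    // PHP caches stat results per request; clear them so edits are noticed.
    clearstatcache(true, $path);
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    $size  = filesize($path);
    $mtime = filemtime($path);
    $entry = $cache[$path] ?? null;

    if ($entry !== null
        && $entry['size'] === $size
        && $entry['mtime'] === $mtime
        && $entry['algo'] === $algo
    ) {
        return $entry['hash']; // Cache hit: the file is not read again
    }

    $hash = hash_file($algo, $path);
    if ($hash !== false) {
        $cache[$path] = ['size' => $size, 'mtime' => $mtime, 'algo' => $algo, 'hash' => $hash];
    }
    return $hash;
}

// Demo: the cache file location is an assumption; a database table would also work.
$cacheFile = sys_get_temp_dir() . '/file_hash_cache_demo.json';
$cache = is_file($cacheFile)
    ? (json_decode(file_get_contents($cacheFile), true) ?: [])
    : [];

$sample = tempnam(sys_get_temp_dir(), 'hash');
file_put_contents($sample, 'hello');

echo cachedFileHash($sample, $cache), "\n"; // Computed with hash_file()
echo cachedFileHash($sample, $cache), "\n"; // Served from $cache

unset($cache[$sample]); // The throwaway sample should not be persisted
unlink($sample);
file_put_contents($cacheFile, json_encode($cache)); // Save for the next run
```

Comparing size and modification time is a heuristic: a file rewritten with the same size within the same second would not be detected, so a full rescan is still worthwhile from time to time.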
PHP official documentation: md5_file()