
Implementing a file deduplication algorithm based on md5_file()

M66 2025-05-31

In real-world development, file deduplication is a common requirement, especially when storing large numbers of files. Avoiding duplicates not only saves disk space but also improves system efficiency. PHP provides a very convenient function, md5_file() , which computes a hash of a file's contents, making it easy to determine whether two files are duplicates.

What is md5_file()?

md5_file() is a built-in PHP function that calculates the MD5 hash of a specified file's contents. The basic syntax is as follows:

 md5_file(string $filename, bool $raw_output = false): string|false
  • $filename : Path of the file to hash.

  • $raw_output : Whether to return the raw binary digest; defaults to false , in which case a 32-character hexadecimal string is returned.

The function returns a digest of the file's contents (or false if the file cannot be read), which makes it well suited to determining whether two files contain the same data.
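
To get a feel for the function, here is a minimal sketch that compares two files by hash; the paths are placeholders you would replace with real files on your system:

 <?php
// Placeholder paths - replace with real files on your system
$fileA = '/tmp/a.txt';
$fileB = '/tmp/b.txt';

$hashA = md5_file($fileA);
$hashB = md5_file($fileB);

// md5_file() returns false when a file cannot be read
if ($hashA !== false && $hashA === $hashB) {
    echo "The files have identical contents.\n";
} else {
    echo "The files differ (or could not be read).\n";
}
?>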

Implementing a simple and efficient file deduplication algorithm

The idea is very simple:

  1. Iterate through all files in the target directory.

  2. Compute each file's MD5 hash with md5_file() .

  3. Record the hashes seen so far in an array.

  4. If a file's hash is already in the array, treat the file as a duplicate and either delete or skip it.

Here is the sample code:

 <?php
$directory = '/path/to/your/files'; // Directory to scan
$hashes = []; // Maps hash => first file path seen with that hash

// Recursively traverse all files in the directory (skipping "." and "..")
$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
);

foreach ($files as $file) {
    if ($file->isFile()) {
        $filePath = $file->getRealPath();
        $fileHash = md5_file($filePath); // Compute the file's MD5 hash

        if ($fileHash === false) {
            continue; // Skip files that cannot be read
        }

        if (isset($hashes[$fileHash])) {
            // Duplicate found - handle it, e.g. by deleting
            echo "Duplicate file: {$filePath} matches {$hashes[$fileHash]}\n";
            // unlink($filePath); // Uncomment this line to delete duplicates
        } else {
            // Record the new hash
            $hashes[$fileHash] = $filePath;
        }
    }
}
?>

Optimization suggestions

  • Batch processing : When a directory contains a very large number of files, scan in batches to avoid consuming too much memory at once.

  • Cache hashes : For frequently scanned directories, cache the hash results in a database or file and read them back on the next run to improve efficiency (see the sketch after this list).

  • Alternative algorithms : MD5 is fast but cryptographically weak; if stronger collision resistance is required, consider sha1_file() or hash_file() instead.
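
As a rough illustration of the caching idea, the sketch below persists hashes to a JSON file keyed by path and modification time, so unchanged files are not re-hashed on the next run. The cache location and the use of hash_file('sha256', ...) are assumptions for this example, not part of the algorithm above:

 <?php
$directory = '/path/to/your/files';   // Directory to scan
$cacheFile = '/tmp/hash_cache.json';  // Hypothetical cache location

// Load the previous cache, if any: path => ['mtime' => int, 'hash' => string]
$cache = is_file($cacheFile)
    ? json_decode(file_get_contents($cacheFile), true)
    : [];

$hashes = [];
$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($directory, FilesystemIterator::SKIP_DOTS)
);

foreach ($files as $file) {
    if (!$file->isFile()) {
        continue;
    }
    $path  = $file->getRealPath();
    $mtime = $file->getMTime();

    // Reuse the cached hash if the file has not changed since the last scan
    if (isset($cache[$path]) && $cache[$path]['mtime'] === $mtime) {
        $hash = $cache[$path]['hash'];
    } else {
        $hash = hash_file('sha256', $path); // Stronger alternative to MD5
        if ($hash === false) {
            continue; // Skip files that cannot be read
        }
        $cache[$path] = ['mtime' => $mtime, 'hash' => $hash];
    }

    if (isset($hashes[$hash])) {
        echo "Duplicate file: {$path} matches {$hashes[$hash]}\n";
    } else {
        $hashes[$hash] = $path;
    }
}

// Persist the updated cache for the next run
file_put_contents($cacheFile, json_encode($cache));
?>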
