File deduplication is an important part of data processing: when large numbers of files have to be handled, removing duplicates saves storage space and improves system efficiency. PHP provides a powerful function, hash_update_stream() , which helps us perform hash calculations more efficiently during file deduplication. This article explains in detail how to use hash_update_stream() to implement efficient file deduplication.
hash_update_stream() is one of PHP's built-in hash functions. It reads data from an open stream and feeds it into a hashing context, so large files can be processed without loading them into memory all at once. This makes it well suited to hashing large data files and to updating a hash value incrementally.
Its function signature is as follows:
int hash_update_stream ( HashContext $context , resource $stream , int $length = -1 )
$context : The hashing context created by hash_init() (a HashContext object since PHP 7.2, a resource in earlier versions).
$stream : The open file or stream resource whose contents are fed into the hash.
$length : The maximum number of bytes to copy from the stream into the hashing context; the default of -1 means read until the end of the stream. The function returns the number of bytes actually added to the hashing context.
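Before the full example, here is a minimal sketch that hashes a single file chunk by chunk; the file path is a placeholder and error handling is omitted for brevity:
<?php
// Minimal sketch: hash one file incrementally (placeholder path, no error handling)
$context = hash_init('sha256');            // create the hashing context
$stream  = fopen('/path/to/large.file', 'rb');
while (!feof($stream)) {
    // Feed up to 8192 bytes from the stream into the context on each call
    hash_update_stream($context, $stream, 8192);
}
fclose($stream);
echo hash_final($context), "\n";           // print the hex digest
?>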
For file deduplication, we typically calculate a hash value for each file and then check whether that hash has been seen before. If it has, the file is a duplicate and can be deleted.
With hash_update_stream() , we can hash large files step by step without consuming much memory, which improves the efficiency of deduplication.
Here is a simple PHP file-deduplication example that uses hash_update_stream() to compute the file hashes:
<?php
// Set the hashing algorithm
$hash_algorithm = 'sha256';
// Path of the folder to deduplicate
$directory = '/path/to/your/files';
// Array that stores the hash values already seen
$hashes = [];
// Get all entries in the directory
$files = scandir($directory);
foreach ($files as $file) {
    // Skip '.' and '..'
    if ($file === '.' || $file === '..') {
        continue;
    }
    $file_path = $directory . DIRECTORY_SEPARATOR . $file;
    // Skip anything that is not a regular file (e.g. subdirectories)
    if (!is_file($file_path)) {
        continue;
    }
    // Initialize the hash context
    $context = hash_init($hash_algorithm);
    // Open the file
    $file_resource = fopen($file_path, 'rb');
    if ($file_resource) {
        // Update the hash value chunk by chunk
        while (!feof($file_resource)) {
            hash_update_stream($context, $file_resource, 8192);
        }
        // Close the file resource
        fclose($file_resource);
        // Get the final hash value of the file
        $hash = hash_final($context);
        // Check whether the hash value already exists
        if (in_array($hash, $hashes)) {
            // The file is a duplicate, delete it
            unlink($file_path);
            echo "Deleted duplicate file: $file\n";
        } else {
            // Otherwise, remember the hash value
            $hashes[] = $hash;
        }
    } else {
        echo "Unable to open file: $file\n";
    }
}
echo "File deduplication is completed!\n";
?>
Get all files in the folder : First, we use the scandir() function to list all entries in the directory. Note that we skip the . and .. entries, as well as anything that is not a regular file.
Compute the hash value step by step : For each file, a hash context is initialized with hash_init() , and the file is then read chunk by chunk while the hash is updated with hash_update_stream() .
Deduplication judgment : The calculated hashes are stored in the $hashes array, and we check whether the hash of the current file is already in it. If it is, the file is a duplicate and is deleted; otherwise the hash is added to the array and the next file is processed. (A faster keyed-array variant is sketched below.)
Memory optimization : Because hash_update_stream() reads the file in chunks, we avoid loading the entire file into memory and can therefore handle very large files.
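As mentioned in the deduplication-judgment note above, a small refinement is to use the hash itself as the array key, so the lookup becomes a constant-time isset() check instead of an in_array() scan over the whole array. A minimal, self-contained sketch (the function name and paths are illustrative):
<?php
// Sketch: constant-time duplicate check by keying the array on the hash value
function is_duplicate(string $hash, array &$seen, string $file_path): bool
{
    if (isset($seen[$hash])) {
        return true;                 // this hash was already seen
    }
    $seen[$hash] = $file_path;       // remember the first file with this hash
    return false;
}

$seen = [];                          // hash => path of the first file seen
// Example usage with placeholder values:
var_dump(is_duplicate(hash('sha256', 'hello'), $seen, '/tmp/a.txt')); // false
var_dump(is_duplicate(hash('sha256', 'hello'), $seen, '/tmp/b.txt')); // true
?>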
Concurrent deduplication : For large numbers of files, multi-threading or batch processing can further increase deduplication speed. PHP does not support multithreading natively, but concurrency can be achieved with extensions such as pthreads or by distributing the work across multiple processes.
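To illustrate the multi-process route, here is a hedged sketch that splits the file list into batches and forks one worker per batch; each worker writes hash and path pairs to its own temporary file, and the parent process then merges the results and deletes the duplicates. It assumes the CLI SAPI on a Unix-like system with the pcntl extension enabled; hash_file_stream() is a hypothetical helper wrapping the hash_init() / hash_update_stream() / hash_final() steps from the main example, the directory path is a placeholder, and error handling (e.g. a failed fork) is omitted for brevity.
<?php
// Sketch only: requires the pcntl extension (CLI SAPI, Unix-like systems).
// hash_file_stream() is a hypothetical helper based on the main example above.
function hash_file_stream(string $path, string $algo = 'sha256'): string
{
    $context = hash_init($algo);
    $stream  = fopen($path, 'rb');
    while (!feof($stream)) {
        hash_update_stream($context, $stream, 8192);
    }
    fclose($stream);
    return hash_final($context);
}

$directory = '/path/to/your/files';                  // placeholder path
$paths   = array_values(array_filter(glob($directory . '/*') ?: [], 'is_file'));
$workers = 4;                                        // number of worker processes
$batches = array_chunk($paths, max(1, (int) ceil(count($paths) / $workers)));

$resultFiles = [];
foreach ($batches as $i => $batch) {
    $resultFiles[$i] = sys_get_temp_dir() . "/dedup_worker_$i.txt";
    if (pcntl_fork() === 0) {
        // Child process: hash its batch and write "hash<TAB>path" lines
        $out = fopen($resultFiles[$i], 'w');
        foreach ($batch as $path) {
            fwrite($out, hash_file_stream($path) . "\t" . $path . "\n");
        }
        fclose($out);
        exit(0);
    }
}

// Parent process: wait for all children to finish
while (pcntl_waitpid(-1, $status) > 0) {
    // keep waiting until no child processes remain
}

// Merge the per-worker results and delete duplicates
$seen = [];
foreach ($resultFiles as $resultFile) {
    foreach (file($resultFile, FILE_IGNORE_NEW_LINES) as $line) {
        [$hash, $path] = explode("\t", $line, 2);
        if (isset($seen[$hash])) {
            unlink($path);                           // duplicate file
        } else {
            $seen[$hash] = $path;
        }
    }
    unlink($resultFile);                             // remove the temp file
}
?>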
Hash collision : Although the probability of a hash collision is extremely low, in theory two files with different contents could produce the same hash and be misjudged as duplicates. Choosing a sufficiently strong hashing algorithm (such as sha256 ) greatly reduces this risk.
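For complete safety, a hash match can be confirmed with a byte-by-byte comparison before anything is deleted. Here is a minimal sketch of such a check (the function name is illustrative); since it only runs when two hashes already match, the extra cost is negligible in practice:
<?php
// Sketch: confirm that two files really have identical contents before deleting.
// Intended to be called only when their hashes already match.
function files_are_identical(string $a, string $b): bool
{
    if (filesize($a) !== filesize($b)) {
        return false;                       // different sizes cannot be identical
    }
    $fa = fopen($a, 'rb');
    $fb = fopen($b, 'rb');
    $identical = true;
    while (!feof($fa) && !feof($fb)) {
        if (fread($fa, 8192) !== fread($fb, 8192)) {
            $identical = false;             // found a differing chunk
            break;
        }
    }
    fclose($fa);
    fclose($fb);
    return $identical;
}

// Usage idea: store hash => path, and when a hash repeats, delete the new file
// only if files_are_identical($seen[$hash], $file_path) returns true.
?>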
With the hash_update_stream() function, we can implement file deduplication very efficiently; when processing large files in particular, it significantly reduces memory consumption. With a simple hash comparison we can delete duplicate files, save storage space, and improve system performance.
I hope this article helps you better understand and use hash_update_stream() to implement file deduplication! If you have any questions, please leave a message in the comments.