In PHP, file hashes are commonly used for data integrity verification, file deduplication, digital signatures, and similar tasks. When processing large files, hash_file() and hash_update_stream() are the two usual approaches. Although they produce the same digests, they differ in convenience and performance characteristics. This article compares the two methods when processing large files, to help developers choose the right function and improve application efficiency.
The hash_file() function is used to directly calculate the hash value of the specified file. The syntax is as follows:
string hash_file ( string $algo , string $filename [, bool $binary = false ] )
$algo: the hashing algorithm to use, such as md5 or sha256.
$filename: the path of the file whose hash should be calculated.
$binary: if true, the raw binary digest is returned; if false (the default), a lowercase hexadecimal string is returned. On failure (for example, an unreadable file), hash_file() returns false.
For example, to calculate the sha256 hash value of a file:
$fileHash = hash_file('sha256', 'largefile.txt');
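If the raw digest is needed instead, pass true as the third argument; bin2hex() converts it back to the hexadecimal form (both calls below assume the same largefile.txt):
$hex = hash_file('sha256', 'largefile.txt'); // 64 hexadecimal characters
$raw = hash_file('sha256', 'largefile.txt', true); // 32 raw bytes
var_dump($hex === bin2hex($raw)); // bool(true)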
hash_file() is very convenient because it reads the file and calculates the hash in a single call. It is often assumed that this loads the entire file into memory; in fact, PHP reads the file through its stream layer in small chunks, so memory consumption stays low even for large files. What the one-call API does not give you is control: you cannot choose the read size, report progress, or hash data that is not addressable as a file.
Unlike hash_file(), hash_update_stream() belongs to the incremental hashing API (hash_init(), hash_update_stream(), hash_final()), which feeds data into the hash step by step. This is especially useful for large files and for data arriving from arbitrary streams, because the input is processed chunk by chunk under your control rather than in one opaque call.
int hash_update_stream ( HashContext $context , resource $stream [, int $length = -1 ] )
$context: a hashing context created with hash_init().
$stream: an open stream (for example, a file handle returned by fopen()) from which the data is read.
$length: optional maximum number of bytes to copy from the stream; the default of -1 reads until EOF. The function returns the number of bytes actually added to the hash.
The basic steps for calculating the hash of a large file are as follows:
$hashContext = hash_init('sha256'); // Initialize the hash context
$handle = fopen('largefile.txt', 'rb'); // Open the file for binary reading
while (!feof($handle)) {
    hash_update_stream($hashContext, $handle, 8192); // Read and hash up to 8 KB per call
}
$hash = hash_final($hashContext); // Get the final hash value
fclose($handle);
By reading the file in chunks and updating the hash incrementally, this approach keeps the buffered data bounded by the chunk size no matter how large the file is, which makes memory usage easy to reason about.
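Note that hash_update_stream() pulls data from a stream resource itself. If your chunks are already plain strings (for example, data received over a socket), the closely related hash_update() accepts string data instead. A minimal equivalent of the loop above, reading the strings ourselves with fread():
$hashContext = hash_init('sha256');
$handle = fopen('largefile.txt', 'rb');
while (!feof($handle)) {
    $data = fread($handle, 8192); // Read one chunk into a string
    hash_update($hashContext, $data); // hash_update() takes string data, not a stream
}
$hash = hash_final($hashContext);
fclose($handle);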
hash_file(): hands the whole calculation to a single call. Contrary to a common assumption, it does not load the entire file into memory at once: internally PHP reads the file through its stream layer in small chunks, so even files of several GB do not exhaust memory. What you give up is any control over how the file is read.
hash_update_stream(): reads the file step by step under your control, so memory usage is just as low but explicit: only the current chunk is buffered at any moment. For large files its advantage is therefore flexibility rather than raw memory savings: you choose the chunk size, can report progress between chunks, and can feed several streams into one digest.
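To verify the memory behaviour on your own system rather than taking it on faith, memory_get_peak_usage() reports the request's memory high-water mark. A minimal sketch for the hash_file() side, assuming the same largefile.txt:
$hash = hash_file('sha256', 'largefile.txt');
echo "hash_file peak memory: " . memory_get_peak_usage(true) . " bytes\n";
Because the peak never decreases within a request, run this and the hash_update_stream() loop in separate scripts and compare the two figures; since both sides stream, neither figure should grow with the file size.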
In terms of speed, hash_file() usually has a slight edge: its read-and-update loop runs entirely in C inside the hash extension, with no userland overhead per chunk.
By contrast, hash_update_stream() pays for a PHP-level loop (a feof() check and a function call per chunk), and that overhead grows with the number of chunks. The gap narrows as the chunk size increases, and for I/O-bound workloads it is often negligible; measure on your own hardware before assuming either is faster.
hash_update_stream() provides more flexibility. Developers control how much is read per call (via the $length argument, or the fread() size when using hash_update()) and can tune the chunk size to balance per-chunk overhead against buffer size. For large files, the chunk size can be adjusted to the server's characteristics to optimize throughput, as in the sketch below.
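As an illustration, a small helper with a configurable chunk size might look like the following; the name hash_large_file and its default chunk size are our own choices, not part of PHP:
function hash_large_file(string $algo, string $path, int $chunkSize = 65536): ?string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return null; // File could not be opened
    }
    $context = hash_init($algo);
    while (!feof($handle)) {
        hash_update_stream($context, $handle, $chunkSize); // Hash up to $chunkSize bytes per call
    }
    fclose($handle);
    return hash_final($context);
}
echo hash_large_file('sha256', 'largefile.txt', 1024 * 1024) . "\n"; // Try 1 MB chunks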
Use hash_file(): when you simply need the digest of a file on disk, hash_file() is the most concise option and is usually at least as fast, whatever the file size.
Use hash_init() with hash_update_stream(): when the data arrives as a stream you already hold open, when you need to control the chunk size or report progress, or when several inputs must be combined into one digest; this also suits memory-sensitive streaming scenarios. A hypothetical wrapper combining both recommendations follows.
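The wrapper below is our own illustration, not a standard API: it routes plain file paths to hash_file() and open resources to the incremental API (error handling omitted for brevity):
function hash_source(string $algo, $source)
{
    if (is_resource($source)) {
        $context = hash_init($algo);
        hash_update_stream($context, $source); // The default length of -1 reads the stream to EOF
        return hash_final($context);
    }
    return hash_file($algo, $source); // Plain path: the one-call API is simplest
}
echo hash_source('sha256', 'largefile.txt') . "\n";
echo hash_source('sha256', fopen('largefile.txt', 'rb')) . "\n";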
To further verify the performance differences between the two, we can test them with a simple piece of code.
$start = microtime(true);
$hash = hash_file('sha256', 'largefile.txt');
$end = microtime(true);
echo "hash_file took: " . ($end - $start) . " seconds.\n";
$start = microtime(true);
$hashContext = hash_init('sha256');
$handle = fopen('largefile.txt', 'rb');
while (!feof($handle)) {
    hash_update_stream($hashContext, $handle, 8192); // Read and hash 8 KB per iteration
}
$hash = hash_final($hashContext);
fclose($handle);
$end = microtime(true);
echo "hash_update_stream took: " . ($end - $start) . " seconds.\n";
By comparing the execution time of the two snippets we can see the difference in a concrete scenario. One caveat: the operating system caches file pages, so whichever snippet runs second benefits from a warm cache; for a fair comparison, run each snippet in isolation (or alternate their order) and average several runs.
For everyday use, hash_file() is simple, fast, and sufficient for most needs, regardless of file size.
For data that arrives as a stream, or when fine-grained control is needed (chunk size, progress reporting, combining inputs), hash_init() with hash_update_stream() is the better fit, including in memory-constrained streaming scenarios.
Choosing the right hashing method improves performance and keeps code clear. Both approaches keep memory usage low, so the deciding factor is usually control and flexibility rather than memory footprint.