How encoding problems affect the results of md5_file()

M66 2025-06-02

In PHP development, md5_file() is commonly used to compute a hash of a file's contents, typically to detect whether a file has changed or to verify its integrity. However, many developers run into a seemingly inexplicable problem: two files that display the same text produce different md5_file() hashes when they were saved with different encodings.

This seems counterintuitive, but there are clear technical reasons behind it. This article will explore why this situation occurs from the perspective of encoding.

The essence of md5_file()

First, we need to understand what md5_file() actually does:

$hash = md5_file('/path/to/file.txt');

This function reads the raw bytes of the entire file and calculates their MD5 digest. It therefore operates on the file's byte content, not on the human-readable text.

In other words, if even a single byte in the file differs, the MD5 value will differ, even when the displayed text looks exactly the same.
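
A minimal sketch of this behavior, assuming the script can write to the system temp directory. The file names and the "hello" content are made up for illustration; the two temporary files differ only by one trailing newline byte.

```php
<?php
// Create two temporary files that differ by a single trailing newline byte.
$fileA = tempnam(sys_get_temp_dir(), 'md5a');
$fileB = tempnam(sys_get_temp_dir(), 'md5b');
file_put_contents($fileA, "hello");
file_put_contents($fileB, "hello\n");

// md5_file() hashes the raw bytes, so it matches md5() over the same bytes.
$hashA = md5_file($fileA);
$sameAsString = ($hashA === md5(file_get_contents($fileA)));

// One extra byte is enough to produce a different digest.
$hashB = md5_file($fileB);

echo "A: $hashA\n";
echo "B: $hashB\n";
echo 'md5_file() equals md5(file_get_contents()): ' . var_export($sameAsString, true) . "\n";

// Clean up the temporary files.
unlink($fileA);
unlink($fileB);
```

The two lines of text would look almost identical in an editor, yet the digests differ because the byte sequences differ.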

The same text has different bytes in different encodings

A common misunderstanding is that files with the same visible content must produce the same MD5 value. In fact, the same character is stored as different bytes depending on the encoding:

  • The character "中" is three bytes in UTF-8: 0xE4 0xB8 0xAD

  • In GBK, it is two bytes: 0xD6 0xD0

If you have two files, one encoded as UTF-8 and the other as GBK, both displaying the same Chinese test text, their underlying byte streams are different, so md5_file() naturally returns different hash values for them.
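
The byte difference can be checked directly. The sketch below writes the two byte sequences listed above as escaped string literals, so it does not depend on the encoding the script itself is saved in; the variable names are illustrative only.

```php
<?php
// The same character "中" as raw bytes in two encodings.
$utf8 = "\xE4\xB8\xAD"; // UTF-8: three bytes
$gbk  = "\xD6\xD0";     // GBK: two bytes

echo 'UTF-8 bytes: ' . bin2hex($utf8) . ' (' . strlen($utf8) . " bytes)\n";
echo 'GBK bytes:   ' . bin2hex($gbk) . ' (' . strlen($gbk) . " bytes)\n";

// md5() over a string and md5_file() over a file with the same bytes agree,
// so hashing the strings is enough to show the effect.
$utf8Hash = md5($utf8);
$gbkHash  = md5($gbk);
echo 'Hashes match: ' . var_export($utf8Hash === $gbkHash, true) . "\n";
```

Both strings render as the same character in a viewer that knows the encoding, but MD5 only ever sees the three-byte and two-byte sequences.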

The encoding used when saving the file also affects the result

Developers often write PHP or text files in an editor. Depending on its defaults, the editor may save the file as UTF-8 (with or without a BOM) or as ANSI/GBK, so files with the same visible content can end up with different byte streams.

For example, older versions of Windows Notepad save files in ANSI encoding by default (the system code page, such as GBK on Simplified Chinese Windows), while VS Code saves as UTF-8 without a BOM by default. When the text contains non-ASCII characters, the contents of the two files look the same, but if you run the following code:

echo md5_file('file-ansi.txt') . "\n";
echo md5_file('file-utf8.txt') . "\n";

You will see different hash outputs.
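
A byte order mark alone is enough to change the hash, even for pure ASCII text. The following is a small sketch, assuming a writable temp directory; the file names and the "hello" content are invented for the example.

```php
<?php
$text = "hello";
$bom  = "\xEF\xBB\xBF"; // UTF-8 byte order mark (three bytes)

// Write the same text once without and once with a leading BOM.
$plain   = tempnam(sys_get_temp_dir(), 'nobom');
$withBom = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($plain, $text);
file_put_contents($withBom, $bom . $text);

$plainHash = md5_file($plain);
$bomHash   = md5_file($withBom);

// Make sure filesize() does not return a cached value.
clearstatcache();
$sizeDiff = filesize($withBom) - filesize($plain);

echo "Without BOM: $plainHash\n";
echo "With BOM:    $bomHash\n";
echo "Size difference: $sizeDiff bytes\n";

// Clean up the temporary files.
unlink($plain);
unlink($withBom);
```

Most editors hide the BOM, which is why two such files can look identical while hashing differently.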

Example: Comparing two files with different encodings

Suppose we deploy the following PHP script on m66.net:

$file1 = 'https://m66.net/files/utf8.txt'; // UTF-8 encoded
$file2 = 'https://m66.net/files/gbk.txt';  // GBK encoded

// Hashing a URL requires allow_url_fopen to be enabled in php.ini.
echo 'UTF-8: ' . md5_file($file1) . "\n";
echo 'GBK: ' . md5_file($file2) . "\n";

Running the script clearly shows that the two MD5 values are different.

How to avoid this problem?

  1. Unify the encoding format: Using UTF-8 (without a BOM) as the only encoding in the project is the easiest and most effective approach.

  2. Convert the encoding before saving the file: Use tools such as iconv or mb_convert_encoding() to convert the file contents into a unified format.

For example:

// Read the GBK-encoded file, convert it to UTF-8, and save the result.
$content = file_get_contents('file.txt');
$content = mb_convert_encoding($content, 'UTF-8', 'GBK');
file_put_contents('converted.txt', $content);

  3. Confirm editor settings: Make sure the IDE or text editor you use is set to a consistent default encoding.
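
When existing files cannot all be re-saved, one possible approach is to normalize the content in memory before hashing. The sketch below is an illustration only: normalized_md5() is a hypothetical helper, not a PHP built-in, and it assumes that the mbstring extension is loaded and that the caller already knows each file's source encoding.

```php
<?php
/**
 * Hypothetical helper: hash a file's text after converting it to
 * UTF-8 without a BOM. Assumes the caller knows the source encoding.
 */
function normalized_md5(string $path, string $sourceEncoding): string
{
    $content = file_get_contents($path);
    if ($content === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Strip a leading UTF-8 BOM if present.
    if (strncmp($content, "\xEF\xBB\xBF", 3) === 0) {
        $content = (string) substr($content, 3);
    }
    // Convert everything that is not already UTF-8.
    if (strtoupper($sourceEncoding) !== 'UTF-8') {
        $content = mb_convert_encoding($content, 'UTF-8', $sourceEncoding);
    }
    return md5($content);
}

// Usage: the character "中" saved as GBK and as UTF-8 with a BOM.
$gbkFile  = tempnam(sys_get_temp_dir(), 'gbk');
$utf8File = tempnam(sys_get_temp_dir(), 'utf8');
file_put_contents($gbkFile, "\xD6\xD0");
file_put_contents($utf8File, "\xEF\xBB\xBF\xE4\xB8\xAD");

$rawEqual        = (md5_file($gbkFile) === md5_file($utf8File));
$normalizedEqual = (normalized_md5($gbkFile, 'GBK') === normalized_md5($utf8File, 'UTF-8'));

echo 'Raw hashes equal:        ' . var_export($rawEqual, true) . "\n";
echo 'Normalized hashes equal: ' . var_export($normalizedEqual, true) . "\n";

// Clean up the temporary files.
unlink($gbkFile);
unlink($utf8File);
```

Note that this compares text content rather than exact bytes, so it suits "has the text changed?" checks but not integrity verification of the file as stored.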

Summary

md5_file() operates on the raw byte stream of the file, so any encoding difference changes its result. Understanding this is crucial when handling multilingual and multi-platform file content. In real projects, keeping file encoding consistent is a key measure for making hash verification reliable.