You Thought mb_eregi_replace Supports \p{Han}? In Reality, It Doesn't Understand Unicode Properties At All

M66 2025-06-12

When handling multibyte strings in PHP, we often rely on the mbstring extension for better Unicode compatibility. In particular, mb_ereg_replace and mb_eregi_replace are popular because they are documented as regex replacement functions with multibyte support. Many developers assume that these functions recognize Unicode properties such as \p{Han} in the same way PCRE does, and that they can therefore match Chinese characters accurately.

Unfortunately, this belief is incorrect.

mb_eregi_replace Actually Uses Oniguruma, Not PCRE

First of all, it is important to clarify that mb_ereg_replace and mb_eregi_replace run on the Oniguruma regex engine, not on PCRE (Perl Compatible Regular Expressions), so PCRE behavior cannot be taken for granted. Oniguruma itself can match Unicode properties, but what a pattern means depends on the regex encoding and syntax options in effect, which are set with mb_regex_encoding() and mb_regex_set_options(). The \p{Han} that works in preg_match is handled by PCRE, an entirely separate engine.

Here is a typical example of the misconception:

// Intent: remove the Han characters, keep Latin letters and digits.
$text = '这是 test 内容 123';
$result = mb_eregi_replace('\p{Han}+', '', $text);
echo $result;

You might expect this script to remove the Chinese characters while keeping the English text and digits. In reality, you cannot rely on that, because the expectation comes from PCRE and mb_eregi_replace is not PCRE. Where the property escape is not in effect, \p is not a property escape and {Han} carries no special meaning, so the pattern cannot match any Chinese characters.
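Rather than taking the behavior on trust, it can be checked directly on the installation in question. The following is a minimal diagnostic sketch, assuming the mbstring extension with regex support is loaded; it makes no claim about what the output will be, it only reports the active regex encoding, the active options, and whether the pattern changed the string.

```php
<?php
// Diagnostic sketch: report the mbstring regex configuration and what the
// pattern actually did on this installation. No particular result is assumed.
$text = '这是 test 内容 123';

// Called without arguments, both functions return the current setting.
$encoding = mb_regex_encoding();
$options  = mb_regex_set_options();

$result = mb_eregi_replace('\p{Han}+', '', $text);

echo "regex encoding: {$encoding}\n";
echo "regex options:  {$options}\n";
echo 'input:   ' . var_export($text, true) . "\n";
echo 'output:  ' . var_export($result, true) . "\n";
echo 'changed: ' . ($result !== $text ? 'yes' : 'no') . "\n";
```

Running this once in each environment (development, CI, production) shows whether they agree, which matters more than any general statement about the function.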

Truly Understanding Unicode Support: preg_replace + u Modifier is the Way to Go

To handle Unicode properties properly, use preg_replace with the u modifier, which tells PCRE to treat both the pattern and the subject string as UTF-8.

Here is a correct example:

$text = '这是 test 内容 123';
// The trailing u makes PCRE read the pattern and subject as UTF-8.
$result = preg_replace('/\p{Han}+/u', '', $text);
echo $result;

Output:

 test  123

This is the result we actually want.
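In real code the same call is usually wrapped so that failures are not silently ignored. Here is a minimal sketch of such a wrapper; strip_han() is a hypothetical helper name chosen for this example, not a PHP built-in, and throwing an exception is just one possible way to surface the error.

```php
<?php
/**
 * Remove runs of Han characters from a UTF-8 string.
 * strip_han() is a hypothetical helper name, not a PHP built-in.
 */
function strip_han(string $text): string
{
    $result = preg_replace('/\p{Han}+/u', '', $text);

    // With the u modifier, preg_replace() returns null when the subject
    // is not valid UTF-8 (or on another PCRE error), so check for that.
    if ($result === null) {
        throw new RuntimeException(
            'preg_replace failed, error code ' . preg_last_error()
        );
    }

    return $result;
}

echo strip_han('这是 test 内容 123'), "\n"; // prints " test  123"
```

Checking for null matters because a null result cast to a string becomes an empty string, which would look like "everything was removed" further down the pipeline.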

Why Has This Misconception Lasted So Long?

Many developers, upon seeing the "multibyte support" description for mb_eregi_replace, naturally assumed that it also supported matching Unicode properties, especially since many older tutorials or articles in Chinese communities did not clarify this point. For example, searching for mb_eregi_replace \p{Han} might yield vague or outdated explanations, leading one to believe it "works".

Furthermore, if your project already relies on mb_eregi_replace for regex handling, you are likely to encounter problems when working with Chinese or other Unicode character sets, leading to incomplete data filtering or logical errors, especially during tasks like text cleaning or data extraction.

What If You Must Use mb_eregi_replace?

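Honestly, it's best to avoid it. However, if you must use it for compatibility reasons, set the regex encoding to UTF-8 and match Chinese characters with an explicit code point range instead:

// The range only works if pattern and subject are read as UTF-8.
mb_regex_encoding('UTF-8');
$text = '这是 test 内容 123';
$result = mb_eregi_replace('[一-龥]+', '', $text);
echo $result;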

This approach is less precise: the range 一-龥 runs from U+4E00 to U+9FA5, which covers the core of the CJK Unified Ideographs block but misses Extension A (U+3400 to U+4DBF) and the extensions in the supplementary planes. It is still more reliable than blindly using \p{Han}. To improve accuracy, you could list several Chinese character ranges by hand, but that remains a workaround, not a true solution.
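To make the trade-off concrete, here is a minimal sketch that compares the single range with a widened character class. It assumes the mbstring regex functions are available, and the two ranges are an illustrative selection rather than a complete inventory of Han code points. The "\u{...}" escapes are ordinary PHP double-quoted string escapes, so the pattern reaches the regex engine as literal UTF-8 characters.

```php
<?php
// Sketch only: the ranges are illustrative, not a full list of Han blocks.
mb_regex_encoding('UTF-8');

// U+4E00-U+9FA5: the range used in the example above (一-龥)
$narrowClass = "[\u{4E00}-\u{9FA5}]+";

// U+3400-U+4DBF: CJK Unified Ideographs Extension A
// U+4E00-U+9FFF: CJK Unified Ideographs (whole block)
$wideClass = "[\u{3400}-\u{4DBF}\u{4E00}-\u{9FFF}]+";

// The first character is U+3400, which belongs to Extension A.
$text = "\u{3400}这是 test 内容 123";

// The single range leaves the Extension A character behind.
$narrow = mb_eregi_replace($narrowClass, '', $text);

// The widened class removes it as well.
$wide = mb_eregi_replace($wideClass, '', $text);

echo var_export($narrow, true), "\n";
echo var_export($wide, true), "\n";
```

Every range added this way has to be maintained by hand as Unicode grows, which is the main reason the property-based pattern in preg_replace is easier to live with.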

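A better solution is to move fully to preg_replace with the u modifier so you can use PCRE's Unicode support directly. Do not look to mbstring.func_overload for this: it never applied to the preg_* functions, was deprecated in PHP 7.2, and was removed in PHP 8.0. What matters instead is that the input is valid UTF-8, because with the u modifier preg_replace returns null for a malformed subject; mb_check_encoding($text, 'UTF-8') can verify that beforehand.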

Conclusion

Stop misusing mb_eregi_replace('\p{Han}', ...); it doesn't recognize the \p{} syntax at all. If you need to handle Unicode properties, the only reliable choice is preg_replace with the u modifier. This misconception has plagued many PHP developers for years, and it's time to set the record straight.