Problem description
I want to use str_word_count() on a UTF-8 string.
Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).
But on php.net there are a lot of people muddying the water by presenting their own 'multibyte compatible' versions of the function.
So I guess I want to know...
Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?
Are there any equivalent 'space' characters in UTF-8 which are not ASCII " " (space)?
I guess that is where the problem would lie.
Recommended answer
I'd say you guess right. There are indeed space characters in UTF-8 which are not part of US-ASCII. To give you an example of such a space:
- Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 bytes in UTF-8: 0xC2 0xA0 (c2a0)
And perhaps also:
- Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 bytes in UTF-8: 0xC2 0x85 (c285)
- Unicode Character 'LINE SEPARATOR' (U+2028): 3 bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
- Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
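The byte sequences of these characters can be checked directly. This is a minimal sketch, assuming PHP 7 or later for the "\u{...}" escape syntax; the array name and its labels are only illustrative:

```php
<?php
// Code points of the characters discussed here, written with PHP 7+
// "\u{...}" escapes, which PHP stores as UTF-8 encoded bytes.
$spaces = [
    'NO-BREAK SPACE (U+00A0)'      => "\u{00A0}",
    'NEXT LINE (U+0085)'           => "\u{0085}",
    'LINE SEPARATOR (U+2028)'      => "\u{2028}",
    'PARAGRAPH SEPARATOR (U+2029)' => "\u{2029}",
];

foreach ($spaces as $name => $char) {
    // bin2hex() prints the raw bytes of the UTF-8 encoding.
    echo $name, ': ', bin2hex($char), "\n";
}
```

None of these encodings contains the ASCII space byte 0x20, so code that only looks for " " will not see them as separators.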
Anyway, the first one, the 'NO-BREAK SPACE' (U+00A0), is a good example, as it is also part of the Latin-X charsets. And the PHP manual already hints that str_word_count is locale dependent.
If we want to put this to the test, we can set the locale to UTF-8 and pass in an invalid string containing a lone \xA0 byte. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, and hence not multibyte safe (a term left as undefined here as it is in the question):
<?php
/**
* is PHP str_word_count() multibyte safe?
* @link https://stackoverflow.com/q/8290537/367456
*/
echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n";
// A lone 0xA0 byte is not valid UTF-8; it is NO-BREAK SPACE only in Latin-X charsets.
$test = "aword\xA0bword aword";
$result = str_word_count($test, 2);
var_dump($result);
Output:
New Locale: en_US.utf8
array(3) {
[0]=>
string(5) "aword"
[6]=>
string(5) "bword"
[12]=>
string(5) "aword"
}
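That the test string is deliberately not valid UTF-8 can be confirmed separately. The sketch below relies on preg_match() returning false when the /u modifier is given malformed UTF-8; is_valid_utf8() is a hypothetical helper name, not a PHP built-in:

```php
<?php
// Hypothetical helper: with the /u modifier, preg_match() returns false on
// input that is not valid UTF-8, so an empty pattern acts as a validator.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$invalid = "aword\xA0bword aword";      // lone 0xA0 byte: not valid UTF-8
$valid   = "aword\u{00A0}bword aword";  // U+00A0 encoded as 0xC2 0xA0

var_dump(is_valid_utf8($invalid)); // bool(false)
var_dump(is_valid_utf8($valid));   // bool(true)
```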
As this demo shows, the function completely fails to honor the locale dependence promised on its manual page. (I neither wonder nor complain about this: most often, if you read that a function in PHP is locale specific, run for your life and find one that is not.) I exploit that here to demonstrate that the function does nothing whatsoever about the UTF-8 character encoding.
For UTF-8 you should instead take a look at the PCRE extension:
- Matching Unicode letter characters in PCRE/PHP
PCRE has a good understanding of Unicode, and in PHP of UTF-8 specifically. It can also be quite fast if you craft the regular expression pattern carefully.
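As a rough illustration of the PCRE approach, the sketch below counts words with a Unicode-aware pattern. The function name utf8_word_count() and the choice of what counts as a word (runs of letters and combining marks, optionally joined by an apostrophe or hyphen) are assumptions made for this example, not something defined by PHP or by the answer above:

```php
<?php
// Hypothetical helper: counts "words" in a UTF-8 string using PCRE.
// Assumption: a word is a run of Unicode letters (\p{L}) and combining
// marks (\p{M}), optionally joined by an apostrophe or a hyphen.
// Returns null if the input is not valid UTF-8 (preg_match_all() fails).
function utf8_word_count(string $text): ?int
{
    $pattern = '/[\p{L}\p{M}]+(?:[\'-][\p{L}\p{M}]+)*/u';
    $count = preg_match_all($pattern, $text);

    return $count === false ? null : $count;
}

// NO-BREAK SPACE (U+00A0) is not a letter, so it separates words.
var_dump(utf8_word_count("aword\u{00A0}bword aword")); // int(3)
// Non-ASCII letters stay inside their words.
var_dump(utf8_word_count("Füße über Straße"));         // int(3)
// A lone 0xA0 byte is invalid UTF-8 and is reported as null.
var_dump(utf8_word_count("aword\xA0bword"));           // NULL
```

Because the pattern matches the words themselves and never names the separators, it does not need a list of every space-like code point.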