
Is PHP str_word_count() multibyte safe?

Question

I want to use str_word_count() on a UTF-8 string.

Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).

But on php.net a lot of people muddy the waters by presenting their own 'multibyte compatible' versions of the function.

So I guess I want to know...

Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?

Are there any equivalent 'space' characters in UTF-8 which are not ASCII " " (space)?

I guess that is where the problem lies.

Recommended answer

I'd say you guess right. There are indeed space characters in UTF-8 which are not part of US-ASCII. To give you an example of such a space:

Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)

And perhaps as well:

Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
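
The byte sequences for these characters can be checked directly in PHP. The following is only a small sketch, assuming PHP 7.0 or later for the "\u{...}" escape syntax; the $spaces array is an illustrative name and not part of the original answer:

```php
<?php
// Illustrative list (hypothetical name): Unicode space-like characters
// written with the "\u{...}" escape, which PHP encodes as UTF-8.
$spaces = [
    'NO-BREAK SPACE'      => "\u{00A0}",
    'NEXT LINE (NEL)'     => "\u{0085}",
    'LINE SEPARATOR'      => "\u{2028}",
    'PARAGRAPH SEPARATOR' => "\u{2029}",
];

// Print each character's name, byte length and hex-encoded UTF-8 bytes.
foreach ($spaces as $name => $char) {
    printf("%-20s %d bytes: %s\n", $name, strlen($char), bin2hex($char));
}
```

strlen() counts bytes here, not characters, which is exactly why a byte-oriented function cannot recognise these sequences as single characters.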

Anyway, the first one, the 'NO-BREAK SPACE' (U+00A0), is a good example because it is also part of the Latin-X charsets, where it is the single byte 0xA0. And the PHP manual already hints that str_word_count is locale dependent.

If we want to put this to a test, we can set the locale to UTF-8 and pass in an invalid string containing a lone \xA0 byte. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, hence not multibyte safe (taking 'multibyte safe' in the same loosely defined sense as the question):

<?php
/**
 * is PHP str_word_count() multibyte safe?
 * @link https://stackoverflow.com/q/8290537/367456
 */

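// Switch to a UTF-8 locale and print the locale string that was set.
echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";

// A lone \xA0 byte is not valid UTF-8; it is a NO-BREAK SPACE only in Latin-X charsets.
$test   = "aword\xA0bword aword";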
$result = str_word_count($test, 2);

var_dump($result);

Output:

New Locale: en_US.utf8

array(3) {
  [0]=>
  string(5) "aword"
  [6]=>
  string(5) "bword"
  [12]=>
  string(5) "aword"
}
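
To see why the test string counts as invalid UTF-8, its validity can be checked separately. This is a minimal sketch, assuming a hypothetical helper is_valid_utf8() (not part of the original answer) that relies on preg_match() returning false for a malformed subject under the /u modifier:

```php
<?php
// Hypothetical helper (not from the original answer).
// With the /u modifier, preg_match() returns false when the subject is
// not valid UTF-8, so an empty pattern works as a validity check.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

var_dump(is_valid_utf8("aword\xA0bword aword"));     // lone 0xA0 byte
var_dump(is_valid_utf8("aword\xC2\xA0bword aword")); // U+00A0 encoded as UTF-8
```

The first call should report false and the second true: only the two-byte sequence 0xC2 0xA0 is a NO-BREAK SPACE in UTF-8, yet str_word_count() split the word on the lone byte.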

As this demo shows, the function totally fails on the locale promise its manual page makes (I neither wonder nor moan about this; most often, if you read that a function is locale specific in PHP, run for your life and find one that is not). I exploit that here to demonstrate that it does nothing whatsoever regarding the UTF-8 character encoding.

For UTF-8, you should take a look at the PCRE extension instead:

Matching Unicode letter characters in PCRE/PHP

PCRE has a good understanding of Unicode and UTF-8, in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
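
As a rough illustration of the PCRE route, a UTF-8 aware word count could look like the sketch below. The function name utf8_word_count() is hypothetical, and the definition of a "word" (a run of Unicode letters and combining marks) is an assumption made for this example, not something the original answer specifies:

```php
<?php
// Hypothetical helper (not from the original answer): counts runs of
// Unicode letters (\p{L}) and combining marks (\p{M}) in a UTF-8 string.
function utf8_word_count(string $text): int
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('/[\p{L}\p{M}]+/u', $text);

    // preg_match_all() returns false on error, e.g. malformed UTF-8 input.
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    return $count;
}

// The words are separated by a real UTF-8 NO-BREAK SPACE and an ASCII space.
echo utf8_word_count("aword\u{00A0}bword aword"), "\n";
```

Unlike str_word_count(), this pattern does not treat apostrophes or hyphens as part of a word, so "don't" would count as two words; the character class would need extending if that matters.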
