Problem description
Assuming UTF-8 encoding and PHP's strlen(), is it possible for this string to have a length of 4?
I'm only interested in strlen(), not other functions.
Here is the string:
$1ï¿½2
I tested it on my own computer, verified the UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen(), or in anything I've read about UTF-8, that would explain why some of the characters above would count for less than one.
PS: This question and its answer (4) come from a ZCE mock test I bought on eBay.
Recommended answer
The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, inverted question mark, one-half fraction, digit two).
If strlen() were called with a UTF-8 representation of that string, you would get a result of nine: the three ASCII characters take one byte each and the three non-ASCII characters take two bytes each. (Probably, that is; there are multiple representations with different lengths.)
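As a rough check of the byte count, here is a minimal sketch. It assumes the three non-ASCII characters are stored in their precomposed forms (U+00EF, U+00BF, U+00BD), and it writes them as explicit byte escapes so the result does not depend on how the source file is saved. The variable name is only for illustration.

```php
<?php
// Assumed UTF-8 bytes for the six characters: $ 1 ï ¿ ½ 2
//   ï = C3 AF, ¿ = C2 BF, ½ = C2 BD (two bytes each)
$utf8 = "\$1\xC3\xAF\xC2\xBF\xC2\xBD2";

// strlen() counts bytes, not characters: 3 ASCII bytes + 3 * 2 bytes.
echo strlen($utf8), "\n"; // 9
```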
However, if we were to store that string as ISO 8859-1 or CP1252, we would have a six-byte sequence that is also legal UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit two). That is, the UTF-8 encoding of the single character '�' (U+FFFD) is identical to the ISO-8859-1 encoding of the three characters "ï¿½".
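The reinterpretation can be sketched in PHP as follows. This is only an illustration under a few assumptions: PHP 7.0 or later (for the "\u{...}" escape), and preg_match_all() with the /u modifier standing in as a simple UTF-8 character counter. The variable names are hypothetical.

```php
<?php
// ISO-8859-1 bytes for the six characters $ 1 ï ¿ ½ 2
//   ï = EF, ¿ = BF, ½ = BD (one byte each)
$bytes = "\$1\xEF\xBF\xBD2";

// strlen() counts bytes, so it reports 6 for this sequence.
echo strlen($bytes), "\n"; // 6

// The same bytes are exactly the UTF-8 encoding of "$1", U+FFFD, "2".
var_dump($bytes === "\$1\u{FFFD}2"); // bool(true)

// Counting UTF-8 characters (not bytes) with a /u regex gives 4.
$chars = preg_match_all('/./us', $bytes);
echo $chars, "\n"; // 4
```

Note that strlen() itself still returns 6 here; the count of 4 only appears when the bytes are decoded as UTF-8 characters.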
The replacement character is often inserted when a UTF-8 decoder reads data that is not valid UTF-8.
It appears that the original string was processed through multiple layers of misinterpretation: first by a UTF-8 decoder applied to non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).