Problem description
I have this code to decode numeric HTML entities to the equivalent UTF-8 character.
I'm trying to convert this character reference:
&#146;
which should output:
’
However, it just disappears (no output). (I've checked the source of the page, and it has the correct UTF-8 charset headers/meta tags.)
Does anyone know what is wrong with the code?
function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {
    $string = html_entity_decode($string, $quote_style, $charset);
    $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr_utf8("\1")', $string);
    //this is another method, which also doesn't work..
    //$string = preg_replace_callback("/(&#[0-9]+;)/", "entity_decode_callback", $string);
    return $string;
}
function chr_utf8_callback($matches) {
    return chr_utf8(hexdec($matches[1]));
}
function chr_utf8($num) {
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}
function entity_decode_callback($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}
echo '=' . entity_decode('&#146;');
Recommended answer
html_entity_decode already does what you need:
$string = '&#146;';
echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');
It will return the character:
’ (binary hex: c292)
Which is PRIVATE USE TWO (U+0092), a C1 control character. As it's private use, your PHP configuration/version/compile might not return it at all.
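Because a control character has no visible glyph, it can help to look at the raw bytes rather than at what the browser renders. The following is a minimal sketch; `dump_decoded` is a hypothetical helper name, not something from the original post, and what it prints for `&#146;` may differ between PHP versions and builds.

```php
<?php
// Hypothetical debugging helper: decode the entities in $input and
// return the resulting bytes as a hex string, so invisible characters
// (or references that were left undecoded) become easy to spot.
function dump_decoded(string $input): string
{
    $decoded = html_entity_decode($input, ENT_COMPAT, 'UTF-8');
    return bin2hex($decoded);
}

// Inspect the reference from the question; compare the result with the
// "c292" mentioned above to see what your PHP build actually returns.
echo dump_decoded('&#146;'), "\n";
```

If the hex output is the reference itself spelled out byte by byte, the function left it undecoded; if it is empty, the character was dropped.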
There are also some quirks:
But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.
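The browser behaviour described above can be imitated in PHP by reading the number in such a reference as a Windows-1252 byte. This is only a sketch under assumptions: `decode_cp1252_references` is a hypothetical name, the mbstring extension is assumed to be available, and the five byte values that cp1252 leaves unassigned are simply left untouched.

```php
<?php
// Hypothetical helper: rewrite numeric references in the range 128-159
// the way browsers treat them, by interpreting the number as a
// Windows-1252 byte and converting that byte to UTF-8.
function decode_cp1252_references(string $html): string
{
    return preg_replace_callback(
        '~&#(?:x([0-9a-f]+)|([0-9]+));~i',
        function (array $m): string {
            // Group 1 holds hex digits, group 2 holds decimal digits.
            $code = ($m[1] !== '') ? (int) hexdec($m[1]) : (int) $m[2];

            // Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no cp1252 character.
            $unassigned = [129, 141, 143, 144, 157];
            if ($code < 128 || $code > 159 || in_array($code, $unassigned)) {
                return $m[0]; // not covered by the quirk: leave as-is
            }

            return mb_convert_encoding(chr($code), 'UTF-8', 'Windows-1252');
        },
        $html
    );
}

// The reference from the question becomes U+2019 (right single quotation mark).
echo decode_cp1252_references('&#146;'), "\n";
```

Running this before `html_entity_decode` means the remaining references are still handled by the built-in function, and only the 128 to 159 range gets the cp1252 treatment.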
See: "&#146; is getting converted as \u0092 by nokogiri in ruby on rails"