Problem description
I am trying to convert all types of smart quotes to regular quotes when working with text. However, the following function I've compiled still seems to be lacking support and proper design.
Does anyone know how to properly get all quote characters converted?
function convert_smart_quotes($string)
{
    $quotes = array(
        "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
        "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
        "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
        "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
        "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
        "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
        "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
        "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
        "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
        "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
        "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
        "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
    );
    $string = strtr($string, $quotes);

    // Version 2
    $search = array(
        chr(145),
        chr(146),
        chr(147),
        chr(148),
        chr(151)
    );
    $replace = array("'", "'", '"', '"', ' - ');
    $string = str_replace($search, $replace, $string);

    // Version 3
    $string = str_replace(
        array('‘', '’', '“', '”'),
        array("'", "'", '"', '"'),
        $string
    );

    // Version 4
    $search = array(
        '‘',
        '’',
        '“',
        '”',
        '—',
        '–',
    );
    $replace = array("'", "'", '"', '"', ' - ', '-');
    $string = str_replace($search, $replace, $string);

    return $string;
}
Note: This question is a complete query about the full gamut of quotes, including the "Microsoft" quotes asked about here. It is a "duplicate" only in the same way that asking about all tire sizes is a "duplicate" of asking for a car tire size.
Recommended answer
You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):
$chr_map = array(
    // Windows codepage 1252
    "\xC2\x82" => "'", // U+0082 -> U+201A single low-9 quotation mark
    "\xC2\x84" => '"', // U+0084 -> U+201E double low-9 quotation mark
    "\xC2\x8B" => "'", // U+008B -> U+2039 single left-pointing angle quotation mark
    "\xC2\x91" => "'", // U+0091 -> U+2018 left single quotation mark
    "\xC2\x92" => "'", // U+0092 -> U+2019 right single quotation mark
    "\xC2\x93" => '"', // U+0093 -> U+201C left double quotation mark
    "\xC2\x94" => '"', // U+0094 -> U+201D right double quotation mark
    "\xC2\x9B" => "'", // U+009B -> U+203A single right-pointing angle quotation mark

    // Regular Unicode     // U+0022 quotation mark (")
                           // U+0027 apostrophe     (')
    "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
    "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
    "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
    "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
    "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
    "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
    "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
    "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
    "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
    "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
    "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
    "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$chr = array_keys  ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));
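One way to follow the "pre-calculate these two arrays" advice is to wrap the replacement in a function that builds the arrays only once. The sketch below is an illustration, not the answer's own code: the function name `straighten_quotes` is hypothetical, and the map is a shortened subset of `$chr_map`.

```php
<?php
// Hypothetical wrapper: the key/value arrays are built on the first call only.
function straighten_quotes(string $str): string
{
    static $chr = null, $rpl = null;
    if ($chr === null) {
        // Shortened subset of the quote map, for illustration.
        $map = array(
            "\xC2\xAB"     => '"', // U+00AB
            "\xC2\xBB"     => '"', // U+00BB
            "\xE2\x80\x98" => "'", // U+2018
            "\xE2\x80\x99" => "'", // U+2019
            "\xE2\x80\x9C" => '"', // U+201C
            "\xE2\x80\x9D" => '"', // U+201D
        );
        $chr = array_keys($map);
        $rpl = array_values($map);
    }
    // Decode entities such as &ldquo; first so they are caught by the map too.
    return str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
}

echo straighten_quotes("&ldquo;It\xE2\x80\x99s here&rdquo;"), "\n";
```

Keep in mind that html_entity_decode() also decodes entities such as &lt; and &amp;, so the result needs escaping again before it is written back into HTML.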
Here is the background:
Every Unicode character belongs to exactly one "General Category", of which the ones that can contain quote characters are the following:
Ps: "Punctuation, Open"
Pe: "Punctuation, Close"
Pi: "Punctuation, Initial quote (may behave like Ps or Pe depending on usage)"
Pf: "Punctuation, Final quote (may behave like Ps or Pe depending on usage)"
Po: "Punctuation, Other"
(these pages are handy for checking that you didn't miss anything - there is also an index of categories)
It is sometimes useful to match these categories in a Unicode-enabled regex.
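As a rough illustration of category matching, the sketch below collects the distinct Pi and Pf characters found in a string using PCRE's `\p{..}` escapes with the `/u` modifier. The helper name `find_initial_final_quotes` is hypothetical. Note that these two categories do not line up exactly with "quote characters": the low-9 marks U+201A and U+201E, for instance, belong to Ps rather than Pi or Pf.

```php
<?php
// Hypothetical helper: return the distinct Pi/Pf characters present in $str.
function find_initial_final_quotes(string $str): array
{
    // \p{Pi} = initial quote punctuation, \p{Pf} = final quote punctuation.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    preg_match_all('/[\p{Pi}\p{Pf}]/u', $str, $matches);
    return array_values(array_unique($matches[0]));
}

$sample = "\xE2\x80\x9Chi\xE2\x80\x9D and \xC2\xABsalut\xC2\xBB";
print_r(find_initial_final_quotes($sample));
```

This is mainly useful for auditing input, for example to find characters that a hand-written map does not cover yet.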
Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, these are not accessible in a regex.
In Wikipedia you can find the group of characters with the Quotation_Mark property. The definitive reference is PropList.txt on unicode.org, but this is an ASCII text file.
In case you need to translate CJK characters too, you only have to get their code points, decide their translation, and find their UTF-8 encoding, e.g., by looking it up in fileformat.info (e.g., for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).
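On PHP 7.0 or later, the UTF-8 bytes can also be produced directly from the code point with the `"\u{...}"` string escape, which avoids looking them up by hand. The sketch below assumes that escape is available; the selection of CJK marks and their ASCII replacements are illustrative choices, and `$cjk_map` and `convert_cjk_quotes` are hypothetical names.

```php
<?php
// Hypothetical extension map for a few CJK quotation marks.
// Which replacement to use for each mark is a judgment call.
$cjk_map = array(
    "\u{300C}" => '"', // U+300C left corner bracket
    "\u{300D}" => '"', // U+300D right corner bracket
    "\u{300E}" => '"', // U+300E left white corner bracket
    "\u{300F}" => '"', // U+300F right white corner bracket
    "\u{301D}" => '"', // U+301D reversed double prime quotation mark
    "\u{301E}" => '"', // U+301E double prime quotation mark
    "\u{301F}" => '"', // U+301F low double prime quotation mark
);

function convert_cjk_quotes(string $str, array $map): string
{
    return strtr($str, $map);
}

// Show the UTF-8 bytes that PHP generated for U+301E.
echo bin2hex("\u{301E}"), "\n";
echo convert_cjk_quotes("\u{300C}text\u{300D}", $cjk_map), "\n";
```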
Regarding Windows codepage 1252: Unicode defines the first 256 code points to represent exactly the same characters as ISO-8859-1, but ISO-8859-1 is often confused with Windows codepage 1252, so that all browsers render the range 0x80-0x9F, which is "empty" in ISO-8859-1 (more exactly: it contains control characters), as if it were Windows codepage 1252. The table in the Wikipedia page lists the Unicode equivalents.
Note: strtr() is often slower than str_replace(). Time it with your input and PHP version. If it is fast enough, you can use a map like my $chr_map directly.
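A minimal way to time the two approaches against each other might look like the following. The map is a small sample, and `replace_with_strtr`, `replace_with_str_replace`, and `time_replacement` are hypothetical helper names; actual timings depend on your input and PHP version, so no figures are shown.

```php
<?php
// Small sample map; substitute the full map when measuring for real.
$sample_map = array(
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
);

// strtr() accepts the map as-is.
function replace_with_strtr(string $str, array $map): string
{
    return strtr($str, $map);
}

// str_replace() needs separate search and replacement arrays.
function replace_with_str_replace(string $str, array $map): string
{
    return str_replace(array_keys($map), array_values($map), $str);
}

// Hypothetical timing helper: seconds spent on $iterations calls of $fn.
function time_replacement(callable $fn, string $str, array $map, int $iterations = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn($str, $map);
    }
    return microtime(true) - $start;
}

$input = str_repeat("\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\x9D ", 100);
printf("strtr:       %.4f s\n", time_replacement('replace_with_strtr', $input, $sample_map));
printf("str_replace: %.4f s\n", time_replacement('replace_with_str_replace', $input, $sample_map));
```

Both functions return the same result for this kind of map, because none of the replacement strings contains a character that is itself a key.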
If you are not sure that your input is UTF-8 encoded, AND are willing to assume that if it's not, then it's ISO-8859-1 or Windows codepage 1252, then you can do this before anything else:
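if (!preg_match('/^\X*$/u', $str)) {
    // Not valid UTF-8: treat the input as ISO-8859-1 and convert it.
    // utf8_encode() is deprecated as of PHP 8.2;
    // mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1') is the equivalent call.
    $str = utf8_encode($str);
}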
Warning: this regex can in very rare cases fail to detect a non-UTF-8 encoding, though. E.g.: "Gruß…"/*CP-1252*/=="Gru\xDF\x85" looks like UTF-8 to this regex (U+07C5 is the N'ko digit 5). This regex can be slightly enhanced, but unfortunately it can be shown that there exists NO completely foolproof solution to the problem of encoding detection.
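The same check can be sketched with the mbstring extension, assuming it is installed. This is an alternative to the regex, not the method used above, and `to_utf8_assuming_cp1252` is a hypothetical name. Converting from Windows-1252 instead of ISO-8859-1 maps the bytes 0x80-0x9F straight to their proper Unicode code points. The false negative described in the warning still applies, as the second call shows.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise assume Windows-1252.
// Requires the mbstring extension.
function to_utf8_assuming_cp1252(string $str): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str; // already well-formed UTF-8 (or merely looks like it)
    }
    return mb_convert_encoding($str, 'UTF-8', 'Windows-1252');
}

// "Gru\xDF" is not valid UTF-8, so it is converted.
echo bin2hex(to_utf8_assuming_cp1252("Gru\xDF")), "\n";

// "Gru\xDF\x85" happens to be valid UTF-8 (U+07C5), so it is left unchanged.
echo bin2hex(to_utf8_assuming_cp1252("Gru\xDF\x85")), "\n";
```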
If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode codepoints, you can do this (and remove the first part of the $chr_map above):
$normalization_map = array(
    "\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
    "\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
    "\xC2\x83" => "\xC6\x92",     // U+0192 latin small letter f with hook
    "\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
    "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
    "\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
    "\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
    "\xC2\x88" => "\xCB\x86",     // U+02C6 modifier letter circumflex accent
    "\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
    "\xC2\x8A" => "\xC5\xA0",     // U+0160 latin capital letter s with caron
    "\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
    "\xC2\x8C" => "\xC5\x92",     // U+0152 latin capital ligature oe
    "\xC2\x8E" => "\xC5\xBD",     // U+017D latin capital letter z with caron
    "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
    "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
    "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
    "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
    "\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
    "\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
    "\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
    "\xC2\x98" => "\xCB\x9C",     // U+02DC small tilde
    "\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
    "\xC2\x9A" => "\xC5\xA1",     // U+0161 latin small letter s with caron
    "\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
    "\xC2\x9C" => "\xC5\x93",     // U+0153 latin small ligature oe
    "\xC2\x9E" => "\xC5\xBE",     // U+017E latin small letter z with caron
    "\xC2\x9F" => "\xC5\xB8",     // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys  ($normalization_map); // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);
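To see where the "\xC2\x80"-"\xC2\x9F" sequences come from, the sketch below takes a string of Windows-1252 bytes, converts it as if it were ISO-8859-1, and then repairs the result with a three-entry subset of the normalization map. It assumes the mbstring extension is available, and the variable names are made up for the example.

```php
<?php
// Three entries from the normalization map, for illustration only.
$c1_subset = array(
    "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
    "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
    "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
);

// Windows-1252 bytes for: left double quote, "Hi", ellipsis, right double quote.
$cp1252_bytes = "\x93Hi\x85\x94";

// Treating the bytes as ISO-8859-1 turns 0x93, 0x85 and 0x94 into the
// C1 control code points U+0093, U+0085 and U+0094.
$misdecoded = mb_convert_encoding($cp1252_bytes, 'UTF-8', 'ISO-8859-1');

// The normalization map turns those into the characters that were intended.
$normalized = strtr($misdecoded, $c1_subset);

echo bin2hex($misdecoded), "\n"; // c2934869c285c294
echo bin2hex($normalized), "\n"; // e2809c4869e280a6e2809d
```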