Problem description
I would like to make sure that everything I know about UTF-8 is correct. I have been trying to use UTF-8 for a while now, but I keep stumbling across more and more bugs and other weird things that make it seem almost impossible to have a 100% UTF-8 site. There is always a gotcha somewhere that I seem to miss. Perhaps someone here can correct my list or OK it so I don't miss anything important.
Database
Every site has to store its data somewhere. No matter what your PHP settings are, you must also configure the DB. If you can't access the config files, then make sure to run "SET NAMES 'utf8'" as soon as you connect. Also, make sure to use utf8_unicode_ci on all of your tables. This assumes MySQL as the database; you will have to adapt it for others.
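As a rough sketch of pinning the connection charset at connect time (PHP 7+ syntax): the helper name `utf8_mysql_dsn`, the host, and the database name below are made up for illustration. The `charset` DSN parameter is a standard part of PDO's MySQL driver; the commented-out `SET NAMES` call is the fallback described above.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that pins the connection charset.
function utf8_mysql_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = utf8_mysql_dsn('localhost', 'example_db');

// Usage (not executed here; it needs a real server and credentials):
// $pdo = new PDO($dsn, $user, $password);
// $pdo->exec("SET NAMES 'utf8'"); // only needed if the DSN charset cannot be used
```

Whatever charset you pick, it should match the one used by the tables, so that nothing is converted on the way in or out.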
Regular expressions
I do a LOT of regex work that is more complex than your average search-and-replace. I have to remember to use the "/u" modifier so that PCRE doesn't corrupt my strings. Yet even then, there are apparently still problems.
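A small sketch of what the "/u" modifier changes, using byte escapes so the example does not depend on the encoding of the source file. The variable names are just for illustration.

```php
<?php
// "é" encoded as UTF-8 is two bytes: 0xC3 0xA9.
$e = "\xC3\xA9";

// Without /u, "." matches one byte at a time, so "é" counts as two matches.
$bytes = preg_match_all('/./', $e, $m1);

// With /u, "." matches one whole code point.
$chars = preg_match_all('/./u', $e, $m2);

// With /u, a subject that is not valid UTF-8 makes the call fail:
// preg_match() returns false rather than 0 or 1.
$result = preg_match('/./u', "abc\xFF");
$error  = preg_last_error();
```

So a `false` return from a "/u" pattern is worth checking for explicitly: it signals bad input, not "no match".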
String functions
All of the default string functions (strlen(), strpos(), etc.) should be replaced with the Multibyte String Functions, which look at characters instead of bytes.
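To illustrate the difference, here is a minimal comparison, assuming the mbstring extension is loaded. The encoding is passed explicitly because the default internal encoding can differ between installations.

```php
<?php
// "naïve" in UTF-8: 5 characters, 6 bytes (ï is 0xC3 0xAF).
$word = "na\xC3\xAFve";

$byteLength = strlen($word);                 // counts bytes
$charLength = mb_strlen($word, 'UTF-8');     // counts characters

// substr() can cut a multi-byte character in half...
$byteCut = substr($word, 0, 3);
// ...while mb_substr() cuts on character boundaries.
$charCut = mb_substr($word, 0, 3, 'UTF-8');
```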
Headers
You should make sure that your server returns the correct header so the browser knows which charset you are trying to use (just as you must tell MySQL).
header('Content-Type: text/html; charset=utf-8');
It is also a good idea to put the correct <meta> tag in the page head, though the actual header will override it should the two differ.
<meta http-equiv="Content-Type" content="text/html;charset=utf-8">
Questions
Do I need to convert everything that I receive from the user agent (HTML forms and URIs) to UTF-8 when the page loads, or can I just leave the strings/values as they are and still run them through these functions without a problem?
If I do need to convert everything to UTF-8, then what steps should I take? mb_detect_encoding seems to be built for this, but I keep seeing people complain that it doesn't always work. mb_check_encoding also seems to have a problem telling a good UTF-8 string from a malformed one.
Does PHP store strings differently in memory depending on the encoding in use (like a file type), or are they still stored as regular strings with some characters interpreted differently (like &amp; vs & in HTML)? chazomaticus answered this question:
In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of.
If I give a non-UTF-8 string to an mb_* function, will it ever cause a problem?
If a UTF string is improperly encoded, will something go wrong (like a parsing error in a regex?) or will it just mark an entity as bad (html)? Is there ever a chance that improperly encoded strings will result in a function returning FALSE because the string is bad?
I have heard that you should also mark your forms as UTF-8 (accept-charset="UTF-8"), but I am not sure what the benefit is.
Was UTF-16 written to address a limit in UTF-8? Like, did UTF-8 run out of space for characters? (Y2(UTF)k?)
Functions
Here are a couple of the custom PHP functions I have found, but I don't have any way to verify that they actually work. Perhaps someone has an example that I can use. First is convertToUTF8() and then seems_utf8 from WordPress.
function seems_utf8($str) {
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0;                 # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n = 1;   # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n = 2;   # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n = 3;   # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n = 4;   # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n = 5;   # 1111110b
        else return false;                     # Does not match any model
        for ($j = 0; $j < $n; $j++) {          # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}
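function is_utf8($str) {
    $c = 0; $b = 0;
    $bits = 0;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        // Any byte >= 128 is non-ASCII (the original test "> 128" let a stray 0x80 through).
        if ($c >= 128) {
            // $bits holds the total length of the sequence in bytes, lead byte included.
            if (($c >= 254)) return false;
            elseif ($c >= 252) $bits = 6;
            elseif ($c >= 248) $bits = 5;
            elseif ($c >= 240) $bits = 4;
            elseif ($c >= 224) $bits = 3;
            elseif ($c >= 192) $bits = 2;
            else return false;                      // continuation byte without a lead byte
            if (($i + $bits) > $len) return false;  // sequence runs past the end of the string
            while ($bits > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false; // not a 10bbbbbb continuation byte
                $bits--;
            }
        }
    }
    return true;
}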
If anyone is interested, I found a great example page to use when testing UTF-8.
Recommended answer
Do I need to convert everything that I receive from the user agent (HTML forms and URIs) to UTF-8 when the page loads
No. The user agent should be submitting data in UTF-8 format; if it isn't, you are losing the benefit of Unicode.
The way to ensure a user agent submits in UTF-8 is to serve the page that contains the form in UTF-8. Use the Content-Type header (and meta http-equiv too, if you intend the form to be saved and work standalone).
I have heard that you should also mark your forms as UTF-8 (accept-charset="UTF-8")
Don't. It was a nice idea in the HTML standard, but IE never got it right. It was supposed to state an exclusive list of allowable charsets, but IE treats it as a list of additional charsets to try, on a per-field basis. So if you have an ISO-8859-1 page and an accept-charset="UTF-8" form, IE will first try to encode a field as ISO-8859-1, and only if there's a non-8859-1 character in there will it resort to UTF-8.
But since IE does not tell you whether it has used ISO-8859-1 or UTF-8, that's of absolutely no use to you. You would have to guess, for each field separately, which encoding was in use! Not useful. Omit the attribute and serve your pages as UTF-8; that's the best you can do at the moment.
If a UTF string is improperly encoded will something go wrong
If you let such a sequence get through to the browser, you could be in trouble. There are 'overlong sequences', which encode a low-numbered code point in a longer sequence of bytes than is necessary. This means that if you are filtering '<' by looking for that ASCII character in a sequence of bytes, you could miss one and let a script element into what you thought was safe text.
Overlong sequences were banned back in the early days of Unicode, but it took Microsoft a very long time to get their shit together: IE would interpret the byte sequence '\xC0\xBC' as a '<' up until IE6 Service Pack 1. Opera also got it wrong up to (about, I think) version 7. Luckily these older browsers are dying out, but it's still worth filtering overlong sequences in case those browsers are still about now (or new idiot browsers make the same mistake in future). You can do this, and fix other bad sequences, with a regex that allows only proper UTF-8 through, such as the one published by the W3C.
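A sketch of the kind of regex meant here, written for PHP 7+ and modeled on the well-known W3C-style pattern for validating UTF-8 byte by byte. The function name `is_strict_utf8` is made up, and the possessive quantifier (`*+`) is my own addition to avoid backtracking on long input; very long strings may still need testing against your PCRE limits.

```php
<?php
// Byte-level UTF-8 validator. There is deliberately no /u modifier:
// the pattern inspects raw bytes, so it also works on malformed input.
function is_strict_utf8(string $bytes): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII (tab, LF, CR, printable)
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z/x';

    return preg_match($pattern, $bytes) === 1;
}

$plainOk    = is_strict_utf8('<');                // ordinary ASCII
$overlongLt = is_strict_utf8("\xC0\xBC");         // overlong encoding of "<"
$surrogate  = is_strict_utf8("\xED\xA0\x80");     // encoded surrogate U+D800
$emojiOk    = is_strict_utf8("\xF0\x9F\x98\x80"); // U+1F600, a valid 4-byte sequence
```

Input that fails the check can be rejected outright, which is simpler and safer than trying to repair it.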
If you are using the mb_* functions in PHP, you might be insulated from these issues. I can't say for sure, as mb_* was unusably fragile when I was still writing PHP.
In any case, this is also a good time to remove control characters, which are a large and generally unappreciated source of bugs. I would remove chars 9 and 13 from submitted strings in addition to the others the W3 regex takes out; it is also worth removing plain newlines from strings you know didn't come from multiline textboxes.
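The control-character cleanup could look roughly like this in PHP 7+. The function name `strip_control_chars` and the `$multiline` flag are made up for illustration; treating DEL (0x7F) as a control character is an assumption on my part, in line with the printable-ASCII range ending at 0x7E.

```php
<?php
// Removes ASCII control characters from submitted text.
// Bytes below 0x20 (and 0x7F) never occur inside a multi-byte UTF-8
// sequence, so a byte-level pattern cannot damage valid UTF-8.
function strip_control_chars(string $text, bool $multiline = false): string
{
    $pattern = $multiline
        ? '/[\x00-\x09\x0B-\x1F\x7F]/'  // keep LF (0x0A); drop tab, CR and the rest
        : '/[\x00-\x1F\x7F]/';          // single-line field: drop LF as well

    return preg_replace($pattern, '', $text);
}

$textarea = strip_control_chars("a\tb\r\nc", true); // multiline input keeps "\n"
$oneLine  = strip_control_chars("a\tb\r\nc");       // single-line input loses it
```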
Was UTF-16 written to address a limit in UTF-8?
No, UTF-16 is an encoding that uses two bytes per code point for the Basic Multilingual Plane (and a four-byte surrogate pair for anything beyond it). It is used to make indexing Unicode strings easier in memory; it dates from the days when all of Unicode would fit in two bytes, and systems like Windows and Java still do it that way. Unlike UTF-8 it is not compatible with ASCII, and it is of little-to-no use on the Web. But you occasionally meet it in saved files, usually ones saved by Windows users who have been misled by Windows's description of UTF-16LE as "Unicode" in Save As menus.
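A short illustration of both points, assuming the mbstring extension is available: ASCII text changes when encoded as UTF-16, and a character outside the Basic Multilingual Plane takes four bytes rather than two. The same call is what you would use to convert such a saved file back to UTF-8.

```php
<?php
// ASCII "A" becomes two bytes in UTF-16LE, so the result is not ASCII-compatible.
$ascii = mb_convert_encoding('A', 'UTF-16LE', 'UTF-8');

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
// surrogate pair (two 16-bit units, four bytes) to represent it.
$emojiUtf8  = "\xF0\x9F\x98\x80";
$emojiUtf16 = mb_convert_encoding($emojiUtf8, 'UTF-16LE', 'UTF-8');

// Converting UTF-16LE input back to UTF-8 is lossless.
$roundTrip = mb_convert_encoding($emojiUtf16, 'UTF-8', 'UTF-16LE');
```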
seems_utf8
This is very inefficient compared to the regex!
Also, make sure to use utf8_unicode_ci on all of your tables.
You can actually sort of get away without this, treating MySQL as a store for nothing but bytes and only interpreting them as UTF-8 in your script. The advantage of using utf8_unicode_ci is that it will collate (sort and do case-insensitive compares) with knowledge about non-ASCII characters, so that, for example, an accented letter and its uppercase form compare as the same character. If you use a non-UTF8 collation you should stick to binary (case-sensitive) matching.
Whichever you choose, do it consistently: use the same character set for your tables as you do for your connection. What you want to avoid is a lossy character set conversion between your scripts and the database.