美女黄在线观看,在线免费色,日韩精品电影在线观看

本文介紹了確保 PHP 中的有效 UTF-8的處理方法，對大家解決問題具有一定的參考價值，需要的朋友們下面隨著小編來一起學習吧！

問題描述

我使用 PHP 來處理來自各種來源的文本.我不認為它會是 UTF-8 以外的任何東西，ISO 8859-1 或 Windows-1252.如果不是其中之一，我只需要確保文本變成有效的 UTF-8 字符串，即使字符丟失.iconv 的//TRANSLIT 選項可以解決這個問題嗎?

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?

例如，此代碼能否確保將字符串安全地插入到 UTF-8 編碼的文檔(或數據庫)中?

For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    }
    else {
        return $string;
    }
}

推薦答案

UTF-8 可以存儲任何 Unicode 字符.如果您的編碼完全不同，包括 ISO-8859-1 或 Windows-1252，則 UTF-8 可以存儲其中的每個字符.因此，當您將字符串從任何其他編碼轉換為 UTF-8 時，您不必擔心丟失任何字符.

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

此外，ISO-8859-1 和 Windows-1252 都是單字節編碼，其中任何字節都是有效的.在技??術上無法區分它們.我會選擇 Windows-1252 作為非 UTF-8 序列的默認匹配，因為唯一解碼不同的字節是范圍 0x80-0x9F.這些解碼為各種字符，如 Windows-1252 中的智能引號和歐元，而在 ISO-8859-1 中，它們是幾乎從未使用過的不可見控制字符.網絡瀏覽器有時可能會說他們使用的是 ISO-8859-1，但通常他們真的會使用 Windows-1252.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would chose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.

此代碼是否可以確保將字符串安全地插入到 UTF-8 編碼的文檔中

would this code ensure that a string is safe to insert into a UTF-8 encoded document

為此，您當然希望將可選的strict"參數設置為 TRUE.但我不確定這是否真的涵蓋了所有無效的 UTF-8 序列.該函數不要求明確檢查字節序列的 UTF-8 有效性.已知有 mb_detect_encoding 之前會錯誤地猜測 UTF-8 的情況，但我不知道在嚴格模式下是否仍然會發生這種情況.

You would certainly want to set the optional ‘strict’ parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.

如果您想確定，請使用 W3 推薦的正則表達式:

If you want to be sure, do it yourself using the W3-recommended regex:

if (preg_match('%^(?:
      [x09x0Ax0Dx20-x7E]            # ASCII
    | [xC2-xDF][x80-xBF]             # non-overlong 2-byte
    | xE0[xA0-xBF][x80-xBF]         # excluding overlongs
    | [xE1-xECxEExEF][x80-xBF]{2}  # straight 3-byte
    | xED[x80-x9F][x80-xBF]         # excluding surrogates
    | xF0[x90-xBF][x80-xBF]{2}      # planes 1-3
    | [xF1-xF3][x80-xBF]{3}          # planes 4-15
    | xF4[x80-x8F][x80-xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);

這篇關于確保 PHP 中的有效 UTF-8的文章就介紹到這了，希望我們推薦的答案對大家有所幫助，也希望大家多多支持html5模板網！

【網站聲明】本站部分內容來源于互聯網,旨在幫助大家更快的解決問題，如果有圖片或者內容侵犯了您的權益，請聯系我們刪除處理，感謝您的支持！

久久久久久久av_日韩在线中文_看一级毛片视频_日本精品二区_成人深夜福利视频_武道仙尊动漫在线观看

確保 PHP 中的有效 UTF-8

問題描述

推薦答案

相關文檔推薦