Replacing invalid UTF-8 characters with question marks: mbstring.substitute_character seems to be ignored

Question

I would like to replace invalid UTF-8 characters with question marks (PHP 5.3.5).

So far I have this solution, but invalid characters are removed instead of being replaced by '?'.

function replace_invalid_utf8($str)
{
  return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

echo mb_substitute_character()."\n";

echo replace_invalid_utf8('éééaaaàààee??')."\n";
echo replace_invalid_utf8('eeeaaaaaaee??')."\n";

It should output:

63 // ASCII code for '?' character
???aaa???eé // or ??aa??eé
eeeaaaaaaeeé

But it currently outputs:

63
aaaee // removed invalid characters
eeeaaaaaaeeé

Any suggestions?

Would you do it another way (using preg_replace(), for example)?

Thanks.

Recommended answer

You can use mb_convert_encoding(), or the ENT_SUBSTITUTE option of htmlspecialchars() (available since PHP 5.4). Of course, you can use preg_match() too. If you use intl, you can use UConverter (available since PHP 5.5).

The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.

When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).
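To make the effect of this setting concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and a PHP version in which mb_convert_encoding() honours the setting (the question above reports that PHP 5.3.5 dropped the bytes instead). The helper name demo_substitute() and the sample bytes are illustrative and not part of the original answer.

```php
<?php
// Hypothetical helper: converts $str with a temporary substitute character
// and restores the previous setting afterwards.
function demo_substitute($str, $substitute)
{
    // Returns an int code point, or a mode string such as "none".
    $previous = mb_substitute_character();

    mb_substitute_character($substitute);
    $result = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);

    return $result;
}

// 0x80 is a lone trail byte, so it is never valid on its own.
$input = "a\x80b";

var_dump(bin2hex(demo_substitute($input, 0x3F)));   // substitute with '?'
var_dump(bin2hex(demo_substitute($input, 0xFFFD))); // substitute with U+FFFD
var_dump(bin2hex(demo_substitute($input, 'none'))); // drop the invalid byte
```

Restoring the previous value matters because mb_substitute_character() changes a process-wide setting that affects every later mbstring conversion.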

// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);

function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
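As a usage sketch for the htmlspecialchars() variant, the example below restates replace_invalid_byte_sequence2() so that it runs on its own, and feeds it a string that contains markup, an entity, quotes and a lone trail byte. The sample input is an assumption made for illustration; ENT_SUBSTITUTE requires PHP 5.4 or later.

```php
<?php
// Same body as replace_invalid_byte_sequence2() above, repeated here
// only so that the example is self-contained.
function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}

$input    = "<b>caf\x80</b> &amp; \"tea\"";
$expected = "<b>caf\xEF\xBF\xBD</b> &amp; \"tea\"";

// The invalid byte becomes U+FFFD; markup, entity and quotes survive
// the encode/decode round trip unchanged.
var_dump(replace_invalid_byte_sequence2($input) === $expected);
```

The round trip works because every '&' in the input is escaped before decoding, so entities that were already present come back exactly as they were.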

UConverter offers both a procedural and an object-oriented API.

function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest-form vulnerability. The range of the trail bytes changes depending on the range of the lead byte.

lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80(or 0x90 or 0xA0) - 0xBF(or 0x8F)
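The lead-byte ranges above determine how many bytes a sequence is expected to have. A minimal sketch of that lookup follows; the function name utf8_expected_length() is hypothetical, and it inspects only the lead byte, not the trail bytes.

```php
<?php
// Hypothetical helper: expected sequence length for a lead byte,
// or 0 when the byte cannot start a well-formed UTF-8 sequence.
function utf8_expected_length($byte)
{
    $code = ord($byte);

    if ($code <= 0x7F) {
        return 1; // 00 - 7F: ASCII
    }
    if ($code >= 0xC2 && $code <= 0xDF) {
        return 2; // C2 - DF
    }
    if ($code >= 0xE0 && $code <= 0xEF) {
        return 3; // E0 - EF
    }
    if ($code >= 0xF0 && $code <= 0xF4) {
        return 4; // F0 - F4
    }

    return 0;     // 80 - C1 and F5 - FF are never valid lead bytes
}

foreach (["\x41", "\xC2", "\xE2", "\xF0", "\x80", "\xC0", "\xF5"] as $byte) {
    printf("%02X => %d\n", ord($byte), utf8_expected_length($byte));
}
```

C0 and C1 are rejected because they could only introduce non-shortest-form encodings of ASCII characters, which is the vulnerability mentioned above.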

You can refer to the following resources to check the byte ranges.

"Syntax of UTF-8 Byte Sequences" in RFC 3629
"Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.1
"Multilingual form encoding" in W3C Internationalization

The table of byte ranges is as follows.

      Code Points    First Byte Second Byte Third Byte Fourth Byte
  U+0000 -   U+007F   00 - 7F
  U+0080 -   U+07FF   C2 - DF    80 - BF
  U+0800 -   U+0FFF   E0         A0 - BF     80 - BF
  U+1000 -   U+CFFF   E1 - EC    80 - BF     80 - BF
  U+D000 -   U+D7FF   ED         80 - 9F     80 - BF
  U+E000 -   U+FFFF   EE - EF    80 - BF     80 - BF
 U+10000 -  U+3FFFF   F0         90 - BF     80 - BF    80 - BF
 U+40000 -  U+FFFFF   F1 - F3    80 - BF     80 - BF    80 - BF
U+100000 - U+10FFFF   F4         80 - 8F     80 - BF    80 - BF
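The table translates fairly directly into a validity check. The sketch below is an illustration rather than part of the original answer: the function name is_well_formed_utf8() is hypothetical, and because it relies on preg_match() it may report false for very long strings if PCRE hits one of its limits.

```php
<?php
// Hypothetical helper: true when $str consists only of the byte
// sequences listed in the table above.
function is_well_formed_utf8($str)
{
    $regex = '/\A(?:
        [\x00-\x7F]
      | [\xC2-\xDF][\x80-\xBF]
      |  \xE0[\xA0-\xBF][\x80-\xBF]
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
      |  \xED[\x80-\x9F][\x80-\xBF]
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}
      | [\xF1-\xF3][\x80-\xBF]{3}
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*+\z/x';

    // preg_match() returns false on failure, which is treated as "not valid".
    return preg_match($regex, $str) === 1;
}

var_dump(is_well_formed_utf8("\xE2\x82\xAC")); // U+20AC
var_dump(is_well_formed_utf8("\xC0\x80"));     // non-shortest form of U+0000
var_dump(is_well_formed_utf8("\xED\xA0\x80")); // surrogate U+D800
```

The possessive quantifier (*+) is a design choice to keep the engine from backtracking into already accepted characters; each alternative starts with a distinct lead-byte range, so nothing is lost by it.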

How to replace an invalid byte sequence without breaking valid characters is described in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in the Unicode Standard.

The Unicode Standard shows an example:

before: <61    F1 80 80  E1 80  C2    62    80    63    80    BF    64  >
after:  <0061  FFFD      FFFD   FFFD  0062  FFFD  0063  FFFD  FFFD  0064>

Here is an implementation using preg_replace_callback() that follows the rule above.

function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
      ([\x00-\x7F]                       #   U+0000 -   U+007F
      |[\xC2-\xDF][\x80-\xBF]            #   U+0080 -   U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]       #   U+0800 -   U+0FFF
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} #   U+1000 -   U+CFFF
      | \xED[\x80-\x9F][\x80-\xBF]       #   U+D000 -   U+D7FF
      | \xF0[\x90-\xBF][\x80-\xBF]{2}    #  U+10000 -  U+3FFFF
      |[\xF1-\xF3][\x80-\xBF]{3}         #  U+40000 -  U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2})   # U+100000 - U+10FFFF
      |(\xE0[\xA0-\xBF]                  #   U+0800 -   U+0FFF (invalid)
      |[\xE1-\xEC\xEE\xEF][\x80-\xBF]    #   U+1000 -   U+CFFF (invalid)
      | \xED[\x80-\x9F]                  #   U+D000 -   U+D7FF (invalid)
      | \xF0[\x90-\xBF][\x80-\xBF]?      #  U+10000 -  U+3FFFF (invalid)
      |[\xF1-\xF3][\x80-\xBF]{1,2}       #  U+40000 -  U+FFFFF (invalid)
      | \xF4[\x80-\x8F][\x80-\xBF]?)     # U+100000 - U+10FFFF (invalid)
      |(.)                               # invalid 1-byte
    /xs';

    // $matches[1]: valid character
    // $matches[2]: invalid 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte

    $ret = preg_replace_callback($regex, function($matches) use($substitute) {

        if (isset($matches[2]) || isset($matches[3])) {

            return $substitute;

        }
    
        return $matches[1];

    }, $str);

    return $ret;
}
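For comparison, the same substitution rule can be sketched without a callback by using PCRE backtracking verbs: a well-formed sequence is matched and then skipped with (*SKIP)(*FAIL), so only ill-formed spans reach the replacement. This is an alternative formulation, not the author's code; the function name replace_invalid_utf8_skip() is hypothetical, and it assumes a PCRE build that supports these verbs.

```php
<?php
// Hypothetical variant of the regex approach using (*SKIP)(*FAIL).
function replace_invalid_utf8_skip($str)
{
    $regex = '/
      (?: [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        |  \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        |  \xED[\x80-\x9F][\x80-\xBF]
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}
      ) (*SKIP)(*FAIL)                  # well-formed: leave untouched
      |  \xE0[\xA0-\xBF]                # truncated 3-byte sequences
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]
      |  \xED[\x80-\x9F]
      |  \xF0[\x90-\xBF][\x80-\xBF]?    # truncated 4-byte sequences
      | [\xF1-\xF3][\x80-\xBF]{1,2}
      |  \xF4[\x80-\x8F][\x80-\xBF]?
      | .                               # any other single byte
    /xs';

    // Every ill-formed span is replaced by one U+FFFD (EF BF BD).
    return preg_replace($regex, "\xEF\xBF\xBD", $str);
}

// Input taken from the Unicode Standard example quoted above.
$before = "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64";
var_dump(bin2hex(replace_invalid_utf8_skip($before)));
```

Like the callback version, this still goes through PCRE, so preg_replace() can return null on very large inputs if an internal limit is reached.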

This way you can compare bytes directly and avoid preg_match's restriction on byte size.

function replace_invalid_byte_sequence6($str) {

    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';

    $pos = 0;
    // Filled in by reference on each call to utf8_get_next_char()
    $char = null;
    $char_size = null;
    $valid = null;

    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }

    return $ret;
}

function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
{
    $valid = false;

    if ($str_size <= $pos) {
        return false;
    }

    if ($str[$pos] < "\x80") {
        // 00 - 7F: single-byte (ASCII) character
        $valid = true;
        $char_size = 1;
    } else if ($str[$pos] < "\xC2") {
        // 80 - C1: stray trail byte or non-shortest-form lead byte
        $char_size = 1;
    } else if ($str[$pos] < "\xE0") {
        // C2 - DF: 2-byte sequence
        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {
            $char_size = 1;
        } else {
            $valid = true;
            $char_size = 2;
        }
    } else if ($str[$pos] < "\xF0") {
        // E0 - EF: 3-byte sequence; E0 and ED restrict the second byte
        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else {
            $valid = true;
            $char_size = 3;
        }
    } else if ($str[$pos] < "\xF5") {
        // F0 - F4: 4-byte sequence; F0 and F4 restrict the second byte
        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {
            $char_size = 3;
        } else {
            $valid = true;
            $char_size = 4;
        }
    } else {
        // F5 - FF: never valid in UTF-8
        $char_size = 1;
    }

    // Consume either the valid character or the ill-formed prefix
    $char = substr($str, $pos, $char_size);
    $pos += $char_size;

    return true;
}

Here are the test cases.

function run(array $callables, array $arguments)
{
    return array_map(function($callable) use($arguments) {
         return array_map($callable, $arguments);
    }, $callables);
}
    
$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64",

    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];

var_dump(run([
    'replace_invalid_byte_sequence', 
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));

Note that mb_convert_encoding has a bug: it breaks a valid character that comes just after an invalid byte sequence, or it removes an invalid byte sequence that follows valid characters without adding U+FFFD.

$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];

Although preg_match() can be used instead of preg_replace_callback(), that function has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.

str_repeat('a', 10000)
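When experimenting with this limit, it helps to check preg_last_error() after each call, because preg_match() returns false rather than raising an error when PCRE gives up. The sketch below is only an illustration: the pattern, the helper name match_report() and the input sizes are assumptions, and whether the long input actually fails depends on the PHP version, the PCRE JIT setting and the pcre.backtrack_limit and pcre.recursion_limit directives.

```php
<?php
// Hypothetical helper: runs a repeated-group match and reports whether
// PCRE completed it without hitting an internal limit.
function match_report($str)
{
    // A repeated group is the kind of pattern that can exhaust PCRE limits.
    $result = preg_match('/\A(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF])*\z/', $str);

    return [
        'result' => $result,                              // 1, 0, or false on failure
        'ok'     => preg_last_error() === PREG_NO_ERROR,  // false if a limit was hit
    ];
}

var_dump(match_report('abc'));
var_dump(match_report(str_repeat('a', 10000)));
```

A result of 0 with 'ok' set to true means the string simply did not match, whereas false with 'ok' set to false means the check itself could not be completed.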

Finally, the results of my benchmark are as follows.

mb_convert_encoding()
0.19628190994263
htmlspecialchars()
0.082863092422485
UConverter::transcode()
0.15999984741211
UConverter::convert()
0.29843020439148
preg_replace_callback()
0.63967490196228
direct comparison
0.71933102607727

Here is the benchmark code.

function timer(array $callables, array $arguments, $repeat = 10000)
{

    $ret = [];
    $save = $repeat;

    foreach ($callables as $key => $callable) {

        $start = microtime(true);

        do {
    
            array_map($callable, $arguments);

        } while($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;

    }

    return $ret;
}

$functions = [
    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
    'direct comparison' => 'replace_invalid_byte_sequence6'
];

foreach (timer($functions, $data) as $description => $time) {

    echo $description, PHP_EOL,
         $time, PHP_EOL;

}
