Problem description
I have a file containing Unicode characters on a server running Linux. If I SSH into the server and use tab completion to navigate to the file/folder containing Unicode characters, I have no problem accessing it. The problem arises when I try to access the file via PHP (the function I was using to access the file system was stat). If I output the path generated by the PHP script to the browser and paste it into the terminal, the file also seems to exist (even though, looking at the terminal, the file paths are exactly the same).
I set PHP to use UTF-8 as its default encoding via php.ini and also set mb_internal_encoding. I checked the encoding of the PHP file path string and it comes out as UTF-8, as it should. Poking around a bit more, I decided to hexdump the é character that the terminal's tab completion produces and compare it to the hexdump of the 'regular' é character created by the PHP script or entered manually via the keyboard (option+e, then e, on OS X). Here is the result:
echo -n é | hexdump
0000000 cc65 0081
0000003

echo -n é | hexdump
0000000 a9c3
0000002
The é character that allows a correct file reference in the terminal is the 3-byte one. I'm not sure where to go from here. What encoding should I use in PHP? Should I be converting the path to another encoding via iconv or mb_convert_encoding?
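The two hexdumps above correspond to two Unicode representations of the same visible character, both valid UTF-8. As a minimal sketch (in Python, since that is the language used later in this post; the variable names are only illustrative), the difference can be reproduced like this:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# Both are valid UTF-8, but the byte sequences differ.
print(precomposed.encode("utf-8").hex())  # c3a9
print(decomposed.encode("utf-8").hex())   # 65cc81

# A plain string comparison treats them as different strings...
print(precomposed == decomposed)  # False

# ...while normalizing both to the same form makes them compare equal.
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

hexdump prints 16-bit words in little-endian order by default, which is why the bytes 65 cc 81 appear as "cc65 0081" and c3 a9 appears as "a9c3". So the issue is not the encoding (both are UTF-8) but which sequence of code points is being encoded.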
Thanks to the tips given in the two answers, I was able to poke around and find some methods for normalizing the different Unicode decompositions of a given character. In the situation I was faced with, I was accessing files created by an OS X Carbon application. It is a fairly popular application, and its file names seemed to adhere to a specific Unicode decomposition.
In PHP 5.3 a new set of functions was introduced that allows you to normalize a Unicode string to a particular form. Apparently there are four normalization forms (NFC, NFD, NFKC and NFKD) that a Unicode string can be converted to. Python has had Unicode normalization capabilities since version 2.3 via unicodedata.normalize. This article on Python's handling of Unicode strings was helpful in understanding encoding and string handling a bit better.
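To make the four forms a little more concrete, here is a small sketch that prints the code points each form produces. The sample string and the code_points helper are my own illustration, not something from the answers:

```python
import unicodedata

def code_points(text):
    """Return the code points of `text` as a list of 'U+XXXX' labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Sample input: precomposed é followed by the "fi" ligature (U+FB01).
sample = "\u00e9\ufb01"

forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, value in forms.items():
    print(form, code_points(value))
# NFC  ['U+00E9', 'U+FB01']
# NFD  ['U+0065', 'U+0301', 'U+FB01']
# NFKC ['U+00E9', 'U+0066', 'U+0069']
# NFKD ['U+0065', 'U+0301', 'U+0066', 'U+0069']
```

The "K" forms also replace compatibility characters such as the ligature, which changes the text more aggressively, so for file names NFC and NFD are probably the ones worth trying first.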
Here is a quick example of normalizing a Unicode file path in Python:
import unicodedata
filePath = unicodedata.normalize('NFD', filePath)
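If you do not know in advance which form the names on disk use, one option is to try a few forms and keep the first one that exists. This is only a sketch under that assumption; find_existing_path and its forms parameter are hypothetical names, not part of any library:

```python
import os
import unicodedata

def find_existing_path(path, forms=("NFC", "NFD")):
    """Return the first variant of `path` that exists on disk, or None.

    The path is tried as given first, then normalized to each form in
    `forms`. Hypothetical helper for illustration only.
    """
    candidates = [path] + [unicodedata.normalize(form, path) for form in forms]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    return None
```

For example, find_existing_path("/srv/files/café.txt") would return the decomposed spelling if that is how the file was stored. Note that this normalizes every path component, which is fine when the whole tree was created by the same application but may not be when directories use mixed forms.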
I found that the NFD form worked for all my purposes. I wonder whether it is the standard decomposition for Unicode filenames.