Problem description
Short question: How do I automatically detect whether a CSV file has headers in the first row?
Details: I've written a small CSV parsing engine that places the data into an object I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV files with a predictable format, but I'd like to be able to use this code more generally.
I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).
I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares about three particularly bad cases in which:
- The headers contain numeric data for some reason
- The first few rows (or large portions of the CSV) are empty
- The headers and the data look too similar to tell apart
If I can get a "best guess" and have the parser fail with an error or spit out a warning when it can't decide, that's OK. If this is going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me), I'll happily scrap the idea and go back to working on "important things".
I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.
Recommended answer
As others have pointed out, you can't do this with 100% reliability. There are cases where getting it "mostly right" is useful, however. For example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that would tend to indicate the first line isn't a header:
- The first row has columns that are not strings or are empty
- The first row's columns are not all unique
- The first row appears to contain dates or other common data formats (e.g., xx-xx-xx)
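The heuristics above could be sketched roughly as follows. This is only an illustration under stated assumptions, written in Python for brevity; the logic is language-agnostic and should port to PHP without much trouble. The names `header_objections` and `guess_has_header` are hypothetical, and the two regular expressions are assumed stand-ins for "looks numeric" and "looks like a date" that you would tune for your own data.

```python
import csv
import io
import re

# Assumed patterns for "looks like data rather than a column name".
NUMBER_RE = re.compile(r"^[+-]?\d+([.,]\d+)?$")
DATE_RE = re.compile(r"^\d{1,4}[-/.]\d{1,2}[-/.]\d{1,4}$")


def header_objections(row):
    """Return the names of the heuristics that argue against `row` being a header."""
    cells = [cell.strip() for cell in row]
    reasons = []
    if any(cell == "" for cell in cells):
        reasons.append("empty cell")
    if any(NUMBER_RE.match(cell) for cell in cells):
        reasons.append("numeric cell")
    if len(set(cells)) != len(cells):
        reasons.append("duplicate cells")
    if any(DATE_RE.match(cell) for cell in cells):
        reasons.append("date-like cell")
    return reasons


def guess_has_header(text):
    """Best guess for CSV text: True (header), False (no header),
    or None when there is no non-empty row to inspect."""
    for row in csv.reader(io.StringIO(text)):
        # Skip leading rows that are blank or contain only empty cells.
        if any(cell.strip() for cell in row):
            return not header_objections(row)
    return None
```

A caller might treat `None` as the "can't decide" case and raise an error or emit a warning there. This sketch only inspects the first non-empty row, so it is cheap, but it cannot tell a header from data when both are plain, unique strings.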