Problem description
Hey all. I have an MSSQL 2008 database with a fair number of rows. As of now, before a new row is inserted into the table, the stored procedure checks whether that record already exists in the database (by checking a column labeled Title). This check is exact, so if the to-be-inserted record is slightly different from an existing row (an approximate match), it gets inserted as a new row instead of updating the existing one. What I would like to do is somehow detect approximate duplicates in the table before inserting. So a new record that is to be inserted:
The quick brown fox jumps over the lazy dog
would approximately match:
Quick brown fox jumps over the lazy dog
if this record exists in the table already. I've seen (and used for other situations) the Levenshtein Distance algorithm implemented in T-SQL, but I'm not sure it can be applied in my case, because the algorithm needs a pair of input strings and I only have the new one. How are members of the community handling things of this sort? Thanks.
Recommended answer
Full-Text Search (FTS) is your best bet here. Running Levenshtein over any non-trivially sized corpus of text soon becomes problematic, because the new string has to be compared against every existing row and each comparison is itself expensive. Levenshtein distance, SOUNDEX and similar techniques are more commonly used for character-based discrepancies than for word-based ones. Assuming the words are at least spelled correctly, FTS would be a better fit. I can also imagine a two-tiered approach: use FTS to identify likely match candidates, then perform finer-grained matching over the filtered set. If you really want to go to town, one of the best-performing structures for searching text is the trie, but it is tricky to implement in tables and works better as an in-memory data structure. A word-based n-gram solution might also be worth investigating.
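To make the two-tiered idea a bit more concrete, here is a rough sketch in Python rather than T-SQL. It is only an illustration under some assumptions: a small in-memory word index stands in for Full-Text Search, and the function names (`build_index`, `find_near_duplicates`) and the thresholds `min_shared` and `max_distance` are made up for this example, not taken from SQL Server. Tier 1 keeps only titles that share most of their words with the new title; tier 2 runs Levenshtein on that short list instead of on the whole table.

```python
from collections import defaultdict


def tokenize(title):
    """Lowercase a title and split it into words on whitespace."""
    return title.lower().split()


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[-1]


def build_index(titles):
    """Map each word to the set of row ids whose title contains it.

    This is a stand-in for a full-text index (an assumption of the sketch).
    """
    index = defaultdict(set)
    for row_id, title in enumerate(titles):
        for word in set(tokenize(title)):
            index[word].add(row_id)
    return index


def find_near_duplicates(new_title, titles, index,
                         min_shared=0.6, max_distance=5):
    """Return (row_id, distance) pairs for likely near-duplicates.

    min_shared and max_distance are illustrative thresholds only.
    """
    words = set(tokenize(new_title))
    if not words:
        return []

    # Tier 1: count how many of the new title's words each row shares.
    shared = defaultdict(int)
    for word in words:
        for row_id in index.get(word, ()):
            shared[row_id] += 1
    candidates = [row_id for row_id, n in shared.items()
                  if n / len(words) >= min_shared]

    # Tier 2: character-level comparison on the filtered set only.
    matches = []
    for row_id in candidates:
        distance = levenshtein(new_title.lower(), titles[row_id].lower())
        if distance <= max_distance:
            matches.append((row_id, distance))
    return sorted(matches, key=lambda m: m[1])


titles = [
    "Quick brown fox jumps over the lazy dog",
    "An unrelated title about databases",
    "The quick brown fox",
]
index = build_index(titles)
print(find_near_duplicates("The quick brown fox jumps over the lazy dog",
                           titles, index))
# [(0, 4)]
```

In a real database the tier 1 step would be a full-text query returning a ranked handful of rows, and only those rows would be passed to the T-SQL Levenshtein function you already have. That way the "pair of input strings" is simply the new title and each candidate in turn.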