Problem description
I'm looking to implement fuzzy search for a small PHP/MySQL application. Specifically, I have a database with about 2,400 records (records are added at a rate of about 600 per year, so it's a small database). The three fields of interest are street address, last name and date. I want to be able to search by one of those fields and have tolerance for spelling/character errors. For example, an address of "123 Main Street" should also match "123 Main St", "123 Main St.", "123 Mian St", "123 Man St", "132 Main St", etc., and likewise for name and date.
The main issues I have with answers to other similar questions:
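- It is impossible to define synonyms for every possible misspelling, let alone for dates and names.
- Lucene etc. seem very heavyweight for such a limited search set (call it a maximum of 5,000 records, 3 fields per record).
- Just using wildcards doesn't seem logical given all the possible spelling errors.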
Any suggestions? I know it isn't going to be possible to do natively with MySQL, but since the data set is so limited, I'd like to keep it relatively simple... perhaps a PHP class that gets all of the records from the DB, uses some sort of comparison algorithm, and returns the IDs of the similar records?
Thanks, Jason
Recommended answer
Razzie's answer (or using Damerau–Levenshtein) ranks a list of candidate matches according to their closeness to the search key. (Take care: if the key is "12 Main St" then "13 Main St" has the same typing distance as "12 Moin St", but you might want to rank it low or even exclude it, as with "11 Main St", "22 Main St", etc.)
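As a rough illustration of that ranking step, here is a minimal sketch using PHP's built-in `levenshtein()` (plain Levenshtein, not Damerau–Levenshtein, so a transposition counts as two edits). The function name `rankAddresses`, the double weight on house-number differences and the cut-off of 3 are all assumptions made for the example, not part of the original answer.

```php
<?php
// Hypothetical helper: split "12 Main St" into a house number and the rest.
function splitAddress(string $address): array
{
    $address = strtolower(trim($address));
    if (preg_match('/^(\d+)\s+(.*)$/', $address, $m)) {
        return [$m[1], $m[2]];
    }
    return ['', $address];
}

// Hypothetical helper: rank candidates (id => address) by distance to $key.
// Returns id => distance, closest first, dropping anything above $maxDistance.
function rankAddresses(string $key, array $candidates, int $maxDistance = 3): array
{
    [$keyNumber, $keyStreet] = splitAddress($key);
    $ranked = [];
    foreach ($candidates as $id => $address) {
        [$number, $street] = splitAddress($address);
        // Assumption: a wrong house number is worse than a misspelt street,
        // so edits in the number count double.
        $distance = levenshtein($keyStreet, $street)
                  + 2 * levenshtein($keyNumber, $number);
        if ($distance <= $maxDistance) {
            $ranked[$id] = $distance;
        }
    }
    asort($ranked); // sort by distance, keeping the ids as keys
    return $ranked;
}

$matches = rankAddresses('12 Main St', [
    1 => '12 Main St',
    2 => '12 Moin St',
    3 => '13 Main St',
    4 => '99 Elm Rd',
]);
```

With this weighting "12 Moin St" ranks ahead of "13 Main St", and an unrelated address falls outside the cut-off. Both the weight and the cut-off would need tuning against real data.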
But how do you select a list of candidates of a manageable size to rank?
One way is to compute the metaphone value (or values, using double-metaphone) for each word in the strings you are going to search. Save each of these metaphones in another table, together with the id of the row containing the original string. You can then search these metaphone values quickly with LIKE 'key%', where key is the metaphone of a word from the search text.
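The metaphone lookup described above could be sketched roughly as follows. To keep the example runnable without a database, the lookup table is modelled as a PHP array and the prefix test stands in for `LIKE 'key%'`. The function names and the table and column names in the comment are hypothetical, and PHP's built-in `metaphone()` is single metaphone (double-metaphone would need a separate library).

```php
<?php
// Hypothetical helper: metaphone codes for each alphabetic word in $text.
// Digits are dropped, so house numbers and dates need separate handling.
function metaphoneKeys(string $text): array
{
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $codes = [];
    foreach ($words as $word) {
        $code = metaphone($word);
        if ($code !== '') {
            $codes[] = $code;
        }
    }
    return array_values(array_unique($codes));
}

// Build the rows of the lookup table: one (code, id) pair per word.
// In MySQL this could be a table such as metaphones(code, record_id)
// with an index on code (names are assumptions).
function buildIndex(array $records): array
{
    $index = [];
    foreach ($records as $id => $text) {
        foreach (metaphoneKeys($text) as $code) {
            $index[] = ['code' => $code, 'id' => $id];
        }
    }
    return $index;
}

// In-memory stand-in for:
//   SELECT DISTINCT record_id FROM metaphones WHERE code LIKE 'key%'
function findCandidates(array $index, string $search): array
{
    $ids = [];
    foreach (metaphoneKeys($search) as $key) {
        foreach ($index as $row) {
            if (strncmp($row['code'], $key, strlen($key)) === 0) {
                $ids[$row['id']] = true;
            }
        }
    }
    return array_keys($ids);
}

$index = buildIndex([1 => 'Main Street', 2 => 'Mian St', 3 => 'Elm Road']);
$candidateIds = findCandidates($index, 'Main');
```

Any row sharing a word-level code with the search text becomes a candidate, so the list is deliberately generous; the distance ranking then trims and orders it.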
Check out the suggested answer on this thread. It's quite neat and should work nicely for DBs that aren't huge.