Problem description
In SQL Server, if I run the following:
Difference ('Kennady', 'Kary') : I get 2
And if I run this instead:
Difference ('Kary', 'Kennady') : I get 3.
I thought the DIFFERENCE function looks at the SOUNDEX values under the hood and returns a number from 0 to 4 indicating how many characters, position by position, are the same.
SELECT SOUNDEX('Kennady') AS [SoundEx Kennady]
, SOUNDEX('Kary') AS [SoundEx Kary]
, DIFFERENCE ('Kennady', 'Kary') AS [Difference Kennady vs Kary]
, DIFFERENCE ('Kary', 'Kennady') AS [Difference Kary vs Kennady];
Recommended answer
This is strictly observational. The documentation is pretty clear:
The integer returned is the number of characters in the SOUNDEX values that are the same. The return value ranges from 0 through 4: 0 indicates weak or no similarity, and 4 indicates strong similarity or the same values.
According to this documentation, the return value should not differ based on the order of the arguments.
From my queries: "Kennady" --> K530 and "Kary" --> K600. These have two characters in common (the leading K and the trailing 0), so the value should be 2.
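As a rough check of that reading of the documentation, the position-by-position comparison can be written directly in T-SQL. This is only a sketch of what the documentation seems to describe, not of how DIFFERENCE() is actually implemented; the `pairs` and `n` aliases and the column names are made up for the example.

```sql
-- Count the positions (1 to 4) at which the two SOUNDEX codes agree.
-- Assumption: this is the symmetric measure the documentation appears
-- to describe; it is NOT the actual DIFFERENCE() implementation.
SELECT pairs.a,
       pairs.b,
       SOUNDEX(pairs.a) AS soundex_a,
       SOUNDEX(pairs.b) AS soundex_b,
       (SELECT COUNT(*)
        FROM (VALUES (1), (2), (3), (4)) AS n(pos)
        WHERE SUBSTRING(SOUNDEX(pairs.a), n.pos, 1)
            = SUBSTRING(SOUNDEX(pairs.b), n.pos, 1)) AS positional_matches,
       DIFFERENCE(pairs.a, pairs.b) AS builtin_difference
FROM (VALUES ('Kennady', 'Kary'),
             ('Kary', 'Kennady')) AS pairs(a, b);
```

By construction, `positional_matches` cannot depend on argument order (K530 and K600 agree in two positions either way), so any disagreement between the two rows comes from the built-in function.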
Now, I notice that "Kenn" --> K500. Truncating "Kennady" to the length of "Kary" gives K500, which shares three characters with K600 and so matches the value 3. Hmmm.
Hence, I think that DIFFERENCE() is using the length of the first argument to truncate the second argument. That makes the order of the arguments important. Put the longer argument first.
I tried this out on some other strings, and the same pattern seems to hold. I have not found any documentation that specifies this behavior.
I suppose Microsoft would call this a "feature" and not a "bug" ;).
The above speculation is not quite correct. Consider the following:
- leepaupauld --> L114
- leopold --> L143
- leepaup --> L110
However,
- difference(leepaupauld, leopold) = 4 (!)
- difference(leopold, leepaupauld) = 3
- difference(leepaup, leopold) = 3 (!)
- difference(leopold, leepaup) = 2
The (!) marks my judgement that the result makes no sense at all, given the SOUNDEX values for the strings.
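To rerun these cases side by side, a query along the following lines should do (a sketch; the `t` alias and the column names are just for illustration). It shows each pair's SOUNDEX codes next to DIFFERENCE() evaluated in both argument orders.

```sql
-- Compare DIFFERENCE() in both argument orders for each pair.
SELECT t.a,
       t.b,
       SOUNDEX(t.a) AS soundex_a,
       SOUNDEX(t.b) AS soundex_b,
       DIFFERENCE(t.a, t.b) AS diff_a_b,
       DIFFERENCE(t.b, t.a) AS diff_b_a
FROM (VALUES ('leepaupauld', 'leopold'),
             ('leepaup', 'leopold')) AS t(a, b);
```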
So, the issue isn't the length. It is the underlying method, which @jpw points to in the comment. The problem appears to be duplicate matching values in one string. However, according to the documentation, a character should not be matched more than once.
My advice: Use Levenshtein distance. It makes sense. It works better on longer strings. It is sane. It is not built in, but it is easy enough to find an implementation on the web for any database.
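For reference, a minimal T-SQL sketch of Levenshtein distance follows. The function name `dbo.LevenshteinSketch` and the 100-character limit are arbitrary choices for this example. It fills the whole dynamic-programming table in a table variable, which is easy to follow but slow, so a tuned implementation would be preferable for real workloads. Character comparison follows the database collation, and LEN ignores trailing spaces.

```sql
-- Hypothetical helper: classic dynamic-programming Levenshtein distance.
CREATE FUNCTION dbo.LevenshteinSketch (@s VARCHAR(100), @t VARCHAR(100))
RETURNS INT
AS
BEGIN
    DECLARE @n INT = LEN(@s), @m INT = LEN(@t);
    DECLARE @i INT = 0, @j INT = 1;
    -- d(i, j) = edit distance between the first i chars of @s
    -- and the first j chars of @t
    DECLARE @d TABLE (i INT NOT NULL, j INT NOT NULL, cost INT NOT NULL,
                      PRIMARY KEY (i, j));

    -- Base cases: turning a prefix into the empty string (and back)
    WHILE @i <= @n
    BEGIN
        INSERT INTO @d (i, j, cost) VALUES (@i, 0, @i);
        SET @i += 1;
    END;
    WHILE @j <= @m
    BEGIN
        INSERT INTO @d (i, j, cost) VALUES (0, @j, @j);
        SET @j += 1;
    END;

    SET @i = 1;
    WHILE @i <= @n
    BEGIN
        SET @j = 1;
        WHILE @j <= @m
        BEGIN
            INSERT INTO @d (i, j, cost)
            SELECT @i, @j, MIN(v.c)
            FROM (
                SELECT cost + 1 AS c                       -- deletion
                FROM @d WHERE i = @i - 1 AND j = @j
                UNION ALL
                SELECT cost + 1                            -- insertion
                FROM @d WHERE i = @i AND j = @j - 1
                UNION ALL
                SELECT cost + CASE WHEN SUBSTRING(@s, @i, 1)
                                      = SUBSTRING(@t, @j, 1)
                                   THEN 0 ELSE 1 END       -- substitution
                FROM @d WHERE i = @i - 1 AND j = @j - 1
            ) AS v;
            SET @j += 1;
        END;
        SET @i += 1;
    END;

    RETURN (SELECT cost FROM @d WHERE i = @n AND j = @m);
END;
GO

-- Usage: the distance is symmetric, so both columns return the same value.
SELECT dbo.LevenshteinSketch('Kennady', 'Kary') AS a_b,
       dbo.LevenshteinSketch('Kary', 'Kennady') AS b_a;
```

Unlike DIFFERENCE(), a smaller number here means the strings are more similar, and 0 means they are identical.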