問題描述
我需要在 SQL 中找到兩個字符串的公共子字符串(沒有空格).
查詢:
選擇 *從 tbl 為 a,tbl 為 b其中 a.str <>b.str
示例數據:
str1 |str2 |不帶空格的最大子串----------+-----------+---------------------------——aabcdfbas |里克迪瓦 |發展基金aaab akuc |aaabir |aaabab akuc |ab atr |AB
我不同意那些說 SQL 不是這項工作的工具的人.在有人向我展示比我的任何編程語言的解決方案更快的方法之前,我會斷言 SQL(以基于集合的方式編寫,無副作用,僅使用不可變變量)是唯一em> 用于此工作的工具(處理 varchar(8000)- 或 nvarchar(4000) 時).下面的解決方案是針對 varchar(8000).
1.正確索引的計數(數字)表.
-- (1) 構建并填充一個持久化的(數字)計數如果 OBJECT_ID('dbo.tally') 不是 NULL DROP TABLE dbo.tally;CREATE TABLE dbo.tally (n int not null);WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))) t(N))插入數據庫SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;-- (2) 為性能添加必需的約束(和索引)ALTER TABLE dbo.tally添加約束 pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;ALTER TABLE dbo.tally添加約束 uq_tally UNIQUE NONCLUSTERED(N);
請注意,計數表功能的性能將不佳.
2.使用我們的計數表將所有可能的子串返回一個字符串
以abcd"為例,讓我們獲取它的所有子字符串.注意我的評論.
聲明@s1 varchar(8000) = 'abcd';選擇位置 = t.N,tokenSize = x.N,string = substring(@s1, t.N, x.N)FROM dbo.tally t -- 令牌位置CROSS JOIN dbo.tally x -- 令牌長度WHERE t.N <= len(@s1) -- 所有位置AND x.N <= len(@s1) -- 所有長度AND len(@s1) - t.N - (x.N-1) >= 0 -- 過濾不必要的行 [e.g.substring('abcd',3,2)]
返回
position tokenSize 字符串----------- ----------- -------1 1 一個2 1 乙3 1 c4 1 天1 2 抗體2 2 公元前3 2 cd1 3 美國廣播公司2 3 bcd1 4 abcd
3.dbo.getshortstring8K
這個功能是什么?第一個重大優化.我們將把兩個字符串中較短的字符串分成每個可能的子字符串,然后查看它是否存在于較長的字符串中.如果您有兩個字符串(S1 和 S2)并且 S1 比 S2 長,我們知道 S1 的任何一個比 S2 長的子字符串都不是 S2 的子字符串.這就是 dbo.getshortstring 的目的:確保我們不執行任何不必要的子字符串比較.這會更有意義.
這非常重要,因為可以使用
...沒有排序或不必要的操作.速度而已.對于更長的字符串(例如 50 個字符以上),我有一個更快的技術,您可以閱讀關于 這里.
I need to find common substring (without space) of two strings in SQL.
Query:
select *
from tbl as a, tbl as b
where a.str <> b.str
Sample data:
str1 | str2 | max substring without spaces
----------+-----------+-----------------------------
aabcdfbas | rikcdfva | cdf
aaab akuc | aaabir a | aaab
ab akuc | ab atr | ab
I disagree with those who say that SQL is not the tool for this job. Until someone can show me a faster way than my solution in ANY programming language, I will aver that SQL (written in a set-based, side-effect free, using only immutable variables) is the ONLY tool for this job (when dealing with varchar(8000)- or nvarchar(4000)). The solution below is for varchar(8000).
1. A correctly indexed tally (numbers) table.
-- (1) build and populate a persisted (numbers) tally
IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;
CREATE TABLE dbo.tally (n int not null);
WITH DummyRows(V) AS(SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N))
INSERT dbo.tally
SELECT TOP (8000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM DummyRows a CROSS JOIN DummyRows b CROSS JOIN DummyRows c CROSS JOIN DummyRows d;
-- (2) Add Required constraints (and indexes) for performance
ALTER TABLE dbo.tally
ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR = 100;
ALTER TABLE dbo.tally
ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);
Note that a tally table function will not perform as well.
2. Using our tally table to return all possible substrings a string
Using "abcd" as an example, let's get all of it's substrings. Note my comments.
DECLARE @s1 varchar(8000) = 'abcd';
SELECT
position = t.N,
tokenSize = x.N,
string = substring(@s1, t.N, x.N)
FROM dbo.tally t -- token position
CROSS JOIN dbo.tally x -- token length
WHERE t.N <= len(@s1) -- all positions
AND x.N <= len(@s1) -- all lengths
AND len(@s1) - t.N - (x.N-1) >= 0 -- filter unessesary rows [e.g.substring('abcd',3,2)]
This returns
position tokenSize string
----------- ----------- -------
1 1 a
2 1 b
3 1 c
4 1 d
1 2 ab
2 2 bc
3 2 cd
1 3 abc
2 3 bcd
1 4 abcd
3. dbo.getshortstring8K
What's this function about? The first major optimization. We're going to break the shorter of the two strings into every possible substring then see if it exists in the longer string. If you have two strings (S1 and S2) and S1 is longer than S2, we know that none of the substrings of S1, that are longer than S2, will be a substring of S2. That's the purpose of dbo.getshortstring: to ensure that we don't perform any unnecessary substring comparisons. This will make more sense in a moment.
This is hugely important because, the number of substrings in a string can be calculated using a Triangle Number Function. With N as the length (number of characters) in a string, the number of substrings can be calculated as N*(N+1)/2. E.g. "abc" has 6 substrings: 3*(3+1)/2 = 6; a,b,c,ab,bc,abc. If we're comparing "abc" to "abcdefgh" we don't need to check if "abcd" is a substring of "abc".
Breaking "abcdefgh" (length=8) into all possible substrings requires 8*(8+1)/2 = 36 operations (vs 6 for "abc").
IF OBJECT_ID('dbo.getshortstring8k') IS NOT NULL DROP FUNCTION dbo.getshortstring8k;
GO
CREATE FUNCTION dbo.getshortstring8k(@s1 varchar(8000), @s2 varchar(8000))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT s1 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s1 ELSE @s2 END,
s2 = CASE WHEN LEN(@s1) < LEN(@s2) THEN @s2 ELSE @s1 END;
4. Finding all subsrings of the shorter string that exist in the longer string:
DECLARE @s1 varchar(8000) = 'bcdabc', @s2 varchar(8000) = 'abcd';
SELECT
s.s1, -- test to make sure s.s1 is the shorter of the two strings
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s --<< get the shorter string
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0;
5. Retrieving ONLY the longest common substring(s)
This is the easy part. We simply Add TOP (1) WITH TIES
to our SELECT
statement and we're all set. Here, the longest common substring is "bc" and "xx"
DECLARE @s1 varchar(8000) = 'xxabcxx', @s2 varchar(8000) = 'bcdxx';
SELECT TOP (1) WITH TIES
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(@s1, @s2) s
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0
ORDER BY x.N DESC;
6. Applying this logic to your table
Using APPLY we replace my variables @s1 and @s2 with the t.str1 & t.str2. I add a filter to exclude matches that contain spaces (see my comments)... And we're off:
-- easily consumbable sample data
DECLARE @yourtable TABLE (str1 varchar(8000), str2 varchar(8000));
INSERT @yourtable
VALUES ('aabcdfbas','rikcdfva'),('aaab akuc','aaabir a'),('ab akuc','ab atr');
SELECT str1, str2, [max substring without spaces] = string
FROM @yourtable t
CROSS APPLY
(
SELECT TOP (1) WITH TIES
position = t.N,
tokenSize = x.N,
string = substring(s.s1, t.N, x.N)
FROM dbo.getshortstring8k(t.str1, t.str2) s -- @s1 & @s2 replaced with str1 & str2
CROSS JOIN dbo.tally t
CROSS JOIN dbo.tally x
WHERE t.N between 1 and len(s.s1)
AND x.N between 1 and len(s.s1)
AND len(s.s1) - t.N - (x.N-1) >= 0
AND charindex(substring(s.s1, t.N, x.N), s.s2) > 0
AND charindex(' ',substring(s.s1, t.N, x.N)) = 0 -- exclude substrings with spaces
ORDER BY x.N DESC
) lcss;
Results:
str1 str2 max substring without spaces
----------- --------- ------------------------------
aabcdfbas rikcdfva cdf
aaab akuc aaabir a aaab
ab akuc ab atr ab
And the execution plan:
... No sorts or unnecessary operations. Just speed. For longer strings (e.g. 50 characters+) I have an even faster technique you can read about here.
這篇關于SQL中兩個字符串的公共子串的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!