問題描述
我有一個(gè)這樣定義的視圖:
I have a view defined like this:
CREATE VIEW [dbo].[PossiblyMatchingContracts] AS
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts
FROM [dbo].AllContracts AS C
INNER JOIN [dbo].AllContracts AS CC
ON C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE C.UniqueID NOT IN
(
SELECT UniqueID FROM [dbo].DefinitiveMatches
)
AND C.AssociatedUser IS NULL
AND C.UniqueID <> CC.UniqueID
這基本上是尋找合同,其中 f.e.名字和生日匹配.這很好用.現(xiàn)在我想為每一行添加一個(gè)綜合屬性,其中的值僅來自一個(gè)源行.
Which is basically finding contracts where f.e. the first name and the birthday are matching. This works great. Now I want to add a synthetic attribute to each row with the value from only one source row.
讓我舉個(gè)例子讓你更清楚.假設(shè)我有下表:
Let me give you an example to make it clearer. Suppose I have the following table:
UniqueID | FirstName | LastName | Birthday
1 | Peter | Smith | 1980-11-04
2 | Peter | Gray | 1980-11-04
3 | Peter | Gray-Smith| 1980-11-04
4 | Frank | May | 1985-06-09
5 | Frank-Paul| May | 1985-06-09
6 | Gina | Ericson | 1950-11-04
結(jié)果視圖應(yīng)如下所示:
UniqueID | PossiblyMatchingContracts | SyntheticID
1 | 2 | PeterSmith1980-11-04
1 | 3 | PeterSmith1980-11-04
2 | 1 | PeterSmith1980-11-04
2 | 3 | PeterSmith1980-11-04
3 | 1 | PeterSmith1980-11-04
3 | 2 | PeterSmith1980-11-04
4 | 5 | FrankMay1985-06-09
5 | 4 | FrankMay1985-06-09
6 | NULL | NULL [or] GinaEricson1950-11-04
請(qǐng)注意 SyntheticID 列僅使用匹配源行之一的值.哪一個(gè)都無所謂.我將此視圖導(dǎo)出到另一個(gè)應(yīng)用程序,之后需要能夠識(shí)別每個(gè)匹配組".
Notice that the SyntheticID column uses ONLY values from one of the matching source rows. It doesn't matter which one. I am exporting this view to another application and need to be able to identify each "match group" afterwards.
明白我的意思了嗎?任何想法如何在 sql 中完成?
Is it clear what I mean? Any ideas how this could be done in sql?
也許詳細(xì)說明一下實(shí)際用例會(huì)有所幫助:
Maybe it helps to elaborate a bit on the actual use case:
我正在從不同系統(tǒng)導(dǎo)入合同.考慮到拼寫錯(cuò)誤或已婚但姓氏僅在一個(gè)系統(tǒng)中更新的人的可能性,我需要找到所謂的可能匹配".如果兩個(gè)或多個(gè)合同包含相同的生日加上相同的名字、姓氏或出生姓名,則它們被視為可能匹配.這意味著,如果合約 A 匹配合約 B,則合約 B 也匹配合約 A.
I am importing contracts from different systems. To account for the possibility of typos or people that have married but the last name was only updated in one system, I need to find so called 'possible matches'. Two or more contracts are considered a possible match if they contain the same birthday plus the same first, last or birth name. That implies, that if contract A matches contract B, contract B also matches contract A.
目標(biāo)系統(tǒng)使用多值引用屬性來存儲(chǔ)這些關(guān)系.最終目標(biāo)是為這些合約創(chuàng)建用戶對(duì)象.首先要注意的是,對(duì)于多個(gè)匹配的合約,它應(yīng)該只是一個(gè)用戶對(duì)象.因此,我在視圖中創(chuàng)建這些匹配項(xiàng).第二個(gè)問題是,用戶對(duì)象的創(chuàng)建是通過工作流進(jìn)行的,每個(gè)合約并行運(yùn)行.為避免為匹配合約創(chuàng)建多個(gè)用戶對(duì)象,每個(gè)工作流都需要檢查是否已經(jīng)存在匹配的用戶對(duì)象或另一個(gè)工作流,即將創(chuàng)建該用戶對(duì)象.由于工作流引擎與 sql 相比非常慢,因此工作流不應(yīng)重復(fù)整個(gè)匹配測試.所以這個(gè)想法是,讓工作流只檢查syntheticID".
The target system uses multivalue reference attributes to store these relationships. The ultimate goal is to create user objects for these contracts. The catch first is, that the shall only be one user object for multiple matching contracts. Thus I'm creating these matches in the view. The second catch is, that the creation of user objects happens by workflows, which run parallel for each contract. To avoid creating multiple user objects for matching contracts, each workflow needs to check, if there is already a matching user object or another workflow, which is about to create said user object. Because the workflow engine is extremely slow compared to sql, the workflows should not repeat the whole matching test. So the idea is, to let the workflow check only for the 'syntheticID'.
推薦答案
我已經(jīng)用多步方法解決了:
I have solved it with a multi step approach:
- 為每個(gè)合同創(chuàng)建可能的第一級(jí)匹配列表
- 創(chuàng)建基本組列表,為其分配不同的組每個(gè)合同(好像它們與任何人都沒有關(guān)系)
- 當(dāng)需要更多合約時(shí),迭代匹配列表更新組列表加入群組
- 從最終組列表中遞歸構(gòu)建 SyntheticID
- 輸出結(jié)果
首先,讓我解釋一下我的理解,以便您判斷我的方法是否正確.
First of all, let me explain what I have understood, so you can tell if my approach is correct or not.
1) 匹配在級(jí)聯(lián)"中傳播
1) matching propagates in "cascade"
我的意思是,如果Peter Smith"與Peter Gray"歸為一組,則意味著所有 Smith 和所有 Gray 都是相關(guān)的(如果他們的出生日期相同),因此 Luke Smith 可以屬于 John 的同一組灰色
I mean, if "Peter Smith" is grouped up with "Peter Gray", it means that all Smith and all Gray are related (if they have the same birth date) so Luke Smith can be in the same group of John Gray
2) 我不明白你說的出生名"是什么意思
2) I have not understood what you mean with "Birth Name"
你說合同匹配名字,姓氏或出生名",對(duì)不起,我是意大利人,我認(rèn)為出生名和名字是一樣的,在你的數(shù)據(jù)中也沒有這樣的列.也許它與名稱之間的破折號(hào)有關(guān)?
當(dāng) FirstName 是 Frank-Paul 時(shí),這意味著它應(yīng)該同時(shí)匹配 Frank 和 Paul?
當(dāng)姓氏是 Gray-Smith 時(shí),這意味著它應(yīng)該同時(shí)匹配 Gray 和 Smith?
You say contracts matches on "first, last or birth name", sorry, I'm italian, I thought birth name and first were the same, also in your data there is not such column. Maybe it is related to that dash symbol between names?
When FirstName is Frank-Paul it means it should match both Frank and Paul?
When LastName is Gray-Smith it means it should match both Gray and Smith?
在下面的代碼中,我簡單地忽略了這個(gè)問題,但如果需要的話可以處理它(我已經(jīng)嘗試過,打破名稱,取消旋轉(zhuǎn)它們并視為雙重匹配).
In following code I have simply ignored this problem, but it could be handled if needed (I already did a try, breaking names, unpivoting them and treating as double match).
步驟零:一些聲明和準(zhǔn)備基礎(chǔ)數(shù)據(jù)
declare @cli as table (UniqueID int primary key, FirstName varchar(20), LastName varchar(20), Birthday varchar(20))
declare @comb as table (id1 int, id2 int, done bit)
declare @grp as table (ix int identity primary key, grp int, id int, unique (grp,ix))
declare @str_id as table (grp int primary key, SyntheticID varchar(1000))
declare @id1 as int, @g int
;with
t as (
select *
from (values
(1 , 'Peter' , 'Smith' , '1980-11-04'),
(2 , 'Peter' , 'Gray' , '1980-11-04'),
(3 , 'Peter' , 'Gray-Smith', '1980-11-04'),
(4 , 'Frank' , 'May' , '1985-06-09'),
(5 , 'Frank-Paul', 'May' , '1985-06-09'),
(6 , 'Gina' , 'Ericson' , '1950-11-04')
) x (UniqueID , FirstName , LastName , Birthday)
)
insert into @cli
select * from t
第一步:為每個(gè)合同創(chuàng)建可能的第一級(jí)匹配列表
;with
p as(select UniqueID, Birthday, FirstName, LastName from @cli),
m as (
select p.UniqueID UniqueID1, p.FirstName FirstName1, p.LastName LastName1, p.Birthday Birthday1, pp.UniqueID UniqueID2, pp.FirstName FirstName2, pp.LastName LastName2, pp.Birthday Birthday2
from p
join p pp on (pp.Birthday=p.Birthday) and (pp.FirstName = p.FirstName or pp.LastName = p.LastName)
where p.UniqueID<=pp.UniqueID
)
insert into @comb
select UniqueID1,UniqueID2,0
from m
第二步:創(chuàng)建基本組列表
insert into @grp
select ROW_NUMBER() over(order by id1), id1 from @comb where id1=id2
第三步:迭代匹配列表更新組列表僅在需要時(shí)才循環(huán)具有可能匹配和更新的合約
Step Three: Iterate the matches list updating the group list Only loop on contracts that have possible matches and updates only if needed
set @id1 = 0
while not(@id1 is null) begin
set @id1 = (select top 1 id1 from @comb where id1<>id2 and done=0)
if not(@id1 is null) begin
set @g = (select grp from @grp where id=@id1)
update g set grp= @g
from @grp g
inner join @comb c on g.id = c.id2
where c.id2<>@id1 and c.id1=@id1
and grp<>@g
update @comb set done=1 where id1=@id1
end
end
第四步:建立 SyntheticID遞歸地將組的所有(不同的)名字和姓氏添加到 SyntheticID.
我使用_"作為出生日期、名字和姓氏的分隔符,并使用,"作為姓名列表的分隔符以避免沖突.
Step Four: Build up the SyntheticID
Recursively add ALL (distinct) first and last names of group to SyntheticID.
I used '_' as separator for birth date, first names and last names, and ',' as separator for the list of names to avoid conflicts.
;with
c as(
select c.*, g.grp
from @cli c
join @grp g on g.id = c.UniqueID
),
d as (
select *, row_number() over (partition by g order by t,s) n1, row_number() over (partition by g order by t desc,s desc) n2
from (
select distinct c.grp g, 1 t, FirstName s from c
union
select distinct c.grp, 2, LastName from c
) l
),
r as (
select d.*, cast(CONVERT(VARCHAR(10), t.Birthday, 112) + '_' + s as varchar(1000)) Names, cast(0 as bigint) i1, cast(0 as bigint) i2
from d
join @cli t on t.UniqueID=d.g
where n1=1
union all
select d.*, cast(r.names + IIF(r.t<>d.t,'_',',') + d.s as varchar(1000)), r.n1, r.n2
from d
join r on r.g = d.g and r.n1=d.n1-1
)
insert into @str_id
select g, Names
from r
where n2=1
第五步:輸出結(jié)果
select c.UniqueID, case when id2=UniqueID then id1 else id2 end PossibleMatchingContract, s.SyntheticID
from @cli c
left join @comb cb on c.UniqueID in(id1,id2) and id1<>id2
left join @grp g on c.UniqueID = g.id
left join @str_id s on s.grp = g.grp
結(jié)果如下
UniqueID PossibleMatchingContract SyntheticID
1 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
1 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
4 5 1985-06-09_Frank,Frank-Paul_May
5 4 1985-06-09_Frank,Frank-Paul_May
6 NULL 1950-11-04_Gina_Ericson
我認(rèn)為這樣產(chǎn)生的 SyntheticID 也應(yīng)該是每個(gè)組的唯一"
I think that in this way the resulting SyntheticID should also be "unique" for each group
這篇關(guān)于如何合成連接表的屬性的文章就介紹到這了,希望我們推薦的答案對(duì)大家有所幫助,也希望大家多多支持html5模板網(wǎng)!