問題描述
我有一個這樣定義的視圖:
I have a view defined like this:
CREATE VIEW [dbo].[PossiblyMatchingContracts] AS
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts
FROM [dbo].AllContracts AS C
INNER JOIN [dbo].AllContracts AS CC
ON C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE C.UniqueID NOT IN
(
SELECT UniqueID FROM [dbo].DefinitiveMatches
)
AND C.AssociatedUser IS NULL
AND C.UniqueID <> CC.UniqueID
這基本上是尋找合同,其中 f.e.名字和生日匹配.這很好用.現在我想為每一行添加一個綜合屬性,其中的值僅來自一個源行.
Which is basically finding contracts where f.e. the first name and the birthday are matching. This works great. Now I want to add a synthetic attribute to each row with the value from only one source row.
讓我舉個例子讓你更清楚.假設我有下表:
Let me give you an example to make it clearer. Suppose I have the following table:
UniqueID | FirstName | LastName | Birthday
1 | Peter | Smith | 1980-11-04
2 | Peter | Gray | 1980-11-04
3 | Peter | Gray-Smith| 1980-11-04
4 | Frank | May | 1985-06-09
5 | Frank-Paul| May | 1985-06-09
6 | Gina | Ericson | 1950-11-04
結果視圖應如下所示:
UniqueID | PossiblyMatchingContracts | SyntheticID
1 | 2 | PeterSmith1980-11-04
1 | 3 | PeterSmith1980-11-04
2 | 1 | PeterSmith1980-11-04
2 | 3 | PeterSmith1980-11-04
3 | 1 | PeterSmith1980-11-04
3 | 2 | PeterSmith1980-11-04
4 | 5 | FrankMay1985-06-09
5 | 4 | FrankMay1985-06-09
6 | NULL | NULL [or] GinaEricson1950-11-04
請注意 SyntheticID 列僅使用匹配源行之一的值.哪一個都無所謂.我將此視圖導出到另一個應用程序,之后需要能夠識別每個匹配組".
Notice that the SyntheticID column uses ONLY values from one of the matching source rows. It doesn't matter which one. I am exporting this view to another application and need to be able to identify each "match group" afterwards.
明白我的意思了嗎?任何想法如何在 sql 中完成?
Is it clear what I mean? Any ideas how this could be done in sql?
也許詳細說明一下實際用例會有所幫助:
Maybe it helps to elaborate a bit on the actual use case:
我正在從不同系統導入合同.考慮到拼寫錯誤或已婚但姓氏僅在一個系統中更新的人的可能性,我需要找到所謂的可能匹配".如果兩個或多個合同包含相同的生日加上相同的名字、姓氏或出生姓名,則它們被視為可能匹配.這意味著,如果合約 A 匹配合約 B,則合約 B 也匹配合約 A.
I am importing contracts from different systems. To account for the possibility of typos or people that have married but the last name was only updated in one system, I need to find so called 'possible matches'. Two or more contracts are considered a possible match if they contain the same birthday plus the same first, last or birth name. That implies, that if contract A matches contract B, contract B also matches contract A.
目標系統使用多值引用屬性來存儲這些關系.最終目標是為這些合約創建用戶對象.首先要注意的是,對于多個匹配的合約,它應該只是一個用戶對象.因此,我在視圖中創建這些匹配項.第二個問題是,用戶對象的創建是通過工作流進行的,每個合約并行運行.為避免為匹配合約創建多個用戶對象,每個工作流都需要檢查是否已經存在匹配的用戶對象或另一個工作流,即將創建該用戶對象.由于工作流引擎與 sql 相比非常慢,因此工作流不應重復整個匹配測試.所以這個想法是,讓工作流只檢查syntheticID".
The target system uses multivalue reference attributes to store these relationships. The ultimate goal is to create user objects for these contracts. The catch first is, that the shall only be one user object for multiple matching contracts. Thus I'm creating these matches in the view. The second catch is, that the creation of user objects happens by workflows, which run parallel for each contract. To avoid creating multiple user objects for matching contracts, each workflow needs to check, if there is already a matching user object or another workflow, which is about to create said user object. Because the workflow engine is extremely slow compared to sql, the workflows should not repeat the whole matching test. So the idea is, to let the workflow check only for the 'syntheticID'.
推薦答案
我已經用多步方法解決了:
I have solved it with a multi step approach:
- 為每個合同創建可能的第一級匹配列表
- 創建基本組列表,為其分配不同的組每個合同(好像它們與任何人都沒有關系)
- 當需要更多合約時,迭代匹配列表更新組列表加入群組
- 從最終組列表中遞歸構建 SyntheticID
- 輸出結果
首先,讓我解釋一下我的理解,以便您判斷我的方法是否正確.
First of all, let me explain what I have understood, so you can tell if my approach is correct or not.
1) 匹配在級聯"中傳播
1) matching propagates in "cascade"
我的意思是,如果Peter Smith"與Peter Gray"歸為一組,則意味著所有 Smith 和所有 Gray 都是相關的(如果他們的出生日期相同),因此 Luke Smith 可以屬于 John 的同一組灰色
I mean, if "Peter Smith" is grouped up with "Peter Gray", it means that all Smith and all Gray are related (if they have the same birth date) so Luke Smith can be in the same group of John Gray
2) 我不明白你說的出生名"是什么意思
2) I have not understood what you mean with "Birth Name"
你說合同匹配名字,姓氏或出生名",對不起,我是意大利人,我認為出生名和名字是一樣的,在你的數據中也沒有這樣的列.也許它與名稱之間的破折號有關?
當 FirstName 是 Frank-Paul 時,這意味著它應該同時匹配 Frank 和 Paul?
當姓氏是 Gray-Smith 時,這意味著它應該同時匹配 Gray 和 Smith?
You say contracts matches on "first, last or birth name", sorry, I'm italian, I thought birth name and first were the same, also in your data there is not such column. Maybe it is related to that dash symbol between names?
When FirstName is Frank-Paul it means it should match both Frank and Paul?
When LastName is Gray-Smith it means it should match both Gray and Smith?
在下面的代碼中,我簡單地忽略了這個問題,但如果需要的話可以處理它(我已經嘗試過,打破名稱,取消旋轉它們并視為雙重匹配).
In following code I have simply ignored this problem, but it could be handled if needed (I already did a try, breaking names, unpivoting them and treating as double match).
步驟零:一些聲明和準備基礎數據
declare @cli as table (UniqueID int primary key, FirstName varchar(20), LastName varchar(20), Birthday varchar(20))
declare @comb as table (id1 int, id2 int, done bit)
declare @grp as table (ix int identity primary key, grp int, id int, unique (grp,ix))
declare @str_id as table (grp int primary key, SyntheticID varchar(1000))
declare @id1 as int, @g int
;with
t as (
select *
from (values
(1 , 'Peter' , 'Smith' , '1980-11-04'),
(2 , 'Peter' , 'Gray' , '1980-11-04'),
(3 , 'Peter' , 'Gray-Smith', '1980-11-04'),
(4 , 'Frank' , 'May' , '1985-06-09'),
(5 , 'Frank-Paul', 'May' , '1985-06-09'),
(6 , 'Gina' , 'Ericson' , '1950-11-04')
) x (UniqueID , FirstName , LastName , Birthday)
)
insert into @cli
select * from t
第一步:為每個合同創建可能的第一級匹配列表
;with
p as(select UniqueID, Birthday, FirstName, LastName from @cli),
m as (
select p.UniqueID UniqueID1, p.FirstName FirstName1, p.LastName LastName1, p.Birthday Birthday1, pp.UniqueID UniqueID2, pp.FirstName FirstName2, pp.LastName LastName2, pp.Birthday Birthday2
from p
join p pp on (pp.Birthday=p.Birthday) and (pp.FirstName = p.FirstName or pp.LastName = p.LastName)
where p.UniqueID<=pp.UniqueID
)
insert into @comb
select UniqueID1,UniqueID2,0
from m
第二步:創建基本組列表
insert into @grp
select ROW_NUMBER() over(order by id1), id1 from @comb where id1=id2
第三步:迭代匹配列表更新組列表僅在需要時才循環具有可能匹配和更新的合約
Step Three: Iterate the matches list updating the group list Only loop on contracts that have possible matches and updates only if needed
set @id1 = 0
while not(@id1 is null) begin
set @id1 = (select top 1 id1 from @comb where id1<>id2 and done=0)
if not(@id1 is null) begin
set @g = (select grp from @grp where id=@id1)
update g set grp= @g
from @grp g
inner join @comb c on g.id = c.id2
where c.id2<>@id1 and c.id1=@id1
and grp<>@g
update @comb set done=1 where id1=@id1
end
end
第四步:建立 SyntheticID遞歸地將組的所有(不同的)名字和姓氏添加到 SyntheticID.
我使用_"作為出生日期、名字和姓氏的分隔符,并使用,"作為姓名列表的分隔符以避免沖突.
Step Four: Build up the SyntheticID
Recursively add ALL (distinct) first and last names of group to SyntheticID.
I used '_' as separator for birth date, first names and last names, and ',' as separator for the list of names to avoid conflicts.
;with
c as(
select c.*, g.grp
from @cli c
join @grp g on g.id = c.UniqueID
),
d as (
select *, row_number() over (partition by g order by t,s) n1, row_number() over (partition by g order by t desc,s desc) n2
from (
select distinct c.grp g, 1 t, FirstName s from c
union
select distinct c.grp, 2, LastName from c
) l
),
r as (
select d.*, cast(CONVERT(VARCHAR(10), t.Birthday, 112) + '_' + s as varchar(1000)) Names, cast(0 as bigint) i1, cast(0 as bigint) i2
from d
join @cli t on t.UniqueID=d.g
where n1=1
union all
select d.*, cast(r.names + IIF(r.t<>d.t,'_',',') + d.s as varchar(1000)), r.n1, r.n2
from d
join r on r.g = d.g and r.n1=d.n1-1
)
insert into @str_id
select g, Names
from r
where n2=1
第五步:輸出結果
select c.UniqueID, case when id2=UniqueID then id1 else id2 end PossibleMatchingContract, s.SyntheticID
from @cli c
left join @comb cb on c.UniqueID in(id1,id2) and id1<>id2
left join @grp g on c.UniqueID = g.id
left join @str_id s on s.grp = g.grp
結果如下
UniqueID PossibleMatchingContract SyntheticID
1 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
1 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
4 5 1985-06-09_Frank,Frank-Paul_May
5 4 1985-06-09_Frank,Frank-Paul_May
6 NULL 1950-11-04_Gina_Ericson
我認為這樣產生的 SyntheticID 也應該是每個組的唯一"
I think that in this way the resulting SyntheticID should also be "unique" for each group
這篇關于如何合成連接表的屬性的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!