Problem description
I am still new to working in databases, so please have patience with me. I have read through a number of similar questions, but none of them seem to be talking about the same issue I am facing.
Just a bit of info on what I am doing: I have a table filled with contact information. Some of the contacts are duplicated, but most of the duplicated rows have a truncated phone number, which makes that data useless.
I wrote the following query to search for the duplicates:
WITH CTE (CID, Firstname, lastname, phone, email, length, dupcnt) AS
(
    SELECT
        CID, Firstname, lastname, phone, email, LEN(phone) AS length,
        ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
                           ORDER BY Firstname) AS dupcnt
    FROM
        [data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1
  AND length <= 10
I assumed that this query would find all records that have duplicates based on the three columns that I have specified, and select any that have the dupcnt greater than 1, and a phone column with a length less than or equal to 10. But when I run the query more than once I get different result sets each execution. There must be some logic that I am missing here, but I am completely baffled by this. All of the columns are of varchar datatype, except for CID, which is int.
Recommended answer
Instead of ROW_NUMBER() use COUNT(*), and remove the ORDER BY since that's not necessary with COUNT(*).
The way you have it now, you are chunking up records into groups/partitions of similar records by firstname/lastname/email. Then you are ordering each group/partition by firstname. Firstname is part of the partition, meaning every firstname in that group/partition is identical, so the ORDER BY cannot tell the rows apart. You will get different results depending on how SQL Server fetches the rows from storage (whichever record it finds first is 1, the one it finds second is 2). Every time it fetches records (every time you run this SQL) it may fetch each record from disk or cache in a different order.
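To make the tie problem concrete, here is a minimal sketch on a hypothetical table variable, @contacts, with made-up sample rows (neither is from the question). It puts the tied ordering next to one that uses CID as a tiebreaker, under the assumption that CID is unique. This only illustrates why the numbering is unstable; it is not the fix this answer recommends.

```sql
-- Hypothetical sample data: the table variable and its rows are invented for illustration.
DECLARE @contacts TABLE (
    CID       int,
    Firstname varchar(50),
    lastname  varchar(50),
    phone     varchar(20),
    email     varchar(100)
);

INSERT INTO @contacts (CID, Firstname, lastname, phone, email) VALUES
    (1, 'Ann', 'Lee', '555-010-1234', 'ann@example.com'),
    (2, 'Ann', 'Lee', '555-010',      'ann@example.com'),
    (3, 'Bob', 'Ray', '555-020-9876', 'bob@example.com');

SELECT
    CID, Firstname, lastname, phone, email,
    -- Not repeatable: Firstname is constant inside each partition, so every row ties
    -- and SQL Server may number the tied rows in any order.
    ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
                       ORDER BY Firstname) AS rn_tied,
    -- Repeatable, assuming CID is unique: the sort key now distinguishes the rows.
    ROW_NUMBER() OVER (PARTITION BY Firstname, lastname, email
                       ORDER BY CID) AS rn_by_cid
FROM @contacts;
```

With the tied ordering, either of the two 'Ann' rows may receive number 1 on a given run, which is why a filter such as dupcnt > 1 can pick a different row each time.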
COUNT(*) will return all of the duplicated rows. Change the expression to:
COUNT(*) OVER (PARTITION BY Firstname, lastname, email ) AS dupcnt
This returns the number of records that share the same firstname, lastname, and email. You then keep any record whose count is greater than 1.
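Put together with the query from the question, the full statement might look like the sketch below. It assumes the [data.com_raw] table and the column names from the question, and it keeps the original length <= 10 filter unchanged.

```sql
WITH CTE AS
(
    SELECT
        CID, Firstname, lastname, phone, email,
        LEN(phone) AS length,
        -- Same value for every row in the partition, so no ORDER BY is needed
        -- and the result no longer depends on fetch order.
        COUNT(*) OVER (PARTITION BY Firstname, lastname, email) AS dupcnt
    FROM [data.com_raw]
)
SELECT *
FROM CTE
WHERE dupcnt > 1      -- the name/email combination occurs more than once
  AND length <= 10;   -- and this particular row has a short phone value
```

Because dupcnt is now the size of the group rather than a position within it, every row of a duplicated group passes the first condition, and the length filter alone decides which of those rows are returned.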