Problem description
I am trying to find duplicate rows in a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]], columns=['col1', 'col2'])
df
Out[15]:
   col1  col2
0     1     2
1     3     4
2     1     2
3     1     4
4     1     2
duplicate_bool = df.duplicated(subset=['col1','col2'], keep='first')
duplicate = df.loc[duplicate_bool]
duplicate
Out[16]:
   col1  col2
2     1     2
4     1     2
Is there a way to add a column that refers to the index of the first duplicate (the one that is kept)? The desired result looks like this:
duplicate
Out[16]:
   col1  col2  index_original
2     1     2               0
4     1     2               0
Note: df could be very large in my case.
Recommended answer
Use groupby to create a new column holding the first index of each group, and then call duplicated:
df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')
df[df.duplicated(subset=['col1','col2'], keep='first')]
   col1  col2  index_original
2     1     2               0
4     1     2               0

Details
I group by the first two columns and then call transform + idxmin to get the first index of each group. Because col1 is one of the grouping keys, its value is constant within a group, so idxmin returns the label of the group's first row.
df.groupby(['col1', 'col2']).col1.transform('idxmin')
0    0
1    1
2    0
3    3
4    0
Name: col1, dtype: int64
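The idxmin call above relies on the chosen column being orderable. As a minimal sketch of an alternative (not part of the original answer), the index itself can be grouped and its first label broadcast to every row of the group. The names index_values and first_index are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)

# Turn the index into a Series so it can be grouped like any other column.
index_values = df.index.to_series()

# For every row, take the first index label seen in its (col1, col2) group.
first_index = index_values.groupby([df["col1"], df["col2"]]).transform("first")

print(first_index.tolist())  # [0, 1, 0, 3, 0]
```

This should give the same labels as transform('idxmin') on this data, and it does not compare the column values themselves.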
duplicated gives me a boolean mask of the rows I want to keep, that is, every occurrence after the first:
df.duplicated(subset=['col1','col2'], keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool
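For comparison, here is a small sketch of how the keep argument changes which rows the mask flags on the same sample data. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)
subset = ["col1", "col2"]

# keep='first': every occurrence except the first is flagged.
after_first = df.duplicated(subset=subset, keep="first")
# keep='last': every occurrence except the last is flagged.
before_last = df.duplicated(subset=subset, keep="last")
# keep=False: every row that has at least one identical twin is flagged.
all_copies = df.duplicated(subset=subset, keep=False)

print(after_first.tolist())  # [False, False, True, False, True]
print(before_last.tolist())  # [True, False, True, False, False]
print(all_copies.tolist())   # [True, False, True, False, True]
```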
The rest is just boolean indexing.
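Putting the pieces together, the whole approach might be wrapped in a helper like the one below. This is only a sketch: the function name duplicates_with_original_index is hypothetical, and it assumes the subset columns contain no missing values and that the first subset column is orderable so idxmin can be applied to it. Unlike the snippet in the answer, it leaves the input frame unmodified.

```python
import pandas as pd


def duplicates_with_original_index(df, subset):
    """Return the repeated rows plus the index label of their first occurrence."""
    # Index label of the first row in each group of identical `subset` values.
    first_index = df.groupby(subset)[subset[0]].transform("idxmin")
    # True for every occurrence after the first one.
    mask = df.duplicated(subset=subset, keep="first")
    # Boolean indexing keeps only the repeated rows; assign adds the new column.
    return df[mask].assign(index_original=first_index[mask])


df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)
result = duplicates_with_original_index(df, ["col1", "col2"])
print(result)
```

On the sample data the result contains rows 2 and 4, each with index_original equal to 0.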