Problem description
How can I sort a DataFrame so that rows in the duplicate column are "recycled"?
For example, my original DataFrame looks like this:
In [3]: df
Out[3]:
A B
0 r1 0
1 r1 1
2 r2 2
3 r2 3
4 r3 4
5 r3 5
I would like to turn it into:
In [3]: df_sorted
Out[3]:
A B
0 r1 0
2 r2 2
4 r3 4
1 r1 1
3 r2 3
5 r3 5
The rows are sorted so that the values in column A appear in a "recycled" fashion.
I have searched the pandas API, but there does not seem to be a proper method for this. I could write a complicated function to accomplish it, but I am wondering whether there is a smart way or an existing pandas method that can do this. Thanks a lot in advance.
Update: Apologies for the wrong statement. In my real problem, column B contains string values.
Recommended answer
You can use cumcount to count the duplicates in column A and store the result in a helper column C, then use sort_values to sort first by C and then by A (the second key is not necessary in this sample, but may be important in real data). Finally, remove column C with drop:
df['C'] = df.groupby('A')['A'].cumcount()
df.sort_values(by=['C', 'A'], inplace=True)
print (df)
A B C
0 r1 0 0
2 r2 2 0
4 r3 4 0
1 r1 1 1
3 r2 3 1
5 r3 5 1
df.drop('C', axis=1, inplace=True)
print (df)
A B
0 r1 0
2 r2 2
4 r3 4
1 r1 1
3 r2 3
5 r3 5
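The steps above can also be sketched as a self-contained function that leaves the input frame untouched. The function name recycle_sort and the helper column name _order are hypothetical, chosen only for this illustration. Since cumcount depends only on the position of each row within its group, not on the values in B, the sketch uses string values in B as described in the update.

```python
import pandas as pd


def recycle_sort(df, col):
    """Return a copy of df ordered so that the values in `col` cycle.

    `recycle_sort` and `_order` are illustrative names, not pandas API.
    """
    # Position of each row within its group: 0 for the first r1, 1 for the second, ...
    order = df.groupby(col).cumcount()
    return (
        df.assign(_order=order)                # temporary helper column
          .sort_values(by=['_order', col])     # occurrence number first, then group key
          .drop(columns='_order')              # remove the helper again
    )


df = pd.DataFrame({
    'A': ['r1', 'r1', 'r2', 'r2', 'r3', 'r3'],
    'B': ['a', 'b', 'c', 'd', 'e', 'f'],       # string values, as in the update
})
result = recycle_sort(df, 'A')
print(result)
```

The original index labels are kept, so the result has index 0, 2, 4, 1, 3, 5; append .reset_index(drop=True) if a fresh index is preferred.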
Timings:
Small df (len(df)=6):
In [26]: %timeit (jez(df))
1000 loops, best of 3: 2 ms per loop
In [27]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop
Large df (len(df)=6000):
In [23]: %timeit (jez(df))
100 loops, best of 3: 3.44 ms per loop
In [28]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop
Code for timing:
df = pd.concat([df]*1000).reset_index(drop=True)
df1 = df.copy()

def jez(df):
    df['C'] = df.groupby('A')['A'].cumcount()
    df.sort_values(by=['C', 'A'], inplace=True)
    df.drop('C', axis=1, inplace=True)
    return (df)

def boud(df):
    df['C'] = df.groupby('A')['B'].rank()
    df = df.sort_values(['C', 'A'])
    df.drop('C', axis=1, inplace=True)
    return (df)
100 loops, best of 3: 4.29 ms per loop
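Note that jez modifies its argument in place, so repeated %timeit runs operate on a frame that is already sorted. The following is a minimal sketch of one way to time both approaches outside IPython on a fresh copy per call, using the standard timeit module. The function names by_cumcount and by_rank are hypothetical stand-ins for jez and boud, and the copy inside each function is included in the measured time for both.

```python
import timeit

import pandas as pd


def by_cumcount(df):
    # Work on a copy so every timed call sees the same unsorted input.
    df = df.copy()
    df['C'] = df.groupby('A').cumcount()
    df = df.sort_values(by=['C', 'A'])
    return df.drop(columns='C')


def by_rank(df):
    # Same idea, but the helper column comes from ranking B within each group.
    df = df.copy()
    df['C'] = df.groupby('A')['B'].rank()
    df = df.sort_values(by=['C', 'A'])
    return df.drop(columns='C')


small = pd.DataFrame({'A': ['r1', 'r1', 'r2', 'r2', 'r3', 'r3'], 'B': range(6)})
large = pd.concat([small] * 1000).reset_index(drop=True)

for name, func in [('cumcount', by_cumcount), ('rank', by_rank)]:
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(large), number=10, repeat=3)) / 10
    print(f'{name}: {best * 1e3:.2f} ms per call')
```

The two functions agree on the small sample, but rank orders rows by the values in B while cumcount orders them by position, so the results can differ when B is not increasing within each group.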