Problem description
I have a pandas DataFrame that records online activity, measured in "clicks", during user sessions. There are as many as 50,000 unique users, and the DataFrame has around 1.5 million rows. Most users have multiple records.
The four columns are a unique user ID, the date the user began the service ("Registration"), the date the user used the service ("Session"), and the total number of clicks.
The DataFrame is organized as follows:
User_ID Registration Session clicks
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2014-01-22 7
9874452 2010-12-22 2014-08-22 2
...
(There is also a default index starting at 0, but one could set User_ID as the index.)
I would like to aggregate the total number of clicks per user since the Registration date. The result (a DataFrame or a pandas Series) would list User_ID and "Total_Number_Clicks".
User_ID Total_Clicks
2349876 722
1987293 341
2234214 220
9874452 1405
...
How does one do this in pandas? Is this done with .agg()? Each User_ID needs to be summed individually.
As there are 1.5 million records, does this scale?
Recommended answer
IIUC, you can use groupby, sum and reset_index:
print(df)
User_ID Registration Session clicks
0 2349876 2012-02-22 2014-04-24 2
1 1987293 2011-02-01 2013-05-03 1
2 2234214 2012-07-22 2014-01-22 7
3 9874452 2010-12-22 2014-08-22 2
print(df.groupby('User_ID')['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
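For anyone who wants to reproduce the output above on Python 3 with a recent pandas, here is a minimal self-contained sketch. The sample rows are copied from the question; the variable name `totals` and the renamed column `Total_Clicks` are only illustrative choices, not something the original answer used.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": pd.to_datetime(
        ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"]),
    "Session": pd.to_datetime(
        ["2014-04-24", "2013-05-03", "2014-01-22", "2014-08-22"]),
    "clicks": [2, 1, 7, 2],
})

# Sum clicks per user; reset_index turns the grouped Series back into
# a DataFrame, and name= labels the summed column (illustrative name).
totals = (
    df.groupby("User_ID")["clicks"]
      .sum()
      .reset_index(name="Total_Clicks")
)
print(totals)
```

Note that groupby sorts the group keys by default, which is why the users come out in ascending User_ID order rather than in their original row order.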
If the first column, User_ID, is the index:
print(df)
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2014-01-22 7
9874452 2010-12-22 2014-08-22 2
print(df.groupby(level=0)['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
Or:
print(df.groupby(df.index)['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
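On the .agg() part of the question: as far as I know, calling .agg('sum') on the grouped column gives the same result as calling .sum() directly, so either spelling should work. The sketch below assumes User_ID is the index. The second row for user 2349876 is made up here purely to show that repeated users are summed, and the names `by_sum` and `by_agg` are illustrative.

```python
import pandas as pd

# Illustrative data: user 2349876 appears twice. The second row is
# invented for this example and is not part of the original sample.
df = pd.DataFrame({
    "User_ID": [2349876, 2349876, 1987293],
    "clicks": [2, 5, 1],
}).set_index("User_ID")

# Group on the index level by name, then sum the clicks per user.
by_sum = df.groupby(level="User_ID")["clicks"].sum()

# The same aggregation spelled with .agg().
by_agg = df.groupby(level="User_ID")["clicks"].agg("sum")

print(by_sum)
print(by_sum.equals(by_agg))
```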
As Alexander pointed out, you need to filter the data before the groupby if any Session date is earlier than the Registration date for that User_ID:
print(df)
User_ID Registration Session clicks
0 2349876 2012-02-22 2014-04-24 2
1 1987293 2011-02-01 2013-05-03 1
2 2234214 2012-07-22 2014-01-22 7
3 9874452 2010-12-22 2014-08-22 2
print(df[df.Session >= df.Registration].groupby('User_ID')['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2234214 7
2 2349876 2
3 9874452 2
I changed the third row of the data to give a better sample:
print(df)
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
2234214 2012-07-22 2012-01-22 7
9874452 2010-12-22 2014-08-22 2
print(df.Session >= df.Registration)
User_ID
2349876 True
1987293 True
2234214 False
9874452 True
dtype: bool
print(df[df.Session >= df.Registration])
Registration Session clicks
User_ID
2349876 2012-02-22 2014-04-24 2
1987293 2011-02-01 2013-05-03 1
9874452 2010-12-22 2014-08-22 2
df1 = df[df.Session >= df.Registration]
print(df1.groupby(df1.index)['clicks'].sum().reset_index())
User_ID clicks
0 1987293 1
1 2349876 2
2 9874452 2
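The filtering step above can be sketched end to end as follows. This assumes the date columns may have been loaded as plain strings (for example from a CSV), so they are converted with pd.to_datetime before being compared. The variable names `valid` and `totals` are illustrative, and the data is the modified sample in which the third row has a Session date earlier than its Registration date.

```python
import pandas as pd

# Modified sample from the answer: the third row's Session date
# falls before its Registration date.
df = pd.DataFrame({
    "User_ID": [2349876, 1987293, 2234214, 9874452],
    "Registration": ["2012-02-22", "2011-02-01", "2012-07-22", "2010-12-22"],
    "Session": ["2014-04-24", "2013-05-03", "2012-01-22", "2014-08-22"],
    "clicks": [2, 1, 7, 2],
})

# Assumption: the dates arrive as strings, so parse them first to get
# a real date comparison rather than a string comparison.
for col in ["Registration", "Session"]:
    df[col] = pd.to_datetime(df[col])

# Keep only sessions on or after the registration date.
valid = df[df["Session"] >= df["Registration"]]

# as_index=False keeps User_ID as a regular column in the result.
totals = valid.groupby("User_ID", as_index=False)["clicks"].sum()
print(totals)
```

User 2234214 drops out of the result entirely because its only row fails the date check.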