Problem description
I have a (very simplified here) pandas dataframe which looks like this:
df
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
2 2012-11-21 17:00:08 u3 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
What I would like to do now is get all the duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
The third row is excluded: its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.
I tried passing the datetime and msg columns as the subset argument of the duplicated() method, but it returns an empty dataframe because the timestamps are not identical:
mask = df.duplicated(subset=['datetime', 'msg'], keep=False)
print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []
Is there a way to define a range for my "datetime" parameter? To illustrate, something like:
mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)
Any help here would, as always, be very much appreciated.
Recommended answer
This piece of code gives the expected output:
df[df.groupby("msg")["datetime"].diff().dt.total_seconds().fillna(0) <= 3]
I grouped the dataframe on the "msg" column, selected the "datetime" column, and applied the built-in diff function, which computes the difference between each value and the previous one within the same group. The first row of each group has no previous value, so its difference is missing (NaT); I filled those with zero and then selected only the rows whose difference is at most 3 seconds.
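The steps described above can be sketched one at a time. This is only an illustration on the sample data from the question; the intermediate names gap, gap_seconds and mask are my own, and it assumes the datetime column has already been parsed with pd.to_datetime. It uses total_seconds() rather than the seconds attribute so that gaps longer than a day are compared by their full length.

```python
import pandas as pd

# Sample data from the question, with datetime parsed as real timestamps.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world"] * 3 + ["hello you"] * 2,
})

# Step 1: gap to the previous row with the same msg
# (NaT for the first row of each group).
gap = df.groupby("msg")["datetime"].diff()

# Step 2: express the gap in seconds (NaT becomes NaN).
gap_seconds = gap.dt.total_seconds()

# Step 3: treat a missing gap as 0 s and keep rows whose gap is at most 3 s.
mask = gap_seconds.fillna(0) <= 3
result = df[mask]
print(result)
```

Row 2 is dropped because its gap to the previous "hello world" message is roughly ten days, while rows 0 and 3 are kept only because their missing gap was filled with zero.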
Before using the code above, make sure your dataframe is sorted by datetime in ascending order.
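Two things about the one-liner may be worth guarding against: it depends on the row order, and it always keeps the first row of each msg group, because that row's missing difference is filled with zero even when the message has no near duplicate at all. Below is a minimal sketch that sorts explicitly and compares each row with both the previous and the next occurrence of the same msg. The shuffled row order, the extra u6 row, and the names window, gap_prev and gap_next are illustrative assumptions, not part of the original question.

```python
import pandas as pd

# Question data in shuffled order, plus one hypothetical message (u6)
# that has no duplicate at all.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-22 18:08:37",
        "2012-11-11 15:41:08",
        "2012-11-21 17:00:08",
        "2012-11-11 15:41:11",
        "2012-11-22 18:08:35",
        "2012-11-23 09:00:00",
    ]),
    "user": ["u5", "u1", "u3", "u2", "u4", "u6"],
    "type": ["txt"] * 6,
    "msg": ["hello you", "hello world", "hello world",
            "hello world", "hello you", "hello there"],
})

window = pd.Timedelta(seconds=3)  # assumed tolerance, taken from the question

# Sort first so that "previous" and "next" mean neighbours in time.
df_sorted = df.sort_values("datetime")
by_msg = df_sorted.groupby("msg")["datetime"]

# Time since the previous row with the same msg (NaT if there is none).
gap_prev = by_msg.diff()
# Time until the next row with the same msg (NaT if there is none).
gap_next = by_msg.shift(-1) - df_sorted["datetime"]

# Comparisons against NaT are False, so a row with no neighbour on one
# side simply fails that half of the test.
mask = (gap_prev <= window) | (gap_next <= window)
near_duplicates = df_sorted[mask]
print(near_duplicates)
```

With this mask a row is kept only if some other row with the same msg lies within the window on either side, so the lone u6 message and the late u3 message are both dropped.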