Problem description
Sample data
{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
.
.
I have some code like this:
import json
import sys

import pandas as pd

df = pd.DataFrame()
for line in sys.stdin:
    data = json.loads(line)
    # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
    df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
    df1['time'] = pd.to_datetime(df1['time'])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df1])
# df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()
print(df)
# Count the stored rows that match the last transaction's merchant and amount
print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))
Current output
2019-02-13T10:00:00.000Z merchantA 20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z merchantB 90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z merchantC 90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z merchantD 90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z merchantE 90 2019-02-13 11:01:30
2019-02-13T11:02:30.000Z merchantE 90 2019-02-13 11:02:30
2
Expected output
2019-02-13T10:00:00.000Z merchantA 20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z merchantB 90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z merchantC 90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z merchantD 90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z merchantE 90 2019-02-13 11:01:30
The data is streaming in. I want to check whether a duplicate record (one whose merchant and amount values are the same as an earlier record's) arrives within two minutes, so that I can discard it without processing it and print it as a duplicate.
Do I have to do something with index zipping or groupby? If so, how do I test equality on multiple columns? Or is there some rolling condition on two columns? I can't find any way to do it.
What am I missing here?
Thanks
Edit
# dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
# Rows seen so far with the same merchant and amount as the incoming transaction
res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])].copy()
# res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
# data['transaction']['time'] is a string, so use the parsed timestamp from df1
res['timediff'] = df1['time'].iloc[0] - res['time']
if len(res.index) > 1:
    print(res)
So I'm trying something like this, and if the result is less than 120 seconds I can process it. But the resulting df is currently in this form:
merchant amount time concat timediff
2019-02-13 11:03:00 merchantF 10 2019-02-13 11:03:00 merchantF10 -1 days +23:59:20
2019-02-13 11:02:20 merchantF 10 2019-02-13 11:02:20 merchantF10 00:00:00
2019-02-13 11:01:30 merchantE 10 2019-02-13 11:01:30 merchantE10 00:01:00
2019-02-13 11:02:00 merchantE 10 2019-02-13 11:02:00 merchantE10 00:00:30
2019-02-13 11:02:30 merchantE 10 2019-02-13 11:02:30 merchantE10 00:00:00
I think the "-1 days +23:59:20" format can be dealt with by taking the absolute value?
How can I convert the time difference into a format that I can compare with 120 seconds? pd.to_timedelta() didn't work for me, or maybe I'm using it wrong.
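A minimal sketch of one way to make that comparison, assuming the differences are held in a timedelta Series. The first four values mirror the timediff column above (-40 seconds is what pandas prints as "-1 days +23:59:20"); the 150-second value and the variable names are added here purely for illustration.

```python
import pandas as pd

# Differences in seconds; -40 s is displayed by pandas as "-1 days +23:59:20".
# The last value (150 s) is an illustrative addition that falls outside the window.
timediff = pd.Series(pd.to_timedelta([-40, 0, 60, 30, 150], unit="s"))

# Option 1: convert to absolute seconds and compare with a plain number
seconds = timediff.dt.total_seconds().abs()
within_window = seconds <= 120

# Option 2: compare the timedeltas directly against a pd.Timedelta
within_window_td = timediff.abs() <= pd.Timedelta(seconds=120)

print(seconds.tolist())
print(within_window.tolist())
```

Both options give the same boolean mask, so the choice is mostly a matter of whether plain numbers or timedeltas are more convenient downstream.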
Recommended answer
So I made it work, but not with rolling windows, because rolling does not support string-typed columns. The feature has been reported and requested on the pandas repo as well.
My solution snippet for the problem:
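# Inside the `for line in sys.stdin:` loop, after df1 has been built
now = df1['time'].iloc[0]  # parsed timestamp of the incoming transaction
if len(df.index) > 0:
    # Earlier rows with the same merchant and amount
    res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
    # True where an earlier match lies within 120 seconds of the incoming row
    is_dup = (now - res['time']).dt.total_seconds().abs() <= 120
    if is_dup.any():
        continue  # duplicate: skip it
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, df1])

# After the loop
print(df)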
Sample data:
{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}
Output:
merchant amount time
2019-02-13 10:00:00 merchantA 20 2019-02-13 10:00:00
2019-02-13 11:00:01 merchantB 90 2019-02-13 11:00:01
2019-02-13 11:00:10 merchantC 10 2019-02-13 11:00:10
2019-02-13 11:00:20 merchantD 10 2019-02-13 11:00:20
2019-02-13 11:01:30 merchantE 10 2019-02-13 11:01:30
2019-02-13 11:03:00 merchantF 10 2019-02-13 11:03:00
2019-02-13 11:05:20 merchantF 10 2019-02-13 11:05:20
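For reference, the same filtering rule can be sketched as a self-contained function. This is not the answer's exact code: it swaps the growing DataFrame for a plain list of accepted rows and builds the DataFrame once at the end, and the names `drop_recent_duplicates`, `WINDOW_SECONDS`, `rows`, and `sample` are made up for this example. Like the snippet above, it compares each incoming record only against records that were kept.

```python
import json

import pandas as pd

WINDOW_SECONDS = 120  # the two-minute window from the question


def drop_recent_duplicates(lines):
    """Keep a transaction unless a kept one with the same merchant and
    amount lies within WINDOW_SECONDS of it (before or after)."""
    kept = []  # accepted rows, as plain dicts
    for line in lines:
        tx = json.loads(line)["transaction"]
        when = pd.to_datetime(tx["time"])
        is_dup = any(
            row["merchant"] == tx["merchant"]
            and row["amount"] == tx["amount"]
            and abs((when - row["time"]).total_seconds()) <= WINDOW_SECONDS
            for row in kept
        )
        if is_dup:
            print("duplicate:", line.strip())
            continue
        kept.append({"merchant": tx["merchant"], "amount": tx["amount"], "time": when})
    return pd.DataFrame(kept)


# The answer's sample data, in arrival order
rows = [
    ("merchantA", 20, "2019-02-13T10:00:00.000Z"),
    ("merchantB", 90, "2019-02-13T11:00:01.000Z"),
    ("merchantC", 10, "2019-02-13T11:00:10.000Z"),
    ("merchantD", 10, "2019-02-13T11:00:20.000Z"),
    ("merchantE", 10, "2019-02-13T11:01:30.000Z"),
    ("merchantF", 10, "2019-02-13T11:03:00.000Z"),
    ("merchantE", 10, "2019-02-13T11:02:00.000Z"),
    ("merchantF", 10, "2019-02-13T11:02:20.000Z"),
    ("merchantE", 10, "2019-02-13T11:02:30.000Z"),
    ("merchantF", 10, "2019-02-13T11:05:20.000Z"),
    ("merchantE", 10, "2019-02-13T11:00:30.000Z"),
]
sample = [
    json.dumps({"transaction": {"merchant": m, "amount": a, "time": t}})
    for m, a, t in rows
]

result = drop_recent_duplicates(sample)
print(result)
```

On this sample the function keeps the same seven rows as the output above. Scanning every kept row for each new record is linear per record; for a long-running stream, rows older than the window could be pruned, which this sketch does not attempt.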