Problem description
I'm checking for similar values (fuzzy matches) in 4 columns of the same dataframe, using the code below as an example. When I apply it to the real dataset of 40,000 rows x 4 columns, it seems to run forever. The problem is that the code is too slow: if I limit the dataset to 10 users it takes 8 minutes to compute, and for 20 users it takes 19 minutes. Is there anything I'm missing? I don't know why it takes this long. I expect to get all results in 2 hours at most. Any hint or help would be greatly appreciated.
from fuzzywuzzy import process
dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
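result = []
for ratio in Ratios:
    # Matches come back sorted by score, best first
    for match in ratio:
        # Skip exact matches and keep the best non-identical one
        if match[1] != 100:
            result.append(match)
            break
print(result)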
Output: [('asple', 80), ('tab', 80)]
Recommended answer
Major speed improvements come from writing vectorized operations and avoiding loops.
Import the necessary packages
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
Create a dataframe from the first list
dataframecolumn = pd.DataFrame(["apple","tb"])
dataframecolumn.columns = ['Match']
Create a dataframe from the second list
compare = pd.DataFrame(["adfad","apple","asple","tab"])
compare.columns = ['compare']
Merge: build the Cartesian product by introducing a shared key (a cross join)
dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]
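On pandas 1.2 or newer, the same Cartesian product can be written without the helper Key column by passing how="cross" to merge. A minimal sketch on the same toy data; the variable names left, right, and pairs are introduced here for illustration only and are not part of the answer:

```python
import pandas as pd

# Same toy data as above, built with named columns directly.
left = pd.DataFrame({"Match": ["apple", "tb"]})
right = pd.DataFrame({"compare": ["adfad", "apple", "asple", "tab"]})

# how="cross" (pandas >= 1.2) pairs every row of `left` with every row of `right`.
pairs = left.merge(right, how="cross")

# Drop exact matches, as in the step above.
pairs = pairs[pairs["Match"] != pairs["compare"]].reset_index(drop=True)

print(pairs)
```

This avoids carrying the constant Key column through to the final result.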
Vectorization
def partial_match(x, y):
    # fuzz.ratio returns a similarity score from 0 to 100
    return fuzz.ratio(x, y)
partial_match_vector = np.vectorize(partial_match)
Apply the vectorized function and set a threshold to get the desired result
combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score>=80]
Result
+-------+-----+---------+-------+
| Match | Key | compare | score |
+-------+-----+---------+-------+
| apple |   1 | asple   |    80 |
| tb    |   1 | tab     |    80 |
+-------+-----+---------+-------+
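One caveat when scaling this up: the cross join materializes every pair, so comparing 40,000 values against 40,000 values would produce 1.6 billion rows, which is unlikely to fit in memory. A rough sketch of one way around that is to cross join one chunk of rows at a time and keep only the pairs that pass the threshold. To stay runnable without extra packages, it uses difflib.SequenceMatcher from the standard library as a stand-in scorer (an assumption, not what the answer uses); swapping in fuzz.ratio should be a one-line change. The names similarity, chunked_matches, and chunk_size are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (assumption: swap in fuzz.ratio)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def chunked_matches(left, right, threshold=80, chunk_size=1000):
    """Cross join `left` against `right` one chunk of rows at a time.

    Only chunk_size * len(right) pairs exist in memory at once; exact
    matches and pairs scoring below `threshold` are dropped immediately.
    """
    kept = []
    for start in range(0, len(left), chunk_size):
        chunk = left.iloc[start:start + chunk_size]
        pairs = chunk.merge(right, how="cross")  # needs pandas >= 1.2
        pairs = pairs[pairs["Match"] != pairs["compare"]]
        pairs = pairs.assign(
            score=[similarity(a, b) for a, b in zip(pairs["Match"], pairs["compare"])]
        )
        kept.append(pairs[pairs["score"] >= threshold])
    return pd.concat(kept, ignore_index=True)


left = pd.DataFrame({"Match": ["apple", "tb"]})
right = pd.DataFrame({"compare": ["adfad", "apple", "asple", "tab"]})
result = chunked_matches(left, right, chunk_size=1)
print(result)
```

The scores are computed with a plain list comprehension rather than np.vectorize. NumPy's documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so the two should perform comparably; the total number of comparisons is still what dominates the running time.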