Problem Description
Let's say I have a huge list containing random numbers, for example:
import random

L = [random.randrange(0, 25000000000) for _ in range(1000000000)]
I need to get rid of the duplicates in this list.
I wrote this code for lists containing a smaller number of elements:
def remove_duplicates(list_to_deduplicate):
    seen = set()
    result = []
    for i in list_to_deduplicate:
        if i not in seen:
            result.append(i)
            seen.add(i)
    return result
In the code above I create a set so I can remember which numbers have already appeared in the list I'm working on. If a number is not in the set, I add it to the result list I need to return and save it in the set so it won't be added to the result list again.
Now for 1,000,000 numbers in a list all is good and I get a result fast, but for counts closer to, say, 1,000,000,000, problems arise. I need to use the different cores on my machine to try to break up the problem and then combine the results from multiple processes.
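For illustration, here is a rough sketch of that chunk-and-merge idea (the worker count and the chunking are arbitrary choices for the example, not a tuned implementation): each process deduplicates its own chunk, and the partial results still have to be merged and deduplicated once more, because duplicates can span chunks.

from multiprocessing import Pool


def dedup_chunk(chunk):
    # each worker removes duplicates within its own chunk only
    return list(set(chunk))


def parallel_dedup(data, n_workers=8):
    # split the list into n_workers roughly equal chunks
    chunk_size = -(-len(data) // n_workers)  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(n_workers) as pool:
        partial_results = pool.map(dedup_chunk, chunks)
    # duplicates may still exist across chunks, so merge and deduplicate again
    merged = set()
    for part in partial_results:
        merged.update(part)
    return list(merged)


if __name__ == '__main__':
    print(parallel_dedup([3, 1, 3, 2, 1, 2, 5], n_workers=2))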
My first guess was to make a set accessible to all processes, but many complications arise: how can one process read while another may be adding to the set, and I don't even know if it is possible to share a set between processes. I know we can use a Queue or a Pipe, but I'm not sure how to use them.
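To illustrate what sharing would look like, here is a minimal sketch that uses a multiprocessing.Manager dict as a stand-in for a shared set (the values are made up; every access to the proxy is an inter-process round-trip, which is a big part of why this approach tends to be slow):

from multiprocessing import Manager, Pool


def mark_seen(args):
    value, shared_seen = args
    # the Manager dict acts as a shared "set": keys are the values seen so far
    shared_seen[value] = True


if __name__ == '__main__':
    with Manager() as manager:
        shared_seen = manager.dict()
        data = [3, 1, 3, 2, 1, 2, 5]
        with Pool(2) as pool:
            pool.map(mark_seen, [(v, shared_seen) for v in data])
        print(list(shared_seen.keys()))  # deduplicated values, order not guaranteed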
Can someone give me advice on the best way to solve this problem? I am open to any new idea.
Recommended Answer
I'm skeptical that even your largest list is big enough for multiprocessing to improve timings. Using numpy and multithreading is probably your best chance.
Multiprocessing introduces quite some overhead and increases memory consumption, as @Frank Merrow rightly mentioned earlier. That's not the case (to that extent) for multithreading, though. It's important not to mix these terms up, because processes and threads are not the same: threads within the same process share their memory, distinct processes do not.
The problem with going multi-core in Python is the GIL, which doesn't allow multiple threads (within the same process) to execute Python bytecode in parallel. Some C extensions like numpy can release the GIL, which makes it possible to profit from multi-core parallelism with multithreading. Here's your chance to get some speedup on top of a big improvement just by using numpy.
from multiprocessing.dummy import Pool  # .dummy uses threads
import numpy as np

r = np.random.RandomState(42).randint(0, 25000000000, 100_000_000)
n_threads = 8

result = np.unique(np.concatenate(
    Pool(n_threads).map(np.unique, np.array_split(r, n_threads)))
).tolist()
Use numpy and a thread pool: split up the array, make the sub-arrays unique in separate threads, then concatenate the sub-arrays and make the recombined array unique once more. The final dropping of duplicates for the recombined array is necessary because within the sub-arrays only local duplicates can be identified.
For low-entropy data (many duplicates), using pandas.unique instead of numpy.unique can be much faster. Unlike numpy.unique, it also preserves the order of appearance.
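A small example of that difference (values made up for illustration):

import numpy as np
import pandas as pd

a = np.array([3, 1, 3, 2, 1])
print(np.unique(a))  # [1 2 3] -> sorted, order of appearance lost
print(pd.unique(a))  # [3 1 2] -> order of first appearance preserved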
Note that using a thread pool like above only makes sense if the numpy function is not already multi-threaded under the hood by calling into low-level math libraries. So, always test to see if it actually improves performance and don't take it for granted.
Tested with 100M randomly generated integers in the ranges:
- high entropy: 0 - 25_000_000_000 (199,560 duplicates)
- low entropy: 0 - 1000
import time
import timeit
from multiprocessing.dummy import Pool  # .dummy uses threads

import numpy as np
import pandas as pd


def time_stmt(stmt, title=None):
    t = timeit.repeat(
        stmt=stmt,
        timer=time.perf_counter_ns, repeat=3, number=1, globals=globals()
    )
    print(f"  {title or stmt}")
    print(f"    {min(t) / 1e9:.2f} s")


if __name__ == '__main__':

    n_threads = 8  # machine with 8 cores (4 physical cores)

    stmt_np_unique_pool = """
np.unique(np.concatenate(
    Pool(n_threads).map(np.unique, np.array_split(r, n_threads)))
).tolist()
"""

    stmt_pd_unique_pool = """
pd.unique(np.concatenate(
    Pool(n_threads).map(pd.unique, np.array_split(r, n_threads)))
).tolist()
"""
    # -------------------------------------------------------------------------

    print(f"\nhigh entropy (few duplicates) {'-' * 30}\n")
    r = np.random.RandomState(42).randint(0, 25000000000, 100_000_000)

    r = list(r)
    time_stmt("list(set(r))")

    r = np.asarray(r)
    # numpy.unique
    time_stmt("np.unique(r).tolist()")
    # pandas.unique
    time_stmt("pd.unique(r).tolist()")
    # numpy.unique & Pool
    time_stmt(stmt_np_unique_pool, "numpy.unique() & Pool")
    # pandas.unique & Pool
    time_stmt(stmt_pd_unique_pool, "pandas.unique() & Pool")

    # ---
    print(f"\nlow entropy (many duplicates) {'-' * 30}\n")
    r = np.random.RandomState(42).randint(0, 1000, 100_000_000)

    r = list(r)
    time_stmt("list(set(r))")

    r = np.asarray(r)
    # numpy.unique
    time_stmt("np.unique(r).tolist()")
    # pandas.unique
    time_stmt("pd.unique(r).tolist()")
    # numpy.unique & Pool
    time_stmt(stmt_np_unique_pool, "numpy.unique() & Pool")
    # pandas.unique & Pool
    time_stmt(stmt_pd_unique_pool, "pandas.unique() & Pool")
As you can see in the timings below, just using numpy without multithreading already accounts for the biggest performance improvement. Also note that pandas.unique() is faster than numpy.unique() (only) when there are many duplicates.
high entropy (few duplicates) ------------------------------
  list(set(r))
    32.76 s
  np.unique(r).tolist()
    12.32 s
  pd.unique(r).tolist()
    23.01 s
  numpy.unique() & Pool
    9.75 s
  pandas.unique() & Pool
    28.91 s

low entropy (many duplicates) ------------------------------
  list(set(r))
    5.66 s
  np.unique(r).tolist()
    4.59 s
  pd.unique(r).tolist()
    0.75 s
  numpy.unique() & Pool
    1.17 s
  pandas.unique() & Pool
    0.19 s