Problem description
A web crawler script spawns at most 500 threads. Each thread requests data from a remote server, and each server's reply differs from the others in content and size.
I'm setting the threads' stack_size to 756K:
threading.stack_size(756*1024)
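For reference, here is a minimal sketch of how that call behaves, assuming CPython on a platform that allows changing the thread stack size. The helper name run_with_stack_size is made up for illustration. Note that threading.stack_size() with no argument only reports the size configured for new threads. As far as I know, the threading module has no call that reports how much stack a running thread has actually used.

```python
import threading

STACK_BYTES = 756 * 1024  # the value from the question


def run_with_stack_size(size=STACK_BYTES):
    """Start one thread under the given stack size setting (hypothetical helper).

    Returns the stack size that was configured while the thread was created.
    """
    # The setting only affects threads created after this call.
    # The call returns the value that was in effect before (0 = platform default).
    previous = threading.stack_size(size)
    try:
        configured = threading.stack_size()  # configured size, not current usage
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    finally:
        # Put the earlier setting back so later threads are unaffected.
        threading.stack_size(previous)
    return configured


if __name__ == "__main__":
    print(run_with_stack_size())
```

The size has to be 0 or at least 32 KiB, and some platforms only accept multiples of the page size, so the call can raise ValueError for other values.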
This lets me run the required number of threads and complete most of the jobs and requests. But some servers' responses are bigger than others, and when a thread gets one of those responses, the script dies with SIGSEGV.
A stack_size larger than 756K makes it impossible to have the required number of threads running at the same time.
Any suggestions on how I can continue with the given stack_size without crashes? And how can I get the stack size currently used by any given thread?
Recommended answer
Why on earth are you spawning 500 threads? That seems like a terrible idea!
Remove threading completely and use an event loop to do the crawling. Your program will be faster, simpler, and easier to maintain.
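To give a rough idea of what "no threads, one event loop" looks like, here is a small sketch using asyncio. asyncio is my substitution, picked only because it ships with Python. fake_fetch is a made-up stand-in that sleeps instead of talking to a real server, so treat this as an illustration of the shape, not a crawler.

```python
import asyncio
import threading


async def fake_fetch(i):
    # Hypothetical stand-in for a network request: awaiting the sleep hands
    # control back to the event loop, the same way waiting on a socket would.
    await asyncio.sleep(0.01)
    return i, threading.get_ident()


async def crawl(n=500):
    # Schedule all n "requests" on the loop and wait for every one to finish.
    return await asyncio.gather(*(fake_fetch(i) for i in range(n)))


def run_demo(n=500):
    results = asyncio.run(crawl(n))
    # Collect the thread id each task ran on; with a single loop there is one.
    thread_ids = {tid for _i, tid in results}
    return len(results), len(thread_ids)


if __name__ == "__main__":
    print(run_demo())
```

All 500 tasks are in flight at once, but they share a single thread and a single stack, so there is no per-thread stack_size to tune.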
Lots of threads waiting for the network won't make your program wait any faster. Instead, collect all open sockets in a list and run a loop that checks whether any of them has data available.
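The "list of sockets plus a loop" idea can be sketched with the standard selectors module. This is only a minimal illustration under some assumptions: read_all and demo are made-up names, and the demo uses socket.socketpair() in place of real server connections so it runs without network access.

```python
import selectors
import socket


def read_all(socks, bufsize=4096):
    """Read every socket in `socks` until EOF, using one thread.

    Returns a dict mapping each socket to the bytes it received.
    """
    sel = selectors.DefaultSelector()
    received = {}
    for s in socks:
        s.setblocking(False)
        received[s] = bytearray()
        sel.register(s, selectors.EVENT_READ)

    pending = len(socks)
    while pending:
        # Block until at least one socket is readable (data or EOF).
        for key, _events in sel.select():
            s = key.fileobj
            try:
                chunk = s.recv(bufsize)
            except BlockingIOError:
                continue  # spurious wakeup, try again on the next pass
            if chunk:
                received[s].extend(chunk)
            else:
                # An empty read means the peer closed the connection.
                sel.unregister(s)
                pending -= 1
    sel.close()
    return {s: bytes(buf) for s, buf in received.items()}


def demo():
    # Three local socket pairs stand in for three servers whose replies
    # differ in size.
    pairs = [socket.socketpair() for _ in range(3)]
    for i, (writer, _reader) in enumerate(pairs):
        writer.sendall(b"x" * (1000 * (i + 1)))
        writer.close()
    readers = [reader for _writer, reader in pairs]
    result = read_all(readers)
    sizes = [len(result[r]) for r in readers]
    for r in readers:
        r.close()
    return sizes


if __name__ == "__main__":
    print(demo())
```

A large reply just means more passes through the loop for that socket. It grows a buffer on the heap and has nothing to do with any thread's stack.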
I recommend Twisted, an event-driven networking engine. It is very flexible, secure, scalable, and very stable (no segfaults).
You could also take a look at Scrapy, a web crawling and screen scraping framework written in Python on top of Twisted. It is still under heavy development, but maybe you can take some ideas from it.