Problem Description
So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. It's posted below.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
    name = "sbrforum.com"
    allowed_domains = ["sbrforum.com"]
    start_urls = [
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
        "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
        items = []
        for site in sites:
            item = MlboddsItem()
            item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()  # | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
            item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
            items.append(item)
        return items
The results are returned in a random order, for example it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.
Recommended Answer
start_urls defines the URLs that are used in the start_requests method. Your parse method is called with a response for each start URL once that page has been downloaded. But you cannot control the loading times - the first start URL might be the last to reach parse.
One solution - override the start_requests method and add a meta dict with a priority key to each generated request. In parse, extract this priority value and add it to the item. In the pipeline, do something based on this value. (I don't know why and where you need these URLs to be processed in this order.)
Or make it kind of synchronous - store these start URLs somewhere, and put only the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next URL from your storage and make a request for it with parse as the callback.