Scraping multiple pages with Scrapy

I have a function that can scrape a single page. How do I scrape multiple pages after following the corresponding links? Do I need a separate function that calls parse(), like gotoIndivPage() below? Thanks!

import scrapy

class trainingScraper(scrapy.Spider):
    name = "..."
    start_urls = ["url with links to multiple pages"]

    # for scraping an individual page
    def parse(self, response):
        SELECTOR1 = '.entry-title ::text'
        SELECTOR2 = '//li[@class="location"]/ul/li/a/text()'
        yield {
            'title': response.css(SELECTOR1).extract_first(),
            'date': response.xpath(SELECTOR2).extract_first(),
        }

    def gotoIndivPage(self, response):
        PAGE_SELECTOR = '//h3[@class="entry-title"]/a/@href'
        for page in response.xpath(PAGE_SELECTOR).extract():
            if page:
                yield scrapy.Request(
                    response.urljoin(page),
                    callback=self.parse
                )
xy13659822798's answer: Scraping multiple pages with Scrapy

I usually create a new function for each different type of HTML structure I want to scrape. So if your links send you to pages whose HTML structure differs from the starting page, I would create a new function and pass it as the callback of my Request.

    def parseNextPage(self, response):
        # parse the linked page with its own selectors here
        pass

    def parse(self, response):
        SELECTOR1 = '.entry-title ::text'
        SELECTOR2 = '//li[@class="example"]/ul/li/a/text()'

        yield {
            'title': response.css(SELECTOR1).extract_first(),
            'date': response.xpath(SELECTOR2).extract_first(),
        }

        # extract the link to the next page as a string
        href = response.xpath('//li[@class="location"]/ul/li/a/@href').extract_first()

        yield scrapy.Request(
            url=response.urljoin(href),  # resolve relative links
            callback=self.parseNextPage
        )
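
Putting the two pieces together, a minimal sketch of a complete spider might look like the one below. The spider name, start URL, and selectors are placeholders taken from the question, not a tested configuration; the key point is that parse(), as the default callback for start_urls, extracts the listing links and yields a Request for each one, while a second callback scrapes each individual page.

import scrapy


class TrainingSpider(scrapy.Spider):
    name = "training"
    # placeholder: the listing page that links to the individual pages
    start_urls = ["https://example.com/listing"]

    def parse(self, response):
        # parse() is the default callback for start_urls, so it collects
        # the links and hands each one off to the detail-page callback
        for href in response.xpath('//h3[@class="entry-title"]/a/@href').extract():
            yield scrapy.Request(
                response.urljoin(href),      # resolve relative links
                callback=self.parseIndivPage,
            )

    def parseIndivPage(self, response):
        # scrape the fields of one individual page
        yield {
            'title': response.css('.entry-title ::text').extract_first(),
            'date': response.xpath('//li[@class="location"]/ul/li/a/text()').extract_first(),
        }

Running it with, for example, scrapy crawl training -o items.json collects the items yielded from every followed page into a single output feed.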