Scraping multiple pages within multiple pages (Scrapy)

2024-05-06 • Q&A

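I'm struggling to work out the code structure I need in order to scrape multiple pages within multiple pages. Here is what I mean:

I start at a main page that has a URL for every letter of the alphabet. Each letter is the first letter of a dog breed name.
Each letter has multiple pages of dog breeds. I need to go into each breed's page.
For each breed, there are multiple pages of dogs for sale. I need to extract data from each for-sale listing page.

As I said, I'm struggling to understand what the code structure should look like. Part of the problem is that I don't fully understand how Python code flow works. Is something like this correct?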

def parse
    Get the URL of every alphabet letter
    pass each URL on to parse_A

def parse_A
    Get the URL of every page for that alphabet letter
    pass each URL on to parse_B

def parse_B
    Get the URL of every breed listed on that page of that alphabet letter
    pass each URL on to parse_C

def parse_C
    Get the URL of every page of dogs listed for that specific breed
    pass each URL on to parse_D

def parse_D
    Get the URL of each for-sale listing of that dog breed on that page
    pass each URL on to parse_E

def parse_E
    Get all of the details for that specific listing
    Callback to ??

For the final callback in parse_E, should I direct the callback to parse_D or to the first parse?

Thanks!
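On the control-flow question: in Scrapy a callback never has to "return" to an earlier callback. Each callback is a generator that yields either new requests (each carrying its own callback) or finished items, and the engine keeps pulling from a queue until nothing is left. So the last level just yields its data. The sketch below imitates that loop with the standard library only; the `Request` class, the `SITE` dictionary, and the function names are hypothetical stand-ins, and real Scrapy callbacks receive a `response` object rather than a bare URL.

```python
from collections import deque


class Request:
    """Stand-in for scrapy.Request: a URL plus the function that handles it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


# Hypothetical site map used in place of real HTTP responses.
SITE = {
    "/": ["/a"],
    "/a": ["/a/akita"],
    "/a/akita": ["/a/akita/listing-1", "/a/akita/listing-2"],
}


def parse(url):
    # Level 1: one request per alphabet letter.
    for letter_url in SITE[url]:
        yield Request(letter_url, parse_letter)


def parse_letter(url):
    # Level 2: one request per breed under that letter.
    for breed_url in SITE[url]:
        yield Request(breed_url, parse_breed)


def parse_breed(url):
    # Level 3: one request per for-sale listing of that breed.
    for listing_url in SITE[url]:
        yield Request(listing_url, parse_listing)


def parse_listing(url):
    # Last level: yield the item itself; there is nothing to call back to.
    yield {"listing": url}


def crawl(start_url):
    """Drive the callbacks the way a scheduler would: queue in, items out."""
    queue = deque([Request(start_url, parse)])
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, Request):
                queue.append(result)   # more pages to visit
            else:
                items.append(result)   # finished data
    return items


items = crawl("/")
print(items)
```

The listing-level function is the only one that yields a dictionary, which is why it needs no callback: once every queued request has been handled, the crawl simply ends.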

import scrapy

# The methods below belong inside your scrapy.Spider subclass.

def parse(self, response):
    """Get the URLs of all the alphabet letters (breed_urls)."""
    breed_urls = 'parse the urls'  # placeholder: extract the URLs from response
    for url in breed_urls:
        yield scrapy.Request(url=url, callback=self.parse_sub_urls)

def parse_sub_urls(self, response):
    """Get all the sub-URLs on this page (sub_urls), then follow pagination."""
    sub_urls = 'parse the urls'  # placeholder: extract the URLs from response
    for url in sub_urls:
        yield scrapy.Request(url=url, callback=self.parse_details)

    next_page = 'parse the page url'  # placeholder: link to the next page, if any
    if next_page:
        yield scrapy.Request(url=next_page, callback=self.parse_sub_urls)

def parse_details(self, response):
    """Get the final details from the listing page."""
    details = {}
    name = 'parse the urls'  # placeholder: extract the name from response
    details['name'] = name
    # parse all other details and add them to the dictionary
    yield details
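One practical detail for the `next_page` step: pagination links are often relative (for example `?page=2`), so they need to be resolved against the current page URL before being requested. Inside a spider, `response.urljoin(href)` or `response.follow(href, callback=...)` does this for you. The sketch below shows the same resolution with the standard library; `resolve_next_page` and the example.com URLs are hypothetical names used only for illustration.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, next_href):
    """Return the absolute URL of the next page, or None when there is none."""
    if not next_href:
        return None  # last page: no "next" link was found
    return urljoin(page_url, next_href)


current = "https://example.com/breeds/a?page=1"

next_query = resolve_next_page(current, "?page=2")     # query-only link
next_path = resolve_next_page(current, "/breeds/b")    # root-relative link
no_next = resolve_next_page(current, None)             # no link on the page

print(next_query)
print(next_path)
print(no_next)
```

Returning `None` on the last page fits the `if next_page:` check in the answer, so the pagination chain stops on its own.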

di880518 answered: Scraping multiple pages within multiple pages (Scrapy)