Getting the price from an XHR request and combining it with Scrapy

2024-05-21 • Q&A

I need to scrape data (name, price, description, brand...) from this site: https://www.asos.com/women/new-in/new-in-clothing/cat/?cid=2623&nlid=ww%7Cnew+in%7Cnew+products%7Cclothing

My code looks like this:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TestcrawlSpider(CrawlSpider):
    name = 'testcrawl'
    allowed_domains = ['www.asos.com']
    start_urls = ['https://www.asos.com/women/new-in/new-in-clothing/cat/?cid=2623&nlid=ww|new+in|new+products|clothing']

    # rules and parse_item must be indented inside the class to take effect.
    rules = (
        # Follow each product link and scrape the page with parse_item.
        Rule(LinkExtractor(restrict_xpaths="//article[@class='_2qG85dG']/a"), callback='parse_item', follow=True),
        # Follow links matched by the second XPath without a callback.
        Rule(LinkExtractor(restrict_xpaths="//a[@class='_39_qNys']")),
    )

    def remove_characters(self, value):
        return value.strip('\n')

    def parse_item(self, response):
        yield {
            'name': response.xpath("//div[@class='product-hero']/h1/text()").get(),
            # The price is rendered by JavaScript, so this XPath does not return it.
            'price': response.xpath("//span[@data-id='current-price']").get(),
            'description': response.xpath("//div[@class='product-description']/ul/li/text()").getall(),
            'about_me': response.xpath("//div[@class='about-me']//text()").getall(),
            'brand_description': response.xpath("//div[@class='brand-description']/p/text()").getall(),
        }

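However, because the price is rendered by JavaScript, I can't get it this way. I need to fetch it through the XHR request. My code for getting the price of a single item from the list is: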

import scrapy
import json


class Asosspider(scrapy.Spider):
    name = 'asos'
    allowed_domains = ['www.asos.com']
    start_urls = ['https://www.asos.com/api/product/catalogue/v3/stockprice?productIds=200369183&store=ROW&currency=GBP&keyStoreDataversion=hnm9sjt-28']

                   
    def parse(self, response):
        # The endpoint returns a JSON list; take the first (only) product entry.
        resp = json.loads(response.text)[0]
        price = resp.get('productPrice').get('current').get('text')
        print(price)
        yield {
            'price': price
        }

Here, my start_urls holds the XHR request URL, and that URL is different for every item.

Item 1: https://www.asos.com/api/product/catalogue/v3/stockprice?productIds=23443988&store=ROW&currency=GBP&keyStoreDataversion=hnm9sjt-28

Item 2: https://www.asos.com/api/product/catalogue/v3/stockprice?productIds=22495685&store=ROW&currency=GBP&keyStoreDataversion=hnm9sjt-28

Only the productIds parameter changes!
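Since only productIds differs between the two URLs, the request URL can be built from a product ID. The following is a minimal sketch using only the standard library. The helper name build_price_url is hypothetical, and the default values for store, currency and keyStoreDataversion are copied from the URLs above on the assumption that they stay the same for other items, which the site does not guarantee.

```python
from urllib.parse import urlencode

# Endpoint taken from the request URLs quoted in the question.
PRICE_API = "https://www.asos.com/api/product/catalogue/v3/stockprice"


def build_price_url(product_id, store="ROW", currency="GBP",
                    key_store_dataversion="hnm9sjt-28"):
    """Return the stock/price URL for one product ID (hypothetical helper).

    The defaults mirror the question's URLs and are assumptions; the
    keyStoreDataversion value in particular may change over time.
    """
    query = urlencode({
        "productIds": product_id,
        "store": store,
        "currency": currency,
        "keyStoreDataversion": key_store_dataversion,
    })
    return f"{PRICE_API}?{query}"


if __name__ == "__main__":
    # Reproduces the "Item 1" URL from the question.
    print(build_price_url(23443988))
```

In a spider, the resulting URL would be passed to scrapy.Request with a callback that reads the JSON response.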

Do I need to merge the second snippet into the first one to get the price? If so, how?

Thanks!

pix

import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import AsosItem


class TestcrawlSpider(CrawlSpider):
    name = 'testcrawl'
    allowed_domains = ['www.asos.com']
    start_urls = ['https://www.asos.com/women/new-in/new-in-clothing/cat/?cid=2623&nlid=ww|new+in|new+products|clothing']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//article[@class='_2qG85dG']/a"), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths="//a[@class='_39_qNys']")),
    )

    def remove_characters(self, value):
        return value.strip('\n')

    def parse_item(self, response):
        # The product page embeds the relative URL of the price API in a script.
        price_url = 'https://www.asos.com' + re.search(r'window.asos.pdp.config.stockPriceApiUrl = \'(.+)\'', response.text).group(1)

        item = AsosItem()
        item['name'] = response.xpath("//div[@class='product-hero']/h1/text()").get()
        item['description'] = response.xpath("//div[@class='product-description']/ul/li/text()").getall()
        item['about_me'] = response.xpath("//div[@class='about-me']//text()").getall()
        item['brand_description'] = response.xpath("//div[@class='brand-description']/p/text()").getall()

        # Carry the partly filled item over to the price request via meta.
        request = scrapy.Request(url=price_url, callback=self.parse_price)
        request.meta['item'] = item
        return request

    def parse_price(self, response):
        # response.json() requires Scrapy 2.2 or later.
        jsonresponse = response.json()[0]
        price = jsonresponse['productPrice']['current']['text']

        item = response.meta['item']
        item['price'] = price
        return item
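The two parsing steps in this answer (finding the price API URL in the page source, then reading the price from the JSON) can be tried without running a crawl. The sketch below uses only the standard library. The sample HTML and JSON strings are made up for illustration and only mimic the shapes described above, and the function names are hypothetical. The regular expression is a slightly stricter variant of the one in the answer (escaped dots, non-greedy match), and the function returns None when the pattern is not found.

```python
import json
import re

# Stricter variant of the answer's pattern: dots escaped, non-greedy capture.
STOCK_PRICE_RE = re.compile(
    r"window\.asos\.pdp\.config\.stockPriceApiUrl = '(.+?)'"
)


def extract_price_url(html, base="https://www.asos.com"):
    """Return the absolute price API URL found in the page, or None."""
    match = STOCK_PRICE_RE.search(html)
    if match is None:
        return None
    return base + match.group(1)


def parse_price(json_text):
    """Read the current price text from the first product in the JSON list."""
    data = json.loads(json_text)
    return data[0]["productPrice"]["current"]["text"]


# Made-up samples that only imitate the real page and API response.
sample_html = (
    "<script>window.asos.pdp.config.stockPriceApiUrl = "
    "'/api/product/catalogue/v3/stockprice?productIds=200369183&store=ROW';"
    "</script>"
)
sample_json = '[{"productPrice": {"current": {"text": "25.00"}}}]'

if __name__ == "__main__":
    print(extract_price_url(sample_html))
    print(parse_price(sample_json))
```

Checking for None before building the request avoids an AttributeError on product pages where the script variable is missing.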


Answered by qiao799: Getting the price from an XHR request and combining it with Scrapy
