Scrapy gives me an incomplete link, and I need to parse the internal page

So, technically speaking, Scrapy gives me the correct information when I tell it to scrape:

link = row.xpath('.//p/a/@href').extract_first()

The problem is that I'm getting "/biz/polkadog-bakery-boston?osq=Dog", as shown in the HTML code (see picture 1), but what I want is the full URL (picture 2), which only appears when I hover over the link:

https://www.yelp.com/biz/polkadog-bakery-boston?osq=Dog

I want this so that I can parse information from the internal page.

I've tried searching for something like this, but had no luck.

If I haven't been clear enough, please let me know and I'll provide more details.

Thanks

Here is the complete spider:

from scrapy import Spider
from yelp.items import YelpItem
import scrapy
import re 


class YelpSpider(Spider):
    name = "yelp"
    allowed_domains = ['www.yelp.com']
    # Defining the list of pages to scrape
    start_urls = ["https://www.yelp.com/search?find_desc=Dog&find_loc=Boston%2C%20MA&start=" + str(10 * i) for i in range(0,1)] 



    def parse(self, response):
        # Defining rows to be scraped
        rows = response.xpath('//*[@id="wrap"]/div[3]/div[2]/div[2]/div/div[1]/div[1]/div/ul/li')
        for row in rows:
            # Scraping the business' name
            name = row.xpath('.//p/a/text()').extract_first()

            # Scraping the phone number
            phone = row.xpath('.//div[1]/p[1][@class= "lemon--p__373c0__3Qnnj text__373c0__2pB8f text-color--normal__373c0__K_MKN text-align--right__373c0__3ARv7"]/text()').extract_first()

            # Scraping the area
            area = row.xpath('.//p/span[@class = "lemon--span__373c0__3997G"]/text()').extract_first()

            # Scraping the services they offer
            services = row.xpath('.//a[@class="lemon--a__373c0__IEZFH link__373c0__29943 link-color--inherit__373c0__15ymx link-size--default__373c0__1skgq"]/text()').extract_first()

            # Extracting the internal link
            link = row.xpath('.//p/a/@href').extract_first()

            item = YelpItem()
            item['name'] = name
            item['phone'] = phone
            item['area'] = area
            item['services'] = services
            item['link'] = link

            yield item

    def parse_detail(self, response):
        item = response.meta['item']
        address = response.xpath('.//*[@id="wrap"]/div[2]/div/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[1]/div/strong/address/text()[1]').extract_first()

        item['address'] = address

        yield item
Answer:

You need to use response.urljoin():

link = row.xpath('.//p/a/@href').extract_first()
link = response.urljoin(link)
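response.urljoin() resolves a relative href against the URL of the page being parsed, following the same rules as the standard library's urllib.parse.urljoin. A minimal sketch of what it produces (the base URL here is an assumption taken from the spider's start_urls):

```python
from urllib.parse import urljoin

# The page the spider was parsing (assumed from the spider's start_urls)
base = "https://www.yelp.com/search?find_desc=Dog&find_loc=Boston%2C%20MA&start=0"

# The relative href extracted by row.xpath('.//p/a/@href')
href = "/biz/polkadog-bakery-boston?osq=Dog"

# Inside a spider, response.urljoin(href) behaves like urljoin(response.url, href)
link = urljoin(base, href)
print(link)  # https://www.yelp.com/biz/polkadog-bakery-boston?osq=Dog
```

To actually parse the internal page, the spider would then yield a request for the absolute link with the item attached, e.g. `yield scrapy.Request(link, callback=self.parse_detail, meta={'item': item})` instead of `yield item` in `parse` — that is what `parse_detail` expects when it reads `response.meta['item']`.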
