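I am scraping Google Scholar author profile pages. I ran into a problem when scraping each author's titles: each author has more than 500 titles, and they are displayed through a "load more" button. I have the link for the load-more pagination.
The problem is that I want to count the total number of titles an author has, but I am not getting the correct total. When I scrape only 2 authors, it returns the correct values, but when I scrape all the authors on a page (10 authors per page), the totals are wrong.
My code is below. Where is my logic wrong?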
import logging

import scrapy


def parse(self, response):
    # Loop over every author listed on the page.
    for author_sel in response.xpath('.//div[@class="gsc_1usr"]'):
        link = author_sel.xpath(".//h3[@class='gs_ai_name']/a/@href").extract_first()
        url = response.urljoin(link)
        yield scrapy.Request(url, callback=self.parse_url_to_crawl)


def parse_url_to_crawl(self, response):
    url = response.url
    yield scrapy.Request(url + '&cstart=0&pagesize=100', callback=self.parse_profile_content)


def parse_profile_content(self, response):
    url = response.url
    idx = url.find("user")
    _id = url[idx+5:idx+17]
    name = response.xpath("//div[@id='gsc_prf_in']/text()").extract()[0]
    # Extract the titles listed on this page.
    tmp = response.xpath('//tbody[@id="gsc_a_b"]/tr[@class="gsc_a_tr"]/td[@class="gsc_a_t"]/a/text()').extract()
    item = GooglescholarItem()
    n = len(tmp)
    titles = []
    if tmp:
        # Read the current cstart offset out of the URL, digit by digit.
        offset = 0
        d = 0
        idx = url.find('cstart=')
        idx += 7
        while url[idx].isdigit():
            offset = offset*10 + int(url[idx])
            idx += 1
            d += 1
        # Running count kept on the spider instance.
        self.n += len(tmp)
        titles.append(self.n)
        self.totaltitle = titles[-1]
        logging.info('inside if URL is: %s', url[:idx-d] + str(offset+100) + '&pagesize=100')
        # Request the next page of 100 titles.
        yield scrapy.Request(url[:idx-d] + str(offset+100) + '&pagesize=100', self.parse_profile_content)
    else:
        # An empty page means the list is exhausted: emit the item and reset the counters.
        item = GooglescholarItem()
        item['name'] = name
        item['totaltitle'] = self.totaltitle
        self.n = 0
        self.totaltitle = 0
        yield item
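As an aside, the digit-by-digit loop that reads `cstart` could probably be replaced with the standard library's `urllib.parse`. The following is only a sketch: `next_page_url` is a hypothetical helper name, and the URL in the usage line is a made-up example, not one taken from my crawl.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url, page_size=100):
    """Return url with its cstart query parameter advanced by page_size.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    parts = urlsplit(url)
    # parse_qs maps each parameter name to a list of values.
    query = parse_qs(parts.query)
    # A missing cstart is treated as offset 0.
    offset = int(query.get("cstart", ["0"])[0])
    query["cstart"] = [str(offset + page_size)]
    query["pagesize"] = [str(page_size)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


# Example with a placeholder user id (assumption, not a real profile).
print(next_page_url("https://scholar.google.com/citations?user=EXAMPLEID&cstart=0&pagesize=100"))
```

This keeps any other query parameters intact and does not depend on where `cstart` sits in the URL.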
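This is the result, but the totaltitle values are wrong. Klaus-Robert Müller has 837 titles in total, and Tom Mitchell has 264 titles. For the log, please see the attached image. I know there is something wrong with my logic.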
[
{"name": "Carl Edward Rasmussen","totaltitle": 1684},
{"name": "Carlos Guestrin","totaltitle": 365},
{"name": "Chris Williams","totaltitle": 1072},
{"name": "Ruslan Salakhutdinov","totaltitle": 208},
{"name": "Sepp Hochreiter","totaltitle": 399},
{"name": "Tom Mitchell","totaltitle": 282},
{"name": "Johannes Brandstetter","totaltitle": 1821},
{"name": "Klaus-Robert Müller","totaltitle": 549},
{"name": "Ajith Abraham","totaltitle": 1259},
{"name": "Amit kumar","totaltitle": 1127}
]
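One possible explanation, which I have not confirmed, is that `self.n` and `self.totaltitle` are shared by all authors while Scrapy handles the profile pages concurrently, so pages belonging to different authors interleave and update the same counter. The pure-Python simulation below is only a sketch of that idea: the `pages` list is invented data, and `shared_counter` and `per_author_counter` are hypothetical names that mimic the counting logic without using Scrapy.

```python
from collections import defaultdict

# Invented arrival order of result pages: (author, titles_on_page).
# A page with 0 titles marks the end of that author's list, as in the spider.
pages = [("A", 100), ("B", 100), ("A", 50), ("B", 100), ("A", 0), ("B", 20), ("B", 0)]


def shared_counter(pages):
    """Mimic self.n: one counter for all authors, reset whenever any author finishes."""
    n = 0
    totals = {}
    for author, count in pages:
        if count:
            n += count
        else:
            totals[author] = n
            n = 0
    return totals


def per_author_counter(pages):
    """Keep a separate running total for each author."""
    running = defaultdict(int)
    totals = {}
    for author, count in pages:
        if count:
            running[author] += count
        else:
            totals[author] = running.pop(author, 0)
    return totals


print(shared_counter(pages))
print(per_author_counter(pages))
```

With this sample order, the shared counter reports 350 for A and 20 for B, while the per-author version reports 150 and 220. In the spider, the same idea might be expressed by passing the running count along with each follow-up request, for example through the request's `meta` dict, instead of storing it on `self`.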