Beautiful Soup pagination: find_all does not find text in the next_page class. Also need to extract data from URLs

2024-05-06 • Q&A

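I have been working on this for a week and am determined to get it working! My end goal is to write a web scraper where you enter a county name and the scraper produces a CSV file of information from the mugshots: name, location, eye color, weight, hair color, and height (this is for a genetics project I am working on).

The site is organized as: main site page -> state page -> county page with 120 mugshots, each with a name and URL -> the URL with the data I am ultimately after, plus a Next link to another set of 120.

I thought the best approach would be to write a scraper that grabs the URLs and names from the table of 120 mugshots, then uses pagination to get all the URLs and names for the rest of the county (in some cases there are tens of thousands). I can get the first 120, but my pagination does not work, so I end up with a CSV of 120 names and URLs.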

I closely followed this article which was very helpful

from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; confirms the 'lxml' parser is installed
import pandas as pd

county_name = input('Please,enter a county name: /Arizona/Maricopa-County-AZ \n')
print(f'Searching {county_name}. Wait,please...')

base_url = 'https://www.mugshots.com'
search_url = f'https://mugshots.com/US-Counties/{county_name}/'
data = {'Name': [], 'URL': []}


def export_table_and_print(data):
    table = pd.DataFrame(data, columns=['Name', 'URL'])
    table.index = table.index + 1
    table.to_csv('mugshots.csv', index=False)
    print('Scraping done. Here are the results:')
    print(table)


def get_mugshot_attributes(mugshot):
    # mugshot is an <a class="image-preview"> tag; the name is in a nested div.label
    name = mugshot.find('div', attrs={'class': 'label'})
    name = name.text
    url = mugshot.get('href')
    url = base_url + url
    data['Name'].append(name)
    data['URL'].append(url)


def parse_page(next_url):
    page = requests.get(next_url)

    if page.status_code == requests.codes.ok:
        bs = BeautifulSoup(page.text, 'lxml')

        list_all_mugshot = bs.find_all('a', attrs={'class': 'image-preview'})

        for mugshot in list_all_mugshot:
            get_mugshot_attributes(mugshot)

        # Pagination attempt: this is the part that does not work
        next_page_text = mugshot.find('a class', attrs={'next page'})

        if next_page_text == 'Next':
            next_page_text = mugshot.get_text()
            next_page_url = mugshot.get('href')
            next_page_url = base_url + next_page_url
            print(next_page_url)
            parse_page(next_page_url)
        else:
            export_table_and_print(data)


parse_page(search_url)

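Any ideas on how to get the pagination working, and how to eventually get the data from the list of URLs I scrape?

Thanks for your help! I have been working in Python for a few months now, but for some reason BS4 and Scrapy are so confusing.

Thank you so much, community! Anna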

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://mugshots.com/"
base = "https://mugshots.com"

def get_next_pages(link):
    print("**"*20,"current page:",link)
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("[itemprop='name'] > a[href^='/Current-Events/']"):
        yield from get_main_content(urljoin(base,item.get("href")))

    next_page = soup.select_one(".pagination > a:contains('Next')")
    if next_page:
        next_page = urljoin(url,next_page.get("href"))
        yield from get_next_pages(next_page)

def get_main_content(link):
    res = requests.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    item = soup.select_one("h1#item-title > span[itemprop='name']").text
    yield item

if __name__ == '__main__':
    for elem in get_next_pages(url):
        print(elem)
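
For the county listing in the question, the same idea can be sketched as a loop instead of recursion. Two things in the question's code probably keep it from advancing: the Next link is searched for inside the last `mugshot` tag rather than in the whole page, and the result of `find` (a tag or `None`) is compared to the string `'Next'`. The sketch below is only an illustration under assumptions: the function name `scrape_listing`, the `fetch` parameter, and the `a.next.page` selector (guessed from the `'next page'` class in the question) are not confirmed against the real site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_listing(start_url, fetch):
    """Collect (name, url) pairs from every page of a paginated listing.

    `fetch` is any callable mapping a URL to an HTML string, for example
    `lambda u: requests.get(u).text`. Passing it in keeps the function
    testable without network access.
    """
    rows = []
    seen = set()  # guards against a Next link that loops back
    url = start_url
    while url and url not in seen:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")

        # Assumed markup, taken from the question's selectors:
        # <a class="image-preview" href="..."><div class="label">Name</div></a>
        for link in soup.select("a.image-preview"):
            label = link.select_one("div.label")
            name = label.get_text(strip=True) if label else ""
            rows.append((name, urljoin(url, link.get("href", ""))))

        # Look for the Next link in the whole page, not inside one mugshot.
        # Assumed markup: <a class="next page" href="...">Next</a>
        next_link = soup.select_one("a.next.page")
        if next_link and next_link.get("href"):
            url = urljoin(url, next_link["href"])
        else:
            url = None
    return rows
```

With `requests` installed, `scrape_listing(search_url, lambda u: requests.get(u).text)` would return a list that can be passed to `pd.DataFrame(rows, columns=['Name', 'URL'])` and written out once, after the last page, as in the question's `export_table_and_print`.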
