Web scraping in Python (BeautifulSoup)

2024-05-09 • Q&A

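I am trying to do some web scraping and am currently stuck on how to continue with my code. I am trying to write code that scrapes the first 80 Yelp! reviews. Since there are only 20 reviews per page, I am also trying to figure out how to create a loop that moves the page on to the next 20 reviews.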

from bs4 import BeautifulSoup
import requests
import time
all_reviews = ''

def get_description(pullman):
    # plain string: the f-prefix was unnecessary because the URL has no placeholders
    url = 'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    #sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc,"lxml")
    page_title = soup.title.text
    #get a tag content based on class
    p_tag = soup.find_all('p',class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    #print the text within the tag
    return p_tag.text

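General note/tip: use the "Inspect" tool on the page you want to scrape.

As for your question, it also works better if you fetch the site once, parse it with BeautifulSoup, and then pass the soup object to your functions: fetch once, parse as many times as you need. That way you are less likely to be blacklisted by the site. An example structure is below.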

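from bs4 import BeautifulSoup
import requests
import time

url = 'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url (a single request for the whole page)
response = requests.get(url)
# sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
# both functions receive the already-parsed soup instead of fetching the page again
get_description(soup)
get_reviews(soup)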

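If you inspect the page, each review appears as a copy of the same template. If you take each review as a separate object and parse it, you can get the reviews you want. The class of the review template is: lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT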
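A minimal sketch of that idea, run against a small hand-written HTML sample rather than the live page. The sample markup, the review texts, and the body of get_reviews are assumptions made for illustration; only the class names come from this thread, and Yelp's generated class names may well have changed since.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the Yelp markup (not real page data).
SAMPLE_HTML = """
<ul>
  <li class="lemon--li__373c0__1r9wz u-space-b3 u-padding-b3">
    <p class="lemon--p__373c0__3Qnnj comment__373c0__3EKjH">Great burgers.</p>
  </li>
  <li class="lemon--li__373c0__1r9wz u-space-b3 u-padding-b3">
    <p class="lemon--p__373c0__3Qnnj comment__373c0__3EKjH">Slow service.</p>
  </li>
</ul>
"""

def get_reviews(soup):
    """Return the text of every review found in an already-parsed page."""
    reviews = []
    # Matching on one class is enough: class_ matches any element that
    # carries this class among its classes.
    for item in soup.find_all("li", class_="lemon--li__373c0__1r9wz"):
        comment = item.find("p", class_="comment__373c0__3EKjH")
        if comment is not None:
            reviews.append(comment.get_text(strip=True))
    return reviews

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
reviews = get_reviews(soup)
print(reviews)
```

Matching on a single class instead of the full class string keeps the lookup working if the order or number of the other classes changes.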

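For pagination, the page numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j".

The individual page-number links are contained in a-href tags, so you only need to write a for loop to iterate over the links.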
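As a rough illustration, the loop over the pagination links could look like the sketch below. The sample HTML, its href values, and the helper name get_page_links are hypothetical; only the pagination-links class name is taken from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the pagination block (href values are made up).
SAMPLE_HTML = """
<div class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6">
  <a href="/biz/example?start=0">1</a>
  <a href="/biz/example?start=20">2</a>
  <span>3</span>
  <a href="/biz/example?start=60">4</a>
</div>
"""

def get_page_links(soup):
    """Collect the href of every link inside the pagination container."""
    container = soup.find("div", class_="pagination-links__373c0__2ZHo6")
    if container is None:
        return []
    # href=True skips anchors that have no href attribute.
    return [a["href"] for a in container.find_all("a", href=True)]

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
page_links = get_page_links(soup)
for link in page_links:
    print(link)
```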

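To get the next page, you would have to follow the "Next" link. The problem here is that the link is the same as before plus a #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox], switch to the Network tab, then click the Next button, and you will see a request to something like: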

https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40

which returns something like:

{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place ...
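A small sketch of how the four request URLs for 80 reviews could be generated and how the review texts could be pulled out of a response. It assumes that the first page corresponds to start=0 and that every review has the "comment" / "text" shape seen in the fragment above; the function names and the shortened sample payload are made up for illustration, and no network request is made here.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the request observed in the Network tab.
FEED_URL = "https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed"

def feed_urls(total=80, per_page=20):
    """Build one URL per page: start=0, 20, 40, 60 for the first 80 reviews."""
    urls = []
    for start in range(0, total, per_page):
        query = urlencode(
            {"rl": "en", "sort_by": "relevance_desc", "q": "", "start": start}
        )
        urls.append(f"{FEED_URL}?{query}")
    return urls

def extract_texts(payload):
    """Return the comment text of each review in a JSON response body."""
    data = json.loads(payload)
    return [review["comment"]["text"] for review in data.get("reviews", [])]

urls = feed_urls()
# Shortened, hypothetical payload in the same shape as the fragment above.
sample = '{"reviews": [{"comment": {"text": "Such a great experience"}}]}'
texts = extract_texts(sample)
print(urls)
print(texts)
```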

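This is JSON. The only problem is that you need to fool Yelp's servers into thinking you are browsing the site by sending the same headers a browser would, otherwise you will get different data that does not look like comments.

In Chrome they appear in the Network tab, under the headers of the selected request.

My usual approach is to copy-paste the headers that do not start with a colon (ignoring :authority, etc.) directly into a triple-quoted string called raw_headers, then run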

headers = {k.strip(): v.strip() for k, _, v in (h.partition(':') for h in raw_headers.strip().splitlines())}

over them, and then pass them as an argument to requests like this:

requests.get(url,headers=headers)

Some of the headers are not necessary, cookies may expire, and all sorts of other problems can crop up, but this at least gives you a fighting chance.
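For reference, here is a slightly more defensive sketch of the header-parsing step. The header values below are placeholders rather than the ones Yelp actually expects, and parse_raw_headers is a hypothetical helper name. It skips blank lines and pseudo-headers such as :authority, and strips surrounding whitespace, since a header value with leading whitespace may be rejected by requests.

```python
# Placeholder headers, as they might be pasted from the browser's Network tab.
raw_headers = """
:authority: www.yelp.com
accept: */*
accept-language: en-US,en;q=0.9
referer: https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city
x-requested-with: XMLHttpRequest
"""

def parse_raw_headers(raw):
    """Turn pasted 'name: value' lines into a dict usable as request headers."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":"):
            continue
        # partition splits on the first colon only, so URLs in values survive.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

headers = parse_raw_headers(raw_headers)
print(headers)
```

The resulting dict can then be passed along as in the call above, requests.get(url, headers=headers).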


Answered by shilaoban2: Web scraping in Python (BeautifulSoup)
