Scraping Google News results with Python and Beautiful Soup retrieves only the first page, without headlines

2024-05-19 • Q&A

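I want to scrape the headlines and paragraph text from the Google News search results page for a given search term, and I want to do this for the first n pages.

I have written code that scrapes only the first page, but I don't know how to modify the URL so that I can go to the other pages (page 2, 3, ...). That is my first problem.

The second problem is that I don't know how to scrape the headlines. I always get an empty list back. I have tried several approaches, but every one returns an empty list. (I don't think the page is dynamic.)

On the other hand, scraping the paragraph text below each headline works fine. Can you tell me how to fix these two problems?

Here is my code: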

from bs4 import BeautifulSoup
import requests

term = 'cocacola'

# this is only for page 1,how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')

# I think this is not JavaScript-dependent; the page is not dynamic
headline_results = soup.find_all('a', class_="l lLrAF")
# headline_results = soup.find_all('h3', class_="r dO0Ag")  # also does not work
print(headline_results)  # empty list, I don't know why

paragraph_results = soup.find_all('div',class_='st')
print(paragraph_results) # works

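Problem 1: Going to the next page.

To go to the next page, include the start parameter in the URL format string. Its value is the offset of the first result, so with 10 results per page, page 2 begins at start=10: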

term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term,(page - 1) * 10
)
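To cover the first n pages, the same idea can be wrapped in a small helper. This is only a sketch: build_news_urls and results_per_page are illustrative names that do not appear in the original answer, and it assumes Google keeps returning 10 results per page. It uses urlencode so that search terms containing spaces or special characters are escaped properly.

```python
from urllib.parse import urlencode


def build_news_urls(term, n_pages, results_per_page=10):
    """Return the Google News search URLs for pages 1..n_pages (hypothetical helper)."""
    base = "https://www.google.com/search"
    urls = []
    for page in range(1, n_pages + 1):
        query = {
            "q": term,
            "source": "lnms",
            "tbm": "nws",
            # start is the offset of the first result on the page
            "start": (page - 1) * results_per_page,
        }
        urls.append(base + "?" + urlencode(query))
    return urls


for url in build_news_urls("cocacola", 3):
    print(url)
```

Each URL can then be passed to requests.get in turn, exactly as the single-page code in the question does.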

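Problem 2: Scraping the headlines.

Google regenerates the class names, IDs, and other attributes of its DOM elements, so selectors such as class_="l lLrAF" are likely to fail whenever you retrieve new, uncached results.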
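One way to depend less on generated class names is to select by tag structure instead. The sketch below runs against a made-up HTML fragment, not real Google output: it assumes each headline is an h3 element wrapped in a link, which you would need to verify against the HTML that requests actually receives (printing response.text is a quick way to check).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class names are invented and stand in for
# whatever Google happens to generate on a given request.
sample_html = """
<div id="search">
  <div class="x1"><a href="https://example.com/a"><h3 class="zz9">First headline</h3></a>
    <div class="st">First snippet</div></div>
  <div class="x2"><a href="https://example.com/b"><h3 class="qq7">Second headline</h3></a>
    <div class="st">Second snippet</div></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

headlines = []
for heading in soup.find_all("h3"):      # select by tag, ignoring the class
    link = heading.find_parent("a")      # enclosing link, if there is one
    headlines.append({
        "title": heading.get_text(strip=True),
        "url": link["href"] if link else None,
    })

print(headlines)
```

If the real page nests the headline differently, only the find_all and find_parent calls need to change; the rest of the loop stays the same.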

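Just add the parameter start=10 to the search URL. For example: https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10

To loop over the result pages, computing the offset for each page, use something like this: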

from bs4 import BeautifulSoup
from requests import get

term="beautifulsoup"
page_max = 5

# loop over pages
for page in range(0,page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term,10*page)

    r = get(url) # you can also add headers here
    html_soup = BeautifulSoup(r.text,'html.parser')
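The loop above parses each page but does not keep the result. A possible extension, sketched here under assumptions, collects one soup per page and takes the download step as a parameter so it can be tried offline. scrape_pages and fake_fetch are hypothetical names, and the stub HTML is invented for the demonstration.

```python
from bs4 import BeautifulSoup


def scrape_pages(term, page_max, fetch):
    """Return one parsed page per result page (hypothetical helper).

    fetch is any callable that takes a URL and returns the HTML as text.
    """
    soups = []
    for page in range(page_max):
        url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(
            term, 10 * page
        )
        soups.append(BeautifulSoup(fetch(url), "html.parser"))
    return soups


# Offline stand-in for a real HTTP call, so the sketch runs without network access.
requested = []


def fake_fetch(url):
    requested.append(url)
    return "<html><body><h3>stub headline</h3></body></html>"


soups = scrape_pages("beautifulsoup", 3, fake_fetch)
print(requested)
print([soup.h3.get_text() for soup in soups])

# With the requests library, a real fetch function could look like this
# (the User-Agent value is only a placeholder):
#   fetch = lambda url: requests.get(
#       url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10
#   ).text
```

Passing the fetch function in keeps the paging logic separate from the network call, which also makes it easy to add headers or a delay between requests later.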

Link to my earlier answer to a partly identical question.

Alternatively, you can use the Google News Result API from SerpApi. It is a paid API with a free trial.

Partial JSON output:

"news_results": [
  {
    "position": 1,
    "link": "https://www.stltoday.com/lifestyles/food-and-cooking/best-bites-pepperidge-farms-caramel-macchiato-flavored-milano-cookies/article_d43e59a0-b362-5cb0-bdef-6b7563d9fed3.html",
    "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
    "source": "St. Louis Post-Dispatch",
    "date": "1 week ago",
    "snippet": "Coffee-flavored food items are usually very hit or miss. But we have found \nthe cookie that has accomplished the absolute best coffee flavoring I ...",
    "thumbnail": "https://serpapi.com/searches/608ffbbcef7ddabfb2982432/images/45d252f31c08b743573f629544c119f07e8c422143bff0265f31c8c08086393a.jpeg"
  }
]

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "best cookies",
  "tbm": "nws",
  "start": "10",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
  print(f"Title: {news_result['title']}\n")

Output:

Title: 10 Of The Absolute Best Cookies In Sydney
    
Title: This Cookie Quiz Will Reveal Your Best And Worst Quality

Title: Family cookies by Taimur Ali Khan is the best thing on internet

Title: Gibson Dunn Ranked Among Top Three Firms for Client ...

Title: Livingston CARES: Saying thank you to one cookie at a time

Title: Google's plan to replace cookies is the web's best hope for a more private internet

Title: The 12 Best Cookies in NYC

Title: 18 Places to Find the Best Cookies in the Champaign-Urbana ...

Title: Best Cookie Delivery Services - Where to Order Cookies Online

Title: How to make the best cookies for the holidays

Disclaimer: I work for SerpApi.
