Changing a single value in a URL string

2024-05-17 • Q&A

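I'm learning web scraping and practicing on example.webscraping.com. I can extract the information I need from a single page, but I'd like to know the simplest way to iterate over multiple pages. My approach was just to use a formatted string, since the only difference between pages is the value at the end of the URL "http://example.webscraping.com/places/default/index/1".

However, I haven't had any luck, even when creating an integer counter and inserting it into the URL as a string so that the URL changes after each full pass of the loop. I realize this may not be the generally accepted way to do it, but the only alternative I can think of is building a dictionary and trying it that way, and that looks like opening another can of worms. Any advice is gladly accepted. Apologies if this has been covered before, but every post I found was too complex for me to follow as a beginner. Also, webloop is the function I use to extract all the data I want from the site with a for loop. Thanks in advance for any suggestions. For future reference, if I want to go through sites like eBay and Amazon with the same kind of request, changing it after each loop, what is the best way to do that?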

from bs4 import BeautifulSoup as smoothie
import requests

def webloop():
    # Uses the global `soup` that the loop below assigns on each iteration.
    for results in soup.find('div', id='results').find_all('div'):
        country = results.a.text
        flag = results.a.img
        print(country)
        print(flag)
        print()

pagenum = 1

while pagenum != 4:
    # The ValueError described below is raised on this line.
    source = requests.get('http://example.webscraping.com/places/default/index/pagenum=%s').text % (str(pagenum))
    soup = smoothie(source, 'html.parser')
    webloop()
    pagenum += 1

I intended the loop to request a new page on each iteration, but I get this error: ValueError: unsupported format character 'Y' (0x59) at index 2159

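I'm not sure this is your whole problem, but the code in the question applies the string formatting in the wrong place. I'm also not sure what webloop() is here. In addition, the page numbers on that web scraping example site appear to start at 0, not 1.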

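from bs4 import BeautifulSoup as smoothie
import requests

pagenum = 1

while pagenum != 4:
    # Format the URL string itself, before it is passed to requests.get().
    source = requests.get('http://example.webscraping.com/places/default/index/pagenum={}'.format(pagenum))
    # Parse the response body (source.text), not the Response object.
    soup = smoothie(source.text, 'html.parser')
    webloop()  # webloop() as defined in the question
    pagenum += 1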
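
To see why the placement matters, here is a minimal sketch that needs no network access. It assumes the downloaded page contains a literal '%' character somewhere, which would explain an error pointing at index 2159, far beyond the end of the URL. The name fake_page_text and its contents are made up for illustration.

```python
URL_TEMPLATE = "http://example.webscraping.com/places/default/index/pagenum=%s"
pagenum = 1

# Correct placement: % is applied to the URL template before any request is made.
url = URL_TEMPLATE % pagenum
print(url)

# Hypothetical stand-in for a downloaded page body that contains a literal '%'.
fake_page_text = "<p>100%Yes, a literal percent sign</p>"

# Wrong placement: this mirrors requests.get(...).text % (...), which tries to
# format the page body and fails on the first '%' followed by an unsupported character.
try:
    fake_page_text % str(pagenum)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The first print shows the URL with the page number filled in. The second shows the same kind of "unsupported format character" message as in the question, raised by the page text rather than by the URL.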

The problem is with the string formatting. In Python 3.6 and later, you can use the more readable f-strings:

source = requests.get(f'http://example.webscraping.com/places/default/index/pagenum={pagenum}')

There are other problems as well, such as needing source.text rather than source, and the webloop() function being missing (in the code you posted).

from bs4 import BeautifulSoup as smoothie
import requests

for pagenum in range(1,4):
    source = requests.get(f'http://example.webscraping.com/places/default/index/pagenum={pagenum}')
    soup = smoothie(source.text,'html.parser')
    webloop()
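
For future reference, one possible way to keep the URL building separate from the scraping is a small generator. This is only a sketch: page_urls and BASE_URL are hypothetical names, and it assumes the URL pattern quoted in the question (the page number as the last path segment, as in .../index/1) together with the observation above that the site's pages start at 0.

```python
def page_urls(base_url, first_page, last_page):
    """Yield one URL per page number, from first_page to last_page inclusive."""
    for page in range(first_page, last_page + 1):
        # Assumption: the page number is the final path segment of the URL.
        yield f"{base_url}/{page}"


BASE_URL = "http://example.webscraping.com/places/default/index"

urls = list(page_urls(BASE_URL, 0, 2))
for url in urls:
    print(url)
```

Each generated URL can then be passed to requests.get() inside the loop, as in the snippet above. For a site whose pages follow a different pattern, only the f-string inside page_urls would need to change.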
