Web scraping problem: some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER

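I am trying to scrape a website with urllib and BeautifulSoup (Python 3.9), but I keep getting the same warning, "Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER", and the output is full of special characters like these: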

��T�w?.��m����%�%z��%�H=S��$S�YYyi�ABD�x�!%��f36��\�Y� j�46f����I��9��!D��������������������b7�3�8��JnH�t���mړBm。 ��[�7X�?NF4r���[k��6�X?��VV��H�J$j�6h�e�C��]

I have read several threads about this issue but did not find a solution that works in my case. My code is below:

import urllib.request

from bs4 import BeautifulSoup

url = "https://www.fnac.com"
hdr = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
    "accept": "*/*",
    "accept-Encoding": "gzip,deflate,br",
    "accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
}
req = urllib.request.Request(url, headers=hdr)

page = urllib.request.urlopen(req)

if page.getcode() == 200:
    soup = BeautifulSoup(page,"html.parser",from_encoding="utf-8")
    #divs = soup.findAll('div')
    #href = [i['href'] for i in soup.findAll('a',href=True)]
    print(soup)

else:
    print("failed!")

I tried changing the encoding to ASCII or ISO-8859-(1...9), but the problem persists.

Thanks for your help :)

ramzismo answered: Web scraping problem: some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER

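Remove Accept-Encoding from the HTTP headers. When that header is sent, the server may answer with a compressed body (gzip, deflate or Brotli), and urllib.request does not decompress it for you, so BeautifulSoup is handed compressed bytes rather than HTML. That is also why changing from_encoding makes no difference.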

import urllib.request

from bs4 import BeautifulSoup

url = "https://www.fnac.com"
hdr = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Accept": "*/*",
    # "Accept-Encoding": "gzip,deflate,br",  # removed: urllib does not decompress the response
    "Accept-Language": "fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
}
req = urllib.request.Request(url, headers=hdr)

page = urllib.request.urlopen(req)

if page.getcode() == 200:
    soup = BeautifulSoup(page,"html.parser",from_encoding="utf-8")
    # divs = soup.findAll('div')
    # href = [i['href'] for i in soup.findAll('a',href=True)]
    print(soup)

else:
    print("failed!")

Prints:


<!DOCTYPE html>

<html class="no-js" lang="fr-FR">
<head><meta charset="utf-8"/> <!-- entry: inline-kameleoon -->


...
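
If you would rather keep compression on the wire, one possible alternative is to decompress the body yourself before handing it to BeautifulSoup. The following is only a sketch, not part of the original answer: decode_body and fetch_html are hypothetical helper names, and it assumes you request only "gzip, deflate" in Accept-Encoding, since Brotli ("br") has no decoder in the standard library.

```python
import gzip
import urllib.request
import zlib


def decode_body(raw: bytes, content_encoding: str = "") -> bytes:
    """Undo the HTTP Content-Encoding that urllib leaves in place (hypothetical helper)."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # "deflate" is normally a zlib-wrapped stream
            return zlib.decompress(raw)
        except zlib.error:
            # some servers send a raw deflate stream without the zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    if encoding in ("", "identity"):
        return raw
    raise ValueError(f"unsupported Content-Encoding: {encoding!r}")


def fetch_html(url: str, headers: dict) -> bytes:
    """Fetch a page and return the decompressed body (hypothetical helper)."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as page:
        return decode_body(page.read(), page.headers.get("Content-Encoding", ""))


# Offline check with locally compressed bytes, so no network is needed.
sample = "<html><head><meta charset='utf-8'/></head></html>".encode("utf-8")
assert decode_body(gzip.compress(sample), "gzip") == sample
assert decode_body(zlib.compress(sample), "deflate") == sample
```

With the headers from the answer plus "Accept-Encoding": "gzip, deflate", you would then call BeautifulSoup(fetch_html(url, hdr), "html.parser", from_encoding="utf-8"). Whether the server actually compresses its response is up to the server, so the helper returns the body unchanged when no Content-Encoding header is present.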