DataFrame not showing the whole table

2024-05-19 • Q&A

I don't understand why the DataFrame shows only the last row instead of all the rows collected in beautified_value.

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = 'https://www.worldometers.info/world-population/population-by-country/'

output = requests.get(url)
soup = BeautifulSoup(output.text,'html.parser')

table = soup.find_all('table')
table = table[0]
columns = []

header_tags = table.find_all('th')

headers = [header.text.strip() for header in header_tags]
data_rows = table.find_all('tr')

for row in data_rows:
   value = row.find_all('td')
   beautified_value = [dp.text.strip() for dp in value]
#print(beautified_value)

df = pd.DataFrame(data=[beautified_value],columns=[headers])

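You are not appending values to beautified_value; you are overwriting it on every iteration of the loop, so only the last row is left when the DataFrame is built. You can use list.append, for example: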

from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://www.worldometers.info/world-population/population-by-country/"

output = requests.get(url)
soup = BeautifulSoup(output.text, "html.parser")

# The first <table> on the page holds the population data
table = soup.find("table")

header_tags = table.find_all("th")
headers = [header.text.strip() for header in header_tags]

# Skip the first <tr>, which is the header row
data_rows = table.find_all("tr")[1:]

# Collect one list of cell texts per row instead of rebinding a single variable
beautified_value = []
for row in data_rows:
    value = row.find_all("td")
    beautified_value.append([dp.text.strip() for dp in value])

df = pd.DataFrame(data=beautified_value, columns=headers)
print(df)

Prints:

       #   Country (or dependency) Population (2020) Yearly Change  Net Change Density (P/Km²) Land Area (Km²) Migrants (net) Fert. Rate Med. Age Urban Pop % World Share
0      1                     China     1,439,323,776        0.39 %   5,540,090             153       9,388,211       -348,399        1.7       38        61 %     18.47 %
1      2                     India     1,380,004,385        0.99 %  13,586,631             464       2,973,190       -532,687        2.2       28        35 %     17.70 %
2      3             United States       331,002,651        0.59 %   1,937,734              36       9,147,420        954,806        1.8       38        83 %      4.25 %

...
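
To see the difference without hitting the website, here is a minimal offline sketch. The three sample rows and the names last_only and collected are made up for illustration; only the overwrite-versus-append pattern is the point.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped table cells
headers = ["#", "Country", "Population"]
rows = [
    ["1", "China", "1,439,323,776"],
    ["2", "India", "1,380,004,385"],
    ["3", "United States", "331,002,651"],
]

# Overwriting: the name is rebound on every iteration,
# so only the last row is left after the loop.
for row in rows:
    last_only = [cell.strip() for cell in row]
df_overwritten = pd.DataFrame(data=[last_only], columns=headers)

# Appending: every cleaned row is collected before the DataFrame is built.
collected = []
for row in rows:
    collected.append([cell.strip() for cell in row])
df_appended = pd.DataFrame(data=collected, columns=headers)

print(df_overwritten.shape)  # (1, 3)
print(df_appended.shape)     # (3, 3)
```

Note that each row is appended as a real list (square brackets), so pandas receives a plain list of lists.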

Use read_html. Note that I had to set the User-Agent manually with requests, otherwise a 403 error is raised:

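import requests
import pandas as pd
from io import StringIO

url = "https://www.worldometers.info/world-population/population-by-country/"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
df = pd.read_html(StringIO(html))[0]  # StringIO: newer pandas deprecates raw HTML strings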

	#	Country (or dependency)	Population (2020)	Yearly Change	Net Change	Density (P/Km²)	Land Area (Km²)	Migrants (net)	Fert. Rate	Med. Age	Urban Pop %	World Share
0	1	China	1439323776	0.39%	5540090	153	9388211	-348399	1.7	38	61%	18.47%
1	2	India	1380004385	0.99%	13586631	464	2973190	-532687	2.2	28	35%	17.70%
2	3	United States	331002651	0.59%	1937734	36	9147420	954806	1.8	38	83%	4.25%
3	4	Indonesia	273523615	1.07%	2898047	151	1811570	-98955	2.3	30	56%	3.51%
4	5	Pakistan	220892340	2.00%	4327022	287	770880	-233379	3.6	23	35%	2.83%
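
If you want the 403 to show up as an explicit error rather than as a confusing parse failure, the request can be wrapped in a small helper. This is only a sketch: fetch_first_table is a hypothetical name (not part of pandas or requests), the 30-second timeout is an arbitrary choice, and read_html still needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd
import requests

URL = "https://www.worldometers.info/world-population/population-by-country/"


def fetch_first_table(url: str, user_agent: str = "Mozilla/5.0") -> pd.DataFrame:
    """Download a page and return its first HTML table as a DataFrame.

    Hypothetical helper for illustration only.
    """
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    # Raise requests.HTTPError on 4xx/5xx responses (such as the 403 above)
    # instead of handing an error page to the parser.
    response.raise_for_status()
    # read_html returns a list of DataFrames, one per <table> it finds.
    tables = pd.read_html(StringIO(response.text))
    return tables[0]


# Usage (performs a real network request):
# df = fetch_first_table(URL)
# print(df.head())
```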

kangbowen004 answered: DataFrame not showing the whole table
