I'm attempting to scrape the following URL to grab CoVid data from WorldOMeter. On this page there is a table with the ID main_table_countries_today, containing 15x225 (3,375) data cells that I would like to gather.
I've tried several approaches, but let me share the attempt I believe came closest:
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()

# Refreshes the Terminal Emulator window
def clear_screen():
    def bash_input(user_in):
        _ = system(user_in)
    bash_input('clear')

# This bot searches for <table> and </table> to start/stop recording data
class Bot:
    def __init__(self, line_added=False, looking_for_start=True, looking_for_end=False):
        self.line_adding = line_added
        self.looking_for_start = looking_for_start
        self.looking_for_end = looking_for_end

    def set_line_adding(self, bool):
        self.line_adding = bool

    def set_start_look(self, bool):
        self.looking_for_start = bool

    def set_end_look(self, bool):
        self.looking_for_end = bool

if __name__ == '__main__':
    # Start with a fresh Terminal emulator
    clear_screen()
    my_bot = Bot()
    r = requests.get(url).text
    all_r = r.split('\n')
    for rs in all_r:
        if my_bot.looking_for_start and table_id in rs:
            my_bot.set_line_adding(True)
            my_bot.set_end_look(True)
            my_bot.set_start_look(False)
        if my_bot.looking_for_end and table_end in rs:
            my_bot.set_line_adding(False)
            my_bot.set_end_look(False)  # was my_bot.looking_for_end(False), which calls a bool and raises TypeError
        if my_bot.line_adding:
            all_lines.append(rs)
    for lines in all_lines:
        print(lines)
    print('\n\n\n\n')
    print(len(all_lines))
This prints 6,551 lines, which is more than double what I need. That would normally be fine, since the next step is to clean out the lines irrelevant to my data, but it doesn't actually capture the whole table either. An earlier attempt using BeautifulSoup (a very similar process) also failed to start and stop at the table above. It looked like this:
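One possible reason a line scan like this misbehaves: if the markup nests one table inside another, stopping at the first `</table>` seen cuts the capture short. The HTML string below is invented purely for illustration; only the id matches the real page:

```python
# Made-up snippet: the outer table nests an inner one.
html = """<table id="main_table_countries_today">
<tr><td><table><tr><td>inner</td></tr>
</table></td></tr>
<tr><td>row-after-inner</td></tr>
</table>"""

captured = []
recording = False
for line in html.split('\n'):
    if not recording and 'main_table_countries_today' in line:
        recording = True
    if recording:
        captured.append(line)
    if recording and '</table>' in line:
        # Stops at the FIRST close tag seen, i.e. the inner table's,
        # so 'row-after-inner' is never captured.
        recording = False

print(captured)
```

Running this captures only the first three lines; the row that follows the nested table is lost, even though it belongs to the target table.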
from bs4 import BeautifulSoup
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()

if __name__ == '__main__':
    # Here we go, again...
    _ = system('clear')
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')  # name the parser explicitly to avoid the GuessedAtParserWarning
    my_table = soup.find_all('table', {'id': table_id})
    for current_line in my_table:
        page_lines = str(current_line).split('\n')
        for line in page_lines:
            all_lines.append(line)
    for line in all_lines:
        print(line)
    print('\n\n')
    print(len(all_lines))
This produced 5,547 lines.
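For comparison, BeautifulSoup can hand over the rows and cells directly instead of re-splitting the table's string form into lines. A minimal sketch on an invented two-row table (the real page's structure may differ; only the id matches):

```python
from bs4 import BeautifulSoup

# Invented stand-in for the real page.
html = """
<table id="main_table_countries_today">
  <tr><th>Country</th><th>Total Cases</th></tr>
  <tr><td>USA</td><td>1,000</td></tr>
  <tr><td>Spain</td><td>500</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='main_table_countries_today')

rows = []
for tr in table.find_all('tr'):
    # Collect the text of every header/data cell in this row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    rows.append(cells)

print(rows)
# rows[0] is the header row; the rest are data rows
```

Working with `rows` as a list of lists sidesteps any counting of raw HTML lines entirely.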
I've also tried Pandas and Selenium, but I've since scrapped that code. I'm hoping that by showing my two "best" attempts, someone might spot an obvious problem I'm missing.
I'd be happy just to see the data on screen. Ultimately I'm trying to turn the data into a dictionary that looks like this (to be exported as a .json file):
data = {
    "Country": [country for country in countries],
    "Total Cases": [case for case in total_cases],
    "New Cases": [case for case in new_cases],
    "Total Deaths": [death for death in total_deaths],
    "New Deaths": [death for death in new_deaths],
    "Total Recovered": [death for death in total_recovered],
    "New Recovered": [death for death in new_recovered],
    "active Cases": [case for case in active_cases],
    "Serious/Critical": [case for case in serious_critical],
    "Total Cases/1M pop": [case for case in total_case_per_million],
    "Deaths/1M pop": [death for death in deaths_per_million],
    "Total Tests": [test for test in total_tests],
    "Tests/1M pop": [test for test in tests_per_million],
    "Population": [population for population in populations]
}
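Once the table is parsed into rows of strings, a dict-of-lists in that shape can be built by transposing with `zip` rather than maintaining fourteen separate column variables. A sketch with made-up rows and only a few of the columns:

```python
# Made-up parsed rows: header first, then one list per country.
header = ["Country", "Total Cases", "New Cases"]
rows = [
    ["USA", "1,000", "+10"],
    ["Spain", "500", "+5"],
]

# zip(*rows) transposes the rows into columns; pair each column with its header.
columns = list(zip(*rows))
data = {name: list(col) for name, col in zip(header, columns)}

print(data)
# {'Country': ['USA', 'Spain'], 'Total Cases': ['1,000', '500'], 'New Cases': ['+10', '+5']}
```

From there, `json.dump(data, fh)` from the standard library writes the .json file.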
Any suggestions?