Scraping all text between <table>TABLE I NEED</table> in Python

I am trying to scrape the URL below to get CoVid data from WorldOMeter. On that page there is a table with the ID main_table_countries_today, containing the 15x225 (3,375) data cells I want to collect.

I have tried several approaches, but let me share the attempt I think came closest:

import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()


# Refreshes the Terminal Emulator window
def clear_screen():

    def bash_input(user_in):
        _ = system(user_in)
    
    bash_input('clear')


# This bot searches for <table> and </table> to start/stop recording data
class Bot:

    def __init__(self, line_added=False, looking_for_start=True, looking_for_end=False):

        self.line_adding = line_added
        self.looking_for_start = looking_for_start
        self.looking_for_end = looking_for_end
    
    def set_line_adding(self, value):

        self.line_adding = value

    def set_start_look(self, value):

        self.looking_for_start = value

    def set_end_look(self, value):

        self.looking_for_end = value


if __name__ == '__main__':

    # Start with a fresh Terminal emulator
    clear_screen()
    
    my_bot = Bot()

    r = requests.get(url).text
    all_r = r.split('\n')

    for rs in all_r:

        if my_bot.looking_for_start and table_id in rs:
                
            my_bot.set_line_adding(True)
            my_bot.set_end_look(True)
            my_bot.set_start_look(False)
        
        if my_bot.looking_for_end and table_end in rs:    
                
            my_bot.set_line_adding(False)
            my_bot.set_end_look(False)
        
        if my_bot.line_adding:

            all_lines.append(rs)
        

    # After the scan, show everything that was captured and how many lines it is
    for lines in all_lines:
        print(lines)

    print('\n\n\n\n')
    print(len(all_lines))

This prints 6,551 lines, which is more than twice what I need. Normally that would be fine, since the next step is to clean out the lines that are not relevant to my data; however, it does not capture the whole table either. An earlier, very similar attempt using BeautifulSoup also failed to start and stop at the table above. It looked like this:

from bs4 import BeautifulSoup
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()


if __name__ == '__main__':

    # Here we go, again...
    _ = system('clear')

    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    my_table = soup.find_all('table', {'id': table_id})

    for current_line in my_table:

        page_lines = str(current_line).split('\n')

        for line in page_lines:
            all_lines.append(line)

    for line in all_lines:
        print(line)

    print('\n\n')
    print(len(all_lines))

That produced 5,547 lines.

I have also tried Pandas and Selenium, but I have since scrapped that code. I am hoping that by showing my two "best" attempts, someone might spot something obvious that I am missing.
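
(For reference, since Pandas was one of the discarded approaches: below is a minimal sketch of the pandas.read_html route, assuming lxml or html5lib is installed; the table id is the one from the question and the variable names are just examples.)

import io
import requests
import pandas as pd

url = 'https://www.worldometers.info/coronavirus/'
html = requests.get(url).text

# read_html returns a list of DataFrames; filtering on the id attribute
# should leave only the table in question
tables = pd.read_html(io.StringIO(html), attrs={'id': 'main_table_countries_today'})
df = tables[0]
print(df.head())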

I would be happy just to see the data on screen. Ultimately, I am trying to turn the data into a dictionary that looks like this (to be exported as a .json file):

data = {
    "Country": [country for country in countries],
    "Total Cases": [case for case in total_cases],
    "New Cases": [case for case in new_cases],
    "Total Deaths": [death for death in total_deaths],
    "New Deaths": [death for death in new_deaths],
    "Total Recovered": [death for death in total_recovered],
    "New Recovered": [death for death in new_recovered],
    "active Cases": [case for case in active_cases],
    "Serious/Critical": [case for case in serious_critical],
    "Total Cases/1M pop": [case for case in total_case_per_million],
    "Deaths/1M pop": [death for death in deaths_per_million],
    "Total Tests": [test for test in total_tests],
    "Tests/1M pop": [test for test in tests_per_million],
    "Population": [population for population in populations]
}
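
(Side note on the export step: once a dictionary of lists like the one above is filled in, writing the .json file only takes the standard library. A minimal sketch; the filename is just an example.)

import json

# `data` is assumed to be the dictionary of lists shown above
with open('covid_data.json', 'w') as fp:
    json.dump(data, fp, indent=4)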

Any suggestions?

Answer from syf19967709:

The table contains a lot of other information. You can take the first 15 consecutive <td> cells in each row and strip the first 8 / last 8 rows:

import requests
import pandas as pd
from bs4 import BeautifulSoup


url = "https://www.worldometers.info/coronavirus/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

all_data = []
for tr in soup.select("#main_table_countries_today tr:has(td)")[8:-8]:
    tds = [td.get_text(strip=True) for td in tr.select("td")][:15]
    all_data.append(tds)

df = pd.DataFrame(
    all_data,
    columns=[
        "#", "Country", "Total Cases", "New Cases", "Total Deaths",
        "New Deaths", "Total Recovered", "New Recovered", "Active Cases",
        "Serious,Critical", "Tot Cases/1M pop", "Deaths/1M pop",
        "Total Tests", "Tests/1M pop", "Population",
    ],
)
print(df)

Prints:

       #                 Country Total Cases New Cases Total Deaths New Deaths Total Recovered New Recovered Active Cases Serious,Critical Tot Cases/1M pop Deaths/1M pop  Total Tests Tests/1M pop     Population
0      1                     USA  35,745,024                629,315                 29,666,117                  5,449,592            11,516          107,311         1,889  529,679,820    1,590,160    333,098,437
1      2                   India  31,693,625   +39,041      424,777       +393      30,846,509       +33,636      422,339             8,944           22,725           305  468,216,510      335,725  1,394,642,466
2      3                  Brazil  19,917,855                556,437                 18,619,542                    741,876             8,318           92,991         2,598   55,034,721      256,943    214,190,490
3      4                  Russia   6,288,677   +22,804      159,352       +789       5,625,890       +17,271      503,435             2,300           43,073         1,091  165,800,000    1,135,600    146,002,094

...

218  219                   Samoa           3                                                 3                          0                                 15                                                199,837
219  220            Saint Helena           2                                                 2                          0                                328                                                  6,097
220  221              Micronesia           1                                                 1                          0                                  9                                                116,324
221  222                   China      93,005       +75        4,636                     87,347           +24        1,022                25               65             3  160,000,000      111,163  1,439,323,776

Here is something else you can try; the basic explanation is in the code comments:

from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get('https://www.worldometers.info/coronavirus')
soup = BeautifulSoup(page.content, "lxml")

table = soup.find('table', attrs={'id': 'main_table_countries_today'})
# Finding table using id

trs = table.find_all("tr",attrs={"style": ""})
# Finding tr from table using style attribute

data = []
data.append(trs[0].text.strip().split("\n")[:13])
# Appending first element of trs to data(list)

for tr in trs[1:]:
    data.append(tr.text.strip().split("\n")[:12])
    # Appending all other data from tr in data(list)

df = pd.DataFrame(data[1:], columns=data[0][:12])
# Converting data into pandas DataFrame and specifying header name from first row of data.

print(df)
"""
          #          Country,Other  TotalCases   NewCases TotalDeaths  \
0     World            198,878,345    +370,787  4,238,503      +6,065
1         1                    USA  35,024               629,315
2         2                  India  31,625    +39,041    424,777
3         3                 Brazil  19,855               556,437
4         4                 Russia   6,677    +22,804    159,352
..      ...                    ...         ...        ...         ...
208     211  Saint Pierre Miquelon          28
209     213             Montserrat          21                     1
210     215         Western Sahara          10                     1
211     222                  China      93,005        +75      4,636
212  Total:            198,065

       NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical  \
0    179,521,450       +271,005   15,118,392      90,326           25,514
1                    29,117                5,592           11,516
2           +393     30,509      +33,636     422,339            8,944
3                    18,542                  741,876            8,318
4           +789      5,890      +17,271     503,435            2,300
..           ...            ...          ...         ...              ...
208                          26                        2
209                          19                        1
210                           8                        1
211                      87,347          +24       1,022               25
212  179,326         25,514.2

    Tot Cases/1M pop Deaths/1M pop
0              543.8
1            107,889
2             22,725           305
3             92,598
4             43,091
..               ...           ...
208            4,859
209            4,204           200
210               16             2
211               65             3
212            543.8

[213 rows x 12 columns]
"""
# If you don't need the default pandas index you can reset it:
df.reset_index(inplace=True)
# And to use the "#" column as the index:
df.set_index("#", inplace=True)

# Now we have the complete data; if we want, we can save it to a file as well
df.to_csv("<>.csv", index=False)
# or, for Excel:
df.to_excel("<>.xlsx", index=False)

You got a result of "5,547" lines because there are many blank lines and some unnecessary rows, which is why it grew so large. This approach also cuts down the manual work of building a dictionary like your data dict, since you no longer have to write out the column names one by one.
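
(To tie this back to the dictionary/.json goal in the question: the DataFrame can be converted directly. A minimal sketch, assuming df is the DataFrame built above and covid.json is just an example filename.)

import json

# {column name: list of values} -- the same shape as the `data` dict in the question
data = df.to_dict(orient="list")

with open("covid.json", "w") as fp:
    json.dump(data, fp, indent=4)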

