Scraping with bs4 and Selenium: every loop iteration returns the same data

I'm new to web scraping and am trying to scrape historical data from timeanddate.com and write it to a CSV file. I'm using Selenium to get the data table for each date. My code:

from bs4 import BeautifulSoup
from selenium import webdriver
import csv

def getData (url,month,year):
  driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe') 
  driver.get(url)
  Data = []
  soup = BeautifulSoup(driver.page_source,"lxml")
  for i in driver.find_element_by_id("wt-his-select").find_elements_by_tag_name("option"):
    i.click()
    table = soup.find('table',attrs={'id':'wt-his'})
    for tr in table.find('tbody').find_all('tr'):
       dict = {}
       dict['time'] = tr.find('th').text.strip()
       all_td = tr.find_all('td')
       dict['humidity'] = all_td[5].text
       Data.append(dict)

    fileName = "output_month="+month+"_year="+year+".csv"
    keys = Data[0].keys()
    with open(fileName,'w') as result:
      dictWriter = csv.DictWriter(result,keys)
      dictWriter.writeheader()
      dictWriter.writerows(Data)

year_num = int(input("Enter your year to collect data from: "))
month_num = 1
year = str(year_num)
for i in range (0,12):
  month = str(month_num)
  url = "https://www.timeanddate.com/weather/usa/new-york/historic?month="+month+"&year="+year
  data = getData(url,month,year)
  print (data)
  month_num += 1

The table I want to scrape is the weather data table, and I want to get the humidity data for every day of the month.

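The program loops through the months, but the output is always the data for Monday, 1 January. Although the date changes in the browser, the same data is appended to the file every time (current output) instead of the data for each newly selected date (desired output). I don't know why it does this, and any help fixing it would be much appreciated.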

guang6883 answered: Scraping with bs4 and Selenium: every loop iteration returns the same data

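The problem is that you parse the page only once, even though the page changes with every date you select. Simply moving the parsing inside the for loop is not enough, though, because you also have to wait until the page has finished updating before you parse it again.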
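The two failure modes described above can be reproduced without a browser. The following is a minimal sketch using only the standard library; `FakePage`, `humidity`, `wait_until` and the three `scrape_*` functions are hypothetical names invented for illustration, and the "page" is a stand-in that redraws its table only after a short lag, which is an assumption about how the real site behaves.

```python
import re
import time


class FakePage:
    """Hypothetical stand-in for a browser tab: selecting a day redraws
    the table only after a couple of reads, mimicking an async update."""

    def __init__(self):
        self._shown = 1    # day currently rendered
        self._pending = 1  # day most recently selected
        self._lag = 0      # reads remaining before the redraw lands

    def select_day(self, day):
        self._pending = day
        self._lag = 2

    @property
    def page_source(self):
        if self._lag > 0:
            self._lag -= 1
        else:
            self._shown = self._pending
        return (f"<div id='title'>Day {self._shown}</div>"
                f"<td>{50 + self._shown}%</td>")


def humidity(html):
    """Pull the humidity number out of an HTML snapshot."""
    return int(re.search(r"<td>(\d+)%</td>", html).group(1))


def wait_until(condition, timeout=2.0, poll=0.01):
    """Poll `condition` until it is true or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(poll)
    raise TimeoutError("condition not met in time")


def scrape_stale(page, days):
    snapshot = page.page_source  # taken once, before any selection
    results = []
    for day in days:
        page.select_day(day)
        results.append(humidity(snapshot))  # always the same snapshot
    return results


def scrape_without_wait(page, days):
    results = []
    for day in days:
        page.select_day(day)
        results.append(humidity(page.page_source))  # read before the redraw
    return results


def scrape_fresh(page, days):
    results = []
    for day in days:
        page.select_day(day)
        # wait for the title to show the selected day, then re-read the page
        wait_until(lambda: f"Day {day}<" in page.page_source)
        results.append(humidity(page.page_source))
    return results
```

With this stand-in, both `scrape_stale` and `scrape_without_wait` keep returning the first day's value, while `scrape_fresh` returns a different value per day. The real page may update faster or slower than the fake one, so the sketch shows the pattern rather than the site's actual timing.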

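Below is my solution. Two things to note:

  1. I'm using WebDriverWait and expected_conditions, both of which ship with Selenium.
  2. I prefer locating elements by CSS selector, which simplifies the syntax considerably. This awesome game can help you learn them.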

# Necessary imports
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def getData(url, month, year):
  driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe')
  driver.get(url)
  wait = WebDriverWait(driver, 5)
  Data = []
  # find_elements(By..., ...) works in both Selenium 3 and Selenium 4
  for opt in driver.find_elements(By.CSS_SELECTOR, "#wt-his-select option"):
    opt.click()
    # wait until the table title changes to the selected date
    wait.until(EC.text_to_be_present_in_element((By.ID, 'wt-his-title'), opt.text))
    # re-query the live page on every iteration instead of reusing one parse
    for tr in driver.find_elements(By.CSS_SELECTOR, '#wt-his tbody tr'):
      row = {}  # renamed from dict to avoid shadowing the built-in
      row['time'] = tr.find_element(By.TAG_NAME, 'th').text.strip()
      # Note that I replaced 5 with 6 as nth-of-type starts indexing from 1
      row['humidity'] = tr.find_element(By.CSS_SELECTOR, 'td:nth-of-type(6)').text.strip()
      Data.append(row)
  # continue with csv handlers ...
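
The answer leaves the CSV step elided. One way it might look, reusing the csv.DictWriter approach from the question, is sketched below. `write_rows` is a hypothetical helper name, not part of the original answer, and it assumes every row dict has the same keys.

```python
import csv


def write_rows(rows, file_name):
    """Write a list of dicts (all sharing the same keys) to a CSV file."""
    if not rows:
        return False  # nothing was scraped; avoids an IndexError on rows[0]
    # newline="" stops the csv module from emitting blank lines on Windows
    with open(file_name, "w", newline="", encoding="utf-8") as result:
        writer = csv.DictWriter(result, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return True
```

Calling `write_rows(Data, fileName)` once, after the loop over the date options has finished, avoids rewriting the file on every iteration as the question's code does.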

