I'm new to web scraping and am trying to scrape historical data from timeanddate.com and write it to a CSV file. I'm using Selenium to load the data table for each date. My code:
from bs4 import BeautifulSoup
from selenium import webdriver
import csv


def getData(url, month, year):
    driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe')
    driver.get(url)
    Data = []
    soup = BeautifulSoup(driver.page_source, "lxml")
    # Click through every day in the date dropdown
    for i in driver.find_element_by_id("wt-his-select").find_elements_by_tag_name("option"):
        i.click()
        table = soup.find('table', attrs={'id': 'wt-his'})
        for tr in table.find('tbody').find_all('tr'):
            row = {}
            row['time'] = tr.find('th').text.strip()
            all_td = tr.find_all('td')
            row['humidity'] = all_td[5].text
            Data.append(row)
    # Write one CSV file per month
    fileName = "output_month=" + month + "_year=" + year + ".csv"
    keys = Data[0].keys()
    with open(fileName, 'w', newline='') as result:
        dictWriter = csv.DictWriter(result, keys)
        dictWriter.writeheader()
        dictWriter.writerows(Data)
    return Data


year_num = int(input("Enter your year to collect data from: "))
month_num = 1
year = str(year_num)
for i in range(0, 12):
    month = str(month_num)
    url = "https://www.timeanddate.com/weather/usa/new-york/historic?month=" + month + "&year=" + year
    data = getData(url, month, year)
    print(data)
    month_num += 1
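The table I want to scrape is the weather data table, and I want to get the humidity data for every day of the month.
The program loops through the months, but the output is always the data for Monday, January 1. Although the date changes in the browser, the same data is appended to the file every time (current output) instead of the new day's data being appended (desired output). I can't figure out why it does this, and any help fixing it would be greatly appreciated.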
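
Judging from the code above, one likely cause is that `soup` is built once from `driver.page_source` before the loop, so every click is followed by a lookup in the same stale snapshot. Below is a minimal sketch of the alternative, which re-reads the HTML after each click. The helper name `collect_rows_per_option` and its three parameters are hypothetical and exist only for illustration; the browser and parser are passed in as callables so the loop itself can be checked without Selenium. This assumes the page has finished updating by the time the HTML is read, and if it updates asynchronously an explicit wait may be needed as well.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def collect_rows_per_option(
    options: Iterable,                       # objects exposing a .click() method
    get_html: Callable[[], str],             # returns the page's *current* HTML
    parse_rows: Callable[[str], List[Row]],  # turns one HTML snapshot into row dicts
) -> List[Row]:
    """Click each option, then parse the HTML as it is after that click."""
    rows: List[Row] = []
    for option in options:
        option.click()
        # Fetch a fresh snapshot on every iteration. A soup built once
        # before the loop would keep describing the first day only.
        rows.extend(parse_rows(get_html()))
    return rows


# Hypothetical wiring with the names used in the question (not run here):
#   options = driver.find_element_by_id("wt-his-select") \
#                   .find_elements_by_tag_name("option")
#   rows = collect_rows_per_option(
#       options,
#       lambda: driver.page_source,
#       parse_humidity_rows,  # assumed helper wrapping the BeautifulSoup table loop
#   )
```

In the original function, the equivalent change would be moving the `BeautifulSoup(driver.page_source, "lxml")` call inside the `for` loop, after `i.click()`.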