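Can anyone help me? I have written code that uses Selenium to scrape articles from a Chinese news site. Because many of the URLs fail to load, I added code to catch the timeout exception. That part works, but afterwards the browser seems to stay on the page that timed out while loading instead of trying the next URL.
I have tried adding driver.quit() and driver.close() after handling the error, but then the next iteration of the loop does not work.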
with open('url_list_XB.txt','r') as f:
    url_list = f.readlines()
for idx,url in enumerate(url_list):
    status = str(idx)+" "+str(url)
    print(status)
    try:
        driver.get(url)
        try:
            tblnks = driver.find_elements_by_class_name("post_topshare_wrap")
            for a in tblnks:
                html = a.get_attribute('innerHTML')
                try:
                    link = re.findall('href="http://comment(.+?)" title',str(html))[0]
                    tb_link = 'http://comment' + link
                    print(tb_link)
                    ID = tb_link.replace("http://comment.tie.163.com/","").replace(".html","")
                    print(ID)
                    with open('tb_links.txt','a') as p:
                        p.write(tb_link + '\n')
                    try:
                        text = str(driver.find_element_by_class_name("post_text").text)
                        headline = driver.find_element_by_tag_name('h1').text
                        date = driver.find_elements_by_class_name("post_time_source")
                        for a in date:
                            date = str(a.text)
                            dt = date.split(" 来源")[0]
                            dt2 = dt.replace(":","_").replace("-","_").replace(" ","_")
                        count = driver.find_element_by_class_name("post_tie_top").text
                        with open('SCS_DATA/' + dt2 + '_' + ID + '_INT_' + count + '_WY.txt','w') as d:
                            d.write(headline)
                            d.write(text + '\n')
                        path = 'SCS_DATA/' + ID
                        os.mkdir(path)
                    except NoSuchElementException as exception:
                        print("Element not found ")
                except IndexError as g:
                    print("Index Error")
                node = [url,tb_link]
                results.append(node)
        except NoSuchElementException as exception:
            print("TB link not found ")
        continue
    except TimeoutException as ex:
        print("Page load time out")
    except WebDriverException:
        print('WD Exception')
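I expect the code to move through the URL list, load each page, and grab the article text along with the link to the discussion page. It works until a page load times out, and then the program does not continue.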
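For reference, here is a minimal sketch of the behaviour I am after, not a confirmed fix. It assumes that setting an explicit page load timeout and asking the browser to stop the stalled load (via `window.stop()`) is enough to let the next `driver.get()` proceed; I have not verified that this holds for every browser or driver version. The names `visit_all`, `scrape` and `recoverable_errors` are hypothetical helpers made up for this illustration. Note that the driver is never quit inside the loop: `driver.quit()` ends the session, so any later `driver.get()` on the same object fails unless a new driver is created.

```python
def visit_all(driver, urls, scrape, recoverable_errors, page_timeout=30):
    """Visit every URL in order and skip pages that fail, instead of stopping.

    driver             -- a Selenium WebDriver (or any object with the same methods)
    scrape             -- hypothetical callback scrape(driver, url) doing the extraction
    recoverable_errors -- tuple of exception classes to survive, e.g.
                          (TimeoutException, WebDriverException)
    Returns the list of URLs that were skipped.
    """
    # Make driver.get() raise a timeout error instead of waiting indefinitely.
    driver.set_page_load_timeout(page_timeout)
    failed = []
    for idx, url in enumerate(urls):
        url = url.strip()  # readlines() keeps the trailing newline
        if not url:
            continue
        try:
            driver.get(url)
            scrape(driver, url)
        except recoverable_errors as exc:
            print(idx, url, "skipped:", type(exc).__name__)
            failed.append(url)
            try:
                # Assumption: stopping the pending load leaves the browser
                # ready for the next get(). Keep the session open here.
                driver.execute_script("window.stop();")
            except recoverable_errors:
                pass  # the browser may not respond; move on regardless
    return failed


# Hypothetical usage with a real browser (requires Selenium):
#   from selenium import webdriver
#   from selenium.common.exceptions import TimeoutException, WebDriverException
#   driver = webdriver.Chrome()
#   try:
#       with open('url_list_XB.txt') as f:
#           failed = visit_all(driver, f.readlines(),
#                              lambda d, u: print(d.title),
#                              (TimeoutException, WebDriverException))
#   finally:
#       driver.quit()  # quit once, after the whole list has been processed
```

Passing the exception classes in as a parameter keeps the loop independent of Selenium itself, so it can be exercised with a stand-in driver before pointing it at the real site.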