Handling timeouts with Selenium and Python

2024-05-03 • Q&A

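Can anyone help me? I have written code that uses Selenium to scrape articles from a Chinese news site. Because many of the URLs fail to load, I tried adding code to catch the timeout exception. That works, but the browser then seems to stay on the page that timed out while loading instead of trying the next URL.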

I have tried adding driver.quit() and driver.close() after handling the error, but then it does not work when continuing to the next iteration of the loop.

with open('url_list_XB.txt','r') as f:
    url_list = f.readlines()

for idx,url in enumerate(url_list):
    status = str(idx)+" "+str(url)
    print(status)

    try:
        driver.get(url)
        try:
            tblnks = driver.find_elements_by_class_name("post_topshare_wrap")
            for a in tblnks:
                html = a.get_attribute('innerHTML')
                try:
                    link = re.findall('href="http://comment(.+?)" title',str(html))[0]
                    tb_link = 'http://comment' + link
                    print(tb_link)
                    ID = tb_link.replace("http://comment.tie.163.com/","").replace(".html","")
                    print(ID)
                    with open('tb_links.txt','a') as p:
                        p.write(tb_link + '\n')
                    try:
                        text = str(driver.find_element_by_class_name("post_text").text)
                        headline = driver.find_element_by_tag_name('h1').text
                        date = driver.find_elements_by_class_name("post_time_source")
                        for a in date:
                            date = str(a.text)
                            dt = date.split("　来源")[0]
                            dt2 = dt.replace(":","_").replace("-","_").replace(" ","_")

                        count = driver.find_element_by_class_name("post_tie_top").text

                        with open('SCS_DATA/' + dt2 + '_' + ID + '_INT_' + count + '_WY.txt','w') as d:
                            d.write(headline)
                            d.write(text + '\n')
                        path = 'SCS_DATA/' + ID
                        os.mkdir(path)

                    except NoSuchElementException as exception:
                        print("Element not found ")
                except IndexError as g:
                    print("Index Error")


            node = [url,tb_link]
            results.append(node)

        except NoSuchElementException as exception:
            print("TB link not found ")
        continue


    except TimeoutException as ex:
        print("Page load time out")

    except WebDriverException:
        print('WD Exception')

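I expect the code to work through the URL list, open each URL, and grab the article text along with the link to the discussion page. It works until a page times out while loading, and then the program does not move on.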

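I can't fully understand what your code is doing because I don't have any context for the pages you are automating, but I can give you a general structure for how to accomplish something like this. Here is a simplified version of how I would handle your scenario: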

# iterate over the URL list
for url in url_list:
    # navigate to a URL
    driver.get(url)

    # check something here to test whether the link is 'broken'
    try:
        driver.find_element(someLocator)
    # if the link is broken, go back
    except TimeoutException:
        driver.back()
        # continue so we return to the beginning of the loop
        continue

    # if you reach this point, the link is valid and you can 'do stuff' on the page

This code navigates to each URL and performs a check (that you specify) to see whether the link is 'broken'. We detect a broken link by catching the TimeoutException that gets thrown. If the exception is thrown, we navigate back to the previous page, then call continue to return to the beginning of the loop and start over with the next URL.

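If we make it through the try / except block, the URL is valid and we are on the correct page. This is where you can write the code to scrape the article, or whatever else you need to do.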

The code that appears after the try / except is reached only if the except is not hit, which means the URL is valid.
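In the question's code, the TimeoutException appears to come from driver.get() itself, which is what Selenium raises once a page-load timeout has been set and is exceeded. Under that assumption, a variant of the same structure would put the get() call inside the try and ask the browser to stop loading before moving on. The sketch below is only an illustration: visit_urls, scrape and page_timeout are hypothetical names that do not appear in the question, and the fallback exception classes exist only so the sketch can run where Selenium is not installed.

```python
try:
    from selenium.common.exceptions import TimeoutException, WebDriverException
except ImportError:
    # Stand-ins so the sketch runs without Selenium (assumption, not the real classes).
    class WebDriverException(Exception):
        pass

    class TimeoutException(WebDriverException):
        pass


def visit_urls(driver, urls, scrape, page_timeout=30):
    """Visit each URL in turn and skip the ones that time out or fail.

    `scrape` is a hypothetical callback that receives the driver once the
    page has loaded. Returns (results, skipped).
    """
    # Make driver.get() raise TimeoutException instead of waiting indefinitely.
    driver.set_page_load_timeout(page_timeout)
    results, skipped = [], []

    for url in urls:
        url = url.strip()  # readlines() keeps the trailing newline
        if not url:
            continue
        try:
            driver.get(url)
        except TimeoutException:
            # Ask the browser to abandon the stalled load so the next
            # get() does not start from a half-loaded page.
            try:
                driver.execute_script("window.stop();")
            except WebDriverException:
                pass
            skipped.append(url)
            continue
        except WebDriverException:
            skipped.append(url)
            continue

        # Reached only when the page loaded without an exception.
        results.append((url, scrape(driver)))

    return results, skipped
```

Note that the driver is never closed inside the loop. After driver.quit() the session is gone, so a later driver.get() cannot succeed unless a new driver is created first.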


Answered by ma_jing2004
