I'm new to Python. I made a web-scraping script that works fine when I use up to about 80 URLs.
I added a progress bar, and it shows the script getting stuck after roughly 95 URLs in the loop. I tried running it with other website URLs from the list, but it still gets stuck.
Does anyone have a solution for this?
Here is the script:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from requests.exceptions import ChunkedEncodingError, ConnectionError, MissingSchema
from tqdm import tqdm
from urllib3.exceptions import HTTPError as BaseHTTPError

data = pd.read_excel(r'Z:\001\input.xlsx')
urllist = data['IDE_WEBSITE'].tolist()
wordlist = ('woord1', 'woord2')
results = []
errors = []

for word in wordlist:
    for url in tqdm(urllist):
        try:
            r = requests.get(url, allow_redirects=False)
            soup = BeautifulSoup(r.content.lower(), 'lxml')
            words = soup.find_all(text=lambda text: text and word.lower() in text)
            count = len(words)
            time.sleep(1)
            if count > 0:
                result = {'url': url, 'count': count, 'the_word': word}
                results.append(result)
        except ConnectionError:
            error1 = {'url': url, 'error': 'connection_error'}
            errors.append(error1)
            urllist.remove(url)
            continue
        except BaseHTTPError:
            error2 = {'url': url, 'error': 'base_error'}
            errors.append(error2)
            urllist.remove(url)
            continue
        except ChunkedEncodingError:
            error3 = {'url': url, 'error': 'encoding_error'}
            errors.append(error3)
            urllist.remove(url)
            continue
        except MissingSchema:
            urllist.remove(url)
            continue

df_errors = pd.DataFrame(errors)
print(errors)
print(urllist)
df_results = pd.DataFrame(results)
df_results.to_excel(r'Z:\001_Personal\results_nmb_run3.xlsx', index=False, header=True)
df_errors.to_excel(r'Z:\001_Personal\errors_nmb_run3.xlsx', header=True)
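Two things in this loop are worth checking. First, `requests.get` is called without a `timeout`, so a single server that accepts the connection but never responds will block the script forever, which would look exactly like the progress bar freezing. Second, `urllist.remove(url)` mutates the very list that `tqdm(urllist)` is iterating over, which silently skips elements and throws off the progress bar's total. The sketch below (using hypothetical placeholder strings, not real URLs) illustrates the mutation problem:

```python
# Illustration only: removing items from a list while iterating
# over it skips the element that slides into the removed slot.
urls = ['bad1', 'ok1', 'bad2', 'ok2']
visited = []

for u in urls:
    visited.append(u)
    if u.startswith('bad'):
        urls.remove(u)  # shrinks the list mid-iteration

print(visited)  # ['bad1', 'bad2'] -- 'ok1' and 'ok2' were never visited
print(urls)     # ['ok1', 'ok2']

# Safer pattern: iterate over a copy, so removals don't affect
# the sequence being walked.
urls = ['bad1', 'ok1', 'bad2', 'ok2']
visited = []
for u in list(urls):
    visited.append(u)
    if u.startswith('bad'):
        urls.remove(u)

print(visited)  # ['bad1', 'ok1', 'bad2', 'ok2'] -- every item seen

# For the hang itself, passing a timeout makes a dead server raise
# requests.exceptions.Timeout instead of blocking indefinitely, e.g.:
#   r = requests.get(url, allow_redirects=False, timeout=10)
```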