我想抓取可以更新的数据。当我要计划搜寻器每次运行时仅搜寻更新的零件时,该如何处理?我正在使用beautifulsoup。
liufei0225 回答:如何在每次运行时仅对更新的零件进行爬网
要仅隔离新的过帐,您需要找到某种主键,保证该主键对于每个结果都是唯一的。这可以是url,内部ID等。一旦找到唯一值,就需要在会话之间存储它。对于快速查找的东西,我将使用一组。这是一个基本示例,可以使用集和内置的pickle模块维护您在会话之间看到的结果。
我们将假定您已将结果隔离在列表elements
中,并且每个元素都是一个带有<a>
标签且具有唯一href
值的BeautifulSoup对象。
import os
import pickle
if not os.path.exists('seen.pickle'): # If the seen file doesnt exist,create it.
with open('seen.pickle','wb') as file:
pickle.dump(set(),file)
with open('seen.pickle','rb') as file: # Load the seen file so we can reference it.
seen = pickle.load(file)
for element in elements:
if not element['href'] in seen:
# Your own code here
seen.add(element['href']) # Add it to seen so it is never parsed again
with open('seen.pickle','wb') as file: # Save the seen file.
pickle.dump(seen,file)