I'm using BeautifulSoup to scrape some basic information from a batch of Wikipedia pages. The program works, but it's slow (about 20 minutes for 650 pages). I'm trying to speed it up with multiprocessing, but it isn't behaving as expected: it either seems to stall and do nothing, or it only scrapes the first letter of each page name.

The scraping code I'm using is:
import requests
import wikipedia
from bs4 import BeautifulSoup

# dict where key is person's name and value is proper wikipedia url formatting
all_wikis = {'Adam Ferrara': 'Adam_Ferrara',
             'Adam Hartle': 'Adam_Hartle',
             'Adam Ray': 'Adam_Ray_(comedian)',
             'Adam Sandler': 'Adam_Sandler',
             'Adele Givens': 'Adele_Givens'}
bios = {}

def scrape(dictionary):
    for key in dictionary:
        # search each page (the value holds the proper URL title, not the key)
        page = requests.get("https://en.wikipedia.org/wiki/" + dictionary[key])
        soup = BeautifulSoup(page.text, "html.parser")
        # get data
        try:
            bday = soup.find('span', attrs={'class': 'bday'}).text
        except AttributeError:
            bday = 'Birthday Unknown'
        try:
            birthplace = soup.find('div', attrs={'class': 'birthplace'}).text
        except AttributeError:
            birthplace = 'Birthplace Unknown'
        try:
            death_date = (soup.find('span', attrs={'style': "display:none"}).text
                          .replace("(", "")
                          .replace(")", ""))
            living_status = 'Deceased'
        except AttributeError:
            death_date = 'N/A'  # must be set here too, or living people raise a NameError below
            living_status = 'Alive'
        try:
            summary = wikipedia.summary(dictionary[key].replace("_", " "))
        except Exception:
            summary = "No Summary"
        bios[key] = {}
        bios[key]['birthday'] = bday
        bios[key]['home_town'] = birthplace
        bios[key]['summary'] = summary
        bios[key]['living_status'] = living_status
        bios[key]['passed_away'] = death_date
I tried adding the processing at the end of the script with the code below, but it either doesn't work or only pulls the first letter of each page name (e.g., if the page I want is Bruce Lee, it instead pulls up the Wikipedia page for the letter B and then throws a bunch of errors).
from multiprocessing import Pool, cpu_count

if __name__ == '__main__':
    pool = Pool(cpu_count())
    results = pool.map(func=scrape, iterable=all_wikis)
    pool.close()
    pool.join()
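I think I've narrowed down the first-letter symptom: `pool.map` iterates `all_wikis` and so passes each *key string* to `scrape`, whose `for key in dictionary` loop then iterates the characters of that string. A minimal, network-free sketch of the behavior (the `first_item` helper is made up just to illustrate):

```python
from multiprocessing import Pool

def first_item(iterable):
    # mimic scrape's `for key in dictionary:` loop on whatever it receives
    return next(iter(iterable))

if __name__ == '__main__':
    wikis = {'Adam Ferrara': 'Adam_Ferrara', 'Adam Hartle': 'Adam_Hartle'}
    with Pool(2) as pool:
        # iterating a dict yields its keys, so each worker receives a key
        # string; iterating that string then yields single characters
        print(pool.map(first_item, wikis))  # prints ['A', 'A']
```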
Is there a better way to structure the script for multiprocessing?
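For reference, here is the restructuring I'm leaning toward, as a sketch (trimmed to two fields for brevity; the `parse_bio`/`scrape_one` names are my own). The idea is to give each worker one `(name, title)` pair via `all_wikis.items()` and collect the returned results, instead of mutating a global `bios` dict that child processes can't share with the parent:

```python
from multiprocessing import Pool, cpu_count

import requests
from bs4 import BeautifulSoup

def parse_bio(html):
    """Pure HTML -> fields step, split out so it can be tested offline."""
    soup = BeautifulSoup(html, "html.parser")
    bday = soup.find('span', attrs={'class': 'bday'})
    birthplace = soup.find('div', attrs={'class': 'birthplace'})
    return {'birthday': bday.text if bday else 'Birthday Unknown',
            'home_town': birthplace.text if birthplace else 'Birthplace Unknown'}

def scrape_one(item):
    """Worker function: takes ONE (name, wiki_title) pair, returns (name, bio)."""
    name, wiki_title = item
    page = requests.get("https://en.wikipedia.org/wiki/" + wiki_title)
    return name, parse_bio(page.text)

if __name__ == '__main__':
    all_wikis = {'Adam Ferrara': 'Adam_Ferrara',
                 'Adam Ray': 'Adam_Ray_(comedian)'}
    with Pool(cpu_count()) as pool:
        # items() hands each worker a whole (name, title) pair, and the
        # results come back to the parent instead of being written to a
        # global dict inside the child processes
        bios = dict(pool.map(scrape_one, all_wikis.items()))
```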