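I am collecting some articles from Wikipedia (a few dozen to a few hundred, keeping the Wikipedia API etiquette limits in mind).
All of the articles are about brands, but in many cases the keyword is quite generic and does not refer only to the brand, so I get other suggestions, such as: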
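Arla may refer to:
- Arla (file system)
- Arla (moth), a genus of moth
- Arkansas Library Association
- Arla, Greece, a village
- Ärla, a village in south-eastern Sweden
- Arla Foods, a large Scandinavian producer ...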
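I want to pick out the one that belongs to the "brand" category, but I could also supply other related keywords, such as "food or beverage".
Can I use the Wikipedia API to extract only the suggestions that contain certain keywords?
The problem is that when a title is ambiguous, the response JSON has the same shape as when a single article is found.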
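One way the two cases might be told apart, offered as a sketch and not a confirmed solution: on en.wikipedia.org, disambiguation pages appear to carry a `disambiguation` page property, which can be requested by adding `pageprops` to `prop`. The helper names `build_params` and `is_disambiguation` are hypothetical, and the page objects at the bottom are hand-written for illustration, not real API output.

```python
def build_params(title):
    """Query parameters asking for the intro extract plus page properties."""
    return {
        "action": "query",
        "format": "json",
        "prop": "extracts|pageprops",
        "exintro": 1,
        "explaintext": 1,
        "ppprop": "disambiguation",  # assumption: set on disambiguation pages
        "redirects": 1,
        "titles": title,
    }


def is_disambiguation(page):
    """True if a page object from query.pages has the disambiguation property."""
    return "disambiguation" in page.get("pageprops", {})


# Offline illustration with hand-written page objects (not real API output).
ambiguous = {"title": "Arla", "pageprops": {"disambiguation": ""}}
article = {"title": "Arla Foods", "extract": "..."}
print(is_disambiguation(ambiguous), is_disambiguation(article))
```

The dictionary from `build_params` could be passed as `params=` to `requests.get`, and each page skipped or handled separately when `is_disambiguation` returns true.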
Here is my script:
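import time

import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

result = {}
for q in spotted_keywords:
    # Let requests build and URL-encode the query string.
    params = {
        'action': 'query', 'prop': 'extracts', 'exintro': 1,
        'explaintext': 1, 'format': 'json', 'redirects': 1, 'titles': q,
    }
    r = requests.get(API_URL, params=params)
    json_data = r.json()
    # Only one title is requested, so take the single page in the response.
    page = list(json_data['query']['pages'].values())[0]
    if 'extract' in page:
        result[q] = page['extract']
    time.sleep(1)  # pause between requests to stay polite to the API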
spotted_keywords is something like ["mcdonalds","cocacola" ...]
One response is:
{
    "batchcomplete": "",
    "query": {
        "normalized": [
            {"from": "arla", "to": "Arla"}
        ],
        "pages": {
            "360264": {
                "pageid": 360264,
                "ns": 0,
                "title": "Arla",
                "extract": "Arla may refer to:\n\nArla (file system)\nArla (moth),a genus of moth\nArkansas library Association\nArla,Greece,a village\n\u00c4rla,a village in south-eastern Sweden\nArla Foods,a large Scandinavian producer of dairy products\nArla (Finland),a subsidiary of Arla Foods\nArla Foods UK,a subsidiary of Arla Foods\nARLA,Arm\u00e9e r\u00e9volutionnaire de lib\u00e9ration de l'Azawad (French),Revolutionary Liberation Army of Azawad"
            }
        }
    }
}
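Given an extract like the one above, a rough way to narrow the list might be to scan each line for the related keywords. This is only a sketch: `candidate_titles` is a hypothetical helper, it assumes the first line is the "may refer to" header, and it assumes the title is the text before the first comma, which would truncate titles that contain a comma themselves (for example "Arla, Greece").

```python
def candidate_titles(extract, keywords):
    """Return titles of list entries whose text mentions any keyword."""
    matches = []
    # Skip the first line ("Arla may refer to:") and any blank lines.
    for line in extract.split("\n")[1:]:
        line = line.strip()
        if not line:
            continue
        if any(k.lower() in line.lower() for k in keywords):
            # Assumption: the title is the text before the first comma.
            matches.append(line.split(",")[0].strip())
    return matches


# Shortened copy of the extract from the response, for illustration.
extract = (
    "Arla may refer to:\n\nArla (file system)\nArla (moth),a genus of moth\n"
    "Arla,Greece,a village\n"
    "Arla Foods,a large Scandinavian producer of dairy products"
)
print(candidate_titles(extract, ["food", "dairy", "brand"]))
# prints ['Arla Foods']
```

Each returned title could then be requested again with the same script to fetch the actual article extract.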
Any hints?