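I have a list of strings, each describing a fruit product in detail. I want to extract the core information from each string, i.e. the fruit name.
Some names are a single word (e.g. Apple, Grape), while others are two words (e.g. Water melon, Star fruit).
I tried an n-gram approach: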
from nltk import ngrams
from collections import Counter
strings = [
"Apples Fresh Golden Delicious","Apples Fresh Red Delicious 12","Apple Sliced 12","Apple Diced 24","Water melon Fresh Petite Green on the turn","16 count Star fruit","8 count Star fruit","4 count Star fruit","Grapes Red Fresh Seedless","Grapes Green Fresh Seedless","Orange Naval Fresh 100 Count","Orange Naval Fresh 48 Count","Orange Naval Fresh 24 Count","Orange Naval Fresh 12 Count"]
basket = []
for s in strings:
    grams = ngrams(s.split(), 1)  # use 2 for 2-grams
    for g in grams:
        basket.append("-".join(g))  # join the tokens of each n-gram with "-"
print(Counter(basket))
1-gram:
Counter({'Fresh': 9,'Orange': 4,'Naval': 4})
2-gram:
Counter({'Orange-Naval': 4,'Naval-Fresh': 4,'count-Star': 3})
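Clearly this does not work well: the most frequent n-grams are descriptors such as "Fresh" and "Naval-Fresh", not fruit names.
Is there a better approach? Thanks.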
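
For reference, here is a self-contained sketch that reproduces the two counts above without NLTK. The helper name `top_ngrams` and its parameter `k` are only illustrative, and the `zip` over shifted token lists is a stand-in for `nltk.ngrams`, not the NLTK implementation itself. It assumes Python 3.7 or later, where `Counter.most_common` lists tied counts in first-seen order.

```python
from collections import Counter

strings = [
    "Apples Fresh Golden Delicious", "Apples Fresh Red Delicious 12",
    "Apple Sliced 12", "Apple Diced 24",
    "Water melon Fresh Petite Green on the turn",
    "16 count Star fruit", "8 count Star fruit", "4 count Star fruit",
    "Grapes Red Fresh Seedless", "Grapes Green Fresh Seedless",
    "Orange Naval Fresh 100 Count", "Orange Naval Fresh 48 Count",
    "Orange Naval Fresh 24 Count", "Orange Naval Fresh 12 Count",
]


def top_ngrams(lines, n, k=3):
    """Return the k most common word n-grams (joined with '-') over all lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # Zipping n shifted views of the token list yields consecutive n-grams.
        for gram in zip(*(tokens[i:] for i in range(n))):
            counts["-".join(gram)] += 1
    return counts.most_common(k)


print(top_ngrams(strings, 1))
print(top_ngrams(strings, 2))
```

Raising `k` shows the rest of the table, where "Apples" and "Apple" are counted as separate tokens (2 each), so the raw frequency of a fruit name is split across its spellings while "Fresh" accumulates across every fruit.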