Extracting core information from strings

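I have a list of fruits, each described by a detailed string. I want to extract the core information, i.e. the fruit name contained in each string.

Some names are a single word (e.g. Apple, Grape), others are two words (Water melon, Star fruit).

I tried an n-gram approach: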

from nltk import ngrams
from collections import Counter

strings = [
"Apples Fresh Golden Delicious","Apples Fresh Red Delicious 12","Apple Sliced 12","Apple Diced 24","Water melon Fresh Petite Green on the turn","16 count Star fruit","8 count Star fruit","4 count Star fruit","Grapes Red Fresh Seedless","Grapes Green Fresh Seedless","Orange Naval Fresh 100 Count","Orange Naval Fresh 48 Count","Orange Naval Fresh 24 Count","Orange Naval Fresh 12 Count"]

basket = []

for s in strings:
    grams = ngrams(s.split(), 1)  # use 2 for 2-grams

    for g in grams:
        basket.append("-".join(g))

print(Counter(basket))

1-gram (output abridged):

Counter({'Fresh': 9,'Orange': 4,'Naval': 4})

2-gram (output abridged):

Counter({'Orange-Naval': 4,'Naval-Fresh': 4,'count-Star': 3})

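Clearly, this does not work well: the most frequent n-grams are descriptors such as "Fresh" or "Naval-Fresh", not the fruit names.

Is there a better approach? Thanks.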
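One possible direction, offered only as a sketch: since frequency alone cannot tell a fruit name from a common descriptor, the strings could be matched against a small vocabulary of known fruit names. The list KNOWN_FRUITS and the helpers build_pattern and extract_fruit below are hypothetical and not part of the original post. The vocabulary would have to be maintained by hand or taken from an external source, and the plural handling (an optional trailing "s") is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical vocabulary of fruit names (an assumption, not from the post).
KNOWN_FRUITS = ["apple", "water melon", "star fruit", "grape", "orange"]


def build_pattern(names):
    # Try longer names first so multi-word names win over shorter ones.
    ordered = sorted(names, key=len, reverse=True)
    # Allow flexible whitespace inside a name and an optional plural "s".
    parts = [r"\s+".join(map(re.escape, n.split())) + "s?" for n in ordered]
    return re.compile(r"\b(" + "|".join(parts) + r")\b", re.IGNORECASE)


FRUIT_RE = build_pattern(KNOWN_FRUITS)


def extract_fruit(text):
    """Return the vocabulary entry found in text, or None if nothing matches."""
    match = FRUIT_RE.search(text)
    if match is None:
        return None
    # Lowercase and collapse internal whitespace.
    found = " ".join(match.group(1).lower().split())
    # Map a plural match such as "apples" back to its vocabulary entry.
    if found not in KNOWN_FRUITS and found.endswith("s"):
        found = found[:-1]
    return found


samples = [
    "Apples Fresh Golden Delicious",
    "Water melon Fresh Petite Green on the turn",
    "16 count Star fruit",
    "Grapes Red Fresh Seedless",
    "Orange Naval Fresh 100 Count",
]

# Count how many sample strings map to each fruit name.
print(Counter(extract_fruit(s) for s in samples))
```

Strings that mention no known fruit return None, so they can be collected and reviewed to extend the vocabulary.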
