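I am using a Counter to keep track of the number of times a substring occurs in a text, but what I am really searching for is the substring with the highest score. The score is defined as len(substring) * (occurrences - 1).
Currently I am doing this: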
from collections import Counter
from operator import itemgetter
input_string = "My amazing string with all sorts of values in it. " + \
"Kaas is lekker! I want to know how many times a certain substring" + \
"of a minumum size appears in it,so I can so some value encodings" + \
". Kaas is lekker! Performance is a problem when a string become l" + \
"arger. Kaas is lekker! So many strings to replace,what string is" + \
" best?"
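larger_then = 5  # only substrings longer than this many characters are considered
length = len(input_string)
# Enumerate every substring of at least larger_then + 1 characters.
subs = [input_string[i:j+1]
        for i in range(0, length - larger_then)
        for j in range(i + larger_then, length)]
countr = Counter(subs)
# Score each distinct substring as len(substring) * (occurrences - 1).
scores = ((key, len(key) * (count - 1)) for key, count in countr.most_common())
max_key, max_score = max(scores, key=itemgetter(1))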
print("")
print("Max key is '{}' with score {}".format(max_key,max_score))
print("")
top_20 = list(countr.most_common(20))
print("Top 20 commons:",*top_20,sep="\n- ")
Which returns:
Max key is '. Kaas is lekker! ' with score 36
Top 20 commons:
- ('string',5)
- (' strin',4)
- (' string',4)
- (' string ',3)
- ('string ',3)
- ('tring ',3)
- ('. Kaas',3)
- ('. Kaas ',3)
- ('. Kaas i',3)
- ('. Kaas is',3)
- ('. Kaas is ',3)
- ('. Kaas is l',3)
- ('. Kaas is le',3)
- ('. Kaas is lek',3)
- ('. Kaas is lekk',3)
- ('. Kaas is lekke',3)
- ('. Kaas is lekker',3)
- ('. Kaas is lekker!',3)
- ('. Kaas is lekker! ',3)
- (' Kaas ',3)
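To make the score definition concrete, here is a minimal sketch that recomputes it for two entries from the output above. The helper name `score` is only an illustration and is not part of the code I posted; the occurrence counts are taken from the printed results rather than recomputed.

```python
def score(substring: str, occurrences: int) -> int:
    """Score as defined in the question: len(substring) * (occurrences - 1)."""
    return len(substring) * (occurrences - 1)


# Counts copied from the output above (assumed correct, not recomputed here).
winner = score(". Kaas is lekker! ", 3)   # 18 characters, 3 occurrences
most_common = score("string", 5)          # 6 characters, 5 occurrences

print(winner, most_common)
```

This shows why the most frequent substring is not necessarily the best one: 'string' occurs more often, but the longer '. Kaas is lekker! ' still scores higher.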
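Is there a faster way to count the substrings and rank them by a score other than the plain count?
Here is a link to a Repl.it with the running code: https://repl.it/@keestalkstech/CavernousKindheartedScripts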