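I'm working on an NLP project and trying to tokenize Great Expectations by paragraph, then store the paragraphs in a list. I need this so I can run some unsupervised topic models on the text.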
import re

#reading in great expectations
fp = open("dickens-great.txt")
great = fp.read()
print(great[0:100])
#processing
great_paras=[]
for paragraph in great:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    great_paras.append(para)
print(great_paras[0:4])
The output I get is:
[['C'],['h'],['a'],['p']]
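As you can see, printing the great_paras variable shows the problem: the text is being split character by character instead of paragraph by paragraph. I've been struggling with this for a while, and any help would be greatly appreciated!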
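For reference, here is a minimal sketch of one way to get paragraphs instead of characters. Iterating over a Python string with `for ... in` yields one character at a time, which would explain the output above. The sketch assumes that paragraphs in the file are separated by one or more blank lines, which may not hold for every plain-text edition. The function name `split_paragraphs` and the sample text are hypothetical, and the double-dash is replaced with a space (rather than removed outright) so that neighbouring words do not fuse together.

```python
import re

def split_paragraphs(text):
    """Split raw text into a list of paragraph strings.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Normalize Windows line endings so the blank-line pattern matches.
    text = text.replace("\r\n", "\n")
    paragraphs = []
    # Split on a newline, optional whitespace, then another newline.
    for block in re.split(r"\n\s*\n", text):
        # Replace the double-dash with a space (use "" to drop it instead),
        # then collapse line breaks and repeated spaces within the paragraph.
        cleaned = " ".join(block.replace("--", " ").split())
        if cleaned:  # skip empty blocks left by leading/trailing blank lines
            paragraphs.append(cleaned)
    return paragraphs

# Hypothetical sample standing in for the contents of the text file.
sample = "Chapter I\n\nFirst line--still the\nsame paragraph.\n\n\nSecond paragraph."
print(split_paragraphs(sample)[0:4])
```

With the file from the question, the call would be something like `great_paras = split_paragraphs(great)`, where `great` is the string returned by `fp.read()`.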