Tokenizing text into 3 sentences with NLTK and pandas

2024-05-18 • Q&A

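I have a pandas DataFrame with a single column named "text". The texts vary in length, but I need to tokenize each text into 3 sentences and then replace the original DataFrame with the result.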

Can anyone help?

This should give you what you need. The final DataFrame has one column, and each row holds a list of the first 3 sentences:

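import re
import pandas as pd

text_df = pd.DataFrame(
    ["hello world. hello america. hello europe. hello africa.",
     "hello world. hello antartica. hello europe. hello asia.",
     "hello world. hello africa. hello antartica. hello world."],
    columns=['text'])

# Split on sentence-ending punctuation (plus any closing quote or bracket) and keep the first 3 sentences.
# Note that the split consumes the punctuation, so the sentences lose their final '.', '?' or '!'.
text_df['text'] = text_df['text'].map(lambda x: re.split(r' *[\.\?!][\'"\)\]]* *', x)[:3])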

Output

print(text_df.to_string())
                                           text
0    [hello world, hello america, hello europe]
1  [hello world, hello antartica, hello europe]
2  [hello world, hello africa, hello antartica]
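If you want each sentence to keep its ending punctuation, one possible variant is sketched below. It is not part of the answer above: the helper name first_sentences and its n parameter are made up for illustration, and it assumes sentences end with '.', '?' or '!' followed by whitespace, so it will split incorrectly on abbreviations such as "Dr.".

```python
import re

# Hypothetical helper, not from the original answer.
# The lookbehind splits on whitespace that follows '.', '?' or '!',
# so the punctuation stays attached to the sentence before it.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.?!])\s+')

def first_sentences(text, n=3):
    """Return up to the first n sentences of text as a list of strings."""
    sentences = [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]
    return sentences[:n]

sample = "hello world. hello america. hello europe. hello africa."
print(first_sentences(sample))
```

It can be applied to the column the same way as the lambda, for example with text_df['text'].map(first_sentences). Texts with fewer than 3 sentences simply yield a shorter list. For real-world text, a trained tokenizer such as NLTK's sent_tokenize is likely to handle abbreviations better than a hand-written regex.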


ayangyy answered: Tokenizing text into 3 sentences with NLTK and pandas
