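Using spaCy on pre-tokenized text

2024-05-18 • Q&A

I want to use spaCy to process text that has already been tokenized. Passing a list of tokens to spaCy does not work.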

import spacy
nlp = spacy.load("en_core_web_sm")
nlp(["This","is","a","sentence"])

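This raises a TypeError (which makes sense): TypeError: Argument 'string' has incorrect type (expected str,got list)

I could replace the tokenizer with a custom one, but I feel that would overcomplicate things and is not the preferred way.

Thanks for your help :D

You can use this method: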

# Note: tokens_from_list is a legacy tokenizer method (deprecated since spaCy v2)
tokens = ["This","is","a","sentence"]
sentence = nlp.tokenizer.tokens_from_list(tokens)
print(sentence)

This is a sentence
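
On newer spaCy versions, where tokens_from_list is (as far as I know) no longer available, the usual alternative is to build a Doc directly from the token list and then run the pipeline components over it. The following is a minimal sketch, not the original answerer's code. It assumes a blank English pipeline (spacy.blank("en")) so that it runs without a model download; with spacy.load("en_core_web_sm") the same loop would apply the tagger, parser, and so on.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank pipeline keeps this sketch free of model downloads.
# Swap in spacy.load("en_core_web_sm") to get tags, lemmas, and parses.
nlp = spacy.blank("en")

tokens = ["This", "is", "a", "sentence"]

# Build the Doc straight from the existing tokens; the tokenizer is never called,
# so the token boundaries stay exactly as given.
doc = Doc(nlp.vocab, words=tokens)

# Apply each remaining pipeline component (none in a blank pipeline) to the Doc.
for name, component in nlp.pipeline:
    doc = component(doc)

print([token.text for token in doc])
# ['This', 'is', 'a', 'sentence']
```

Because the Doc is created from the list, this avoids writing a custom tokenizer, which is what the question wanted to avoid.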

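If you use

sentence = nlp.tokenizer.tokens_from_list(tokens) together with spacy.matcher / Matcher, you get an error, because a Doc that only went through the tokenizer has no POS tags or lemmas for the pattern to match on:

Try using nlp() instead of nlp.make_doc() or list(nlp.pipe()) instead of list(nlp.tokenizer.pipe()).

My workaround is to loop over the items and run each one through nlp():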

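from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{'LEMMA': 'sentence', 'POS': 'NOUN'}]
# spaCy 2.2.2+ and v3 signature; older versions used matcher.add('Searched Word', None, pattern)
matcher.add('Searched Word', [pattern])
X = ["Sentence one", "Sentence two", "Sentence three", "sentence last !"]
# X is a plain list, so iterate over its items directly
for text in X:
    doc = nlp(text)  # full pipeline, so LEMMA and POS are set
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_span = doc[start:end]
        print(matched_span.text)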

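A better approach is to use nlp.pipe:

for doc in nlp.pipe(X):
    print([token.text for token in doc])

This also runs faster and is more efficient, because nlp.pipe processes the texts in batches.

Hope this helps. Thanks.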
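
The nlp.pipe loop above can be combined with the Matcher from the earlier snippet. This is a rough sketch under two assumptions that are not in the original answer: it uses a blank English pipeline so it runs without a trained model, and it therefore matches on LOWER instead of LEMMA/POS (which need a tagger). With en_core_web_sm loaded, the original LEMMA/POS pattern could be used instead. The matcher.add call uses the list-of-patterns signature, which assumes spaCy 2.2.2 or later.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
# LOWER compares the lowercased token text, so no tagger or lemmatizer is required.
matcher.add("SearchedWord", [[{"LOWER": "sentence"}]])

texts = ["Sentence one", "Sentence two", "Sentence three", "sentence last !"]

matched = []
# nlp.pipe streams the texts through the pipeline in batches.
for doc in nlp.pipe(texts):
    for match_id, start, end in matcher(doc):
        matched.append(doc[start:end].text)

print(matched)
# ['Sentence', 'Sentence', 'Sentence', 'sentence']
```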


whyzywj answered: Using spaCy on pre-tokenized text
