Replace entity with its label in SpaCy

2024-05-16 • Q&A

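Is there a way in SpaCy to replace an entity detected by SpaCy NER with its label? For example: I am eating an apple while playing with my Apple Macbook.

I have trained an NER model with SpaCy to detect "FRUITS" entities. The model correctly tags the first "apple" as "FRUITS" and does not tag the second "Apple".

I want to post-process my data by replacing each entity with its label, so I want to replace the first "apple" with "FRUITS". The sentence would become "I am eating an FRUITS while playing with my Apple Macbook."

If I simply use a regex, it will also replace the second "Apple" with "FRUITS", which is incorrect. Is there a smart way to do this?

Thanks!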

The entity label is an attribute of the token, `ent_type_` (see the spaCy Token documentation):

import spacy

nlp = spacy.load('en_core_web_lg')

s = "His friend Nicolas is here."
doc = nlp(s)

# Use the entity label where a token has one, otherwise the token text
print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', '.']

print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]))
# His friend PERSON is here .
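Note that joining with a single space puts a space before the final period. A minimal sketch of one way to keep the original spacing is to append each token's `whitespace_` instead. To keep the example runnable without downloading a model, it assumes a blank English pipeline and assigns the `PERSON` entity by hand; with a trained model, `doc.ents` would be filled in by the NER component. Each token of a multi-word entity would still get its own label here.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline stands in for the trained model,
# so the entity below is assigned by hand rather than predicted.
nlp = spacy.blank("en")
doc = nlp("His friend Nicolas is here.")
doc.ents = [Span(doc, 2, 3, label="PERSON")]  # token 2 is "Nicolas"

# Joining with " " inserts a space before the final period
spaced = " ".join(t.ent_type_ if t.ent_type_ else t.text for t in doc)

# Appending each token's trailing whitespace keeps the original spacing
replaced = "".join(
    (t.ent_type_ if t.ent_type_ else t.text) + t.whitespace_ for t in doc
)

print(spaced)
print(replaced)
```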

Edit:

To handle entities that span several words, the following code can be used instead:

s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
# Iterate in reverse so that substituting one entity does not shift
# the character offsets of the entities that come before it
for e in reversed(doc.ents):
    start = e.start_char
    end = start + len(e.text)
    newString = newString[:start] + e.label_ + newString[end:]
print(newString)
# His friend PERSON is here with PERSON and PERSON.
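Why the reverse iteration matters can be seen in a small pure-Python sketch. The `(start, end, label)` triples below are hypothetical stand-ins for `ent.start_char`, `ent.end_char` and `ent.label_`, and the helpers `locate` and `replace_spans` are made up for illustration. Substituting left to right changes the length of the string, so the later offsets no longer point at the right characters.

```python
text = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."

def locate(text, phrase, label):
    """Hypothetical helper: build a (start, end, label) triple for a phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

spans = [
    locate(text, "Nicolas J. Smith", "PERSON"),
    locate(text, "Bart Simpon", "PERSON"),
    locate(text, "Fred", "PERSON"),
]

def replace_spans(text, spans, reverse):
    """Replace each span with its label, in ascending or descending offset order."""
    out = text
    for start, end, label in sorted(spans, reverse=reverse):
        out = out[:start] + label + out[end:]
    return out

forward = replace_spans(text, spans, reverse=False)   # offsets go stale
backward = replace_spans(text, spans, reverse=True)   # offsets stay valid

print(forward)
print(backward)
```

Only the right-to-left version yields the intended sentence, because each substitution touches only text that lies after the spans still waiting to be processed.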

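A slightly shorter version of @DBaker's answer, which uses end_char instead of computing it:

text = doc.text  # start from the original text of the Doc
for ent in reversed(doc.ents):
    text = text[:ent.start_char] + ent.label_ + text[ent.end_char:]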

A more elegant modification of @DBaker's solution above for the case where entities can span multiple words:

import spacy

nlp = spacy.load('en_core_web_lg')
# Merge each multi-word entity into a single token
nlp.add_pipe("merge_entities")

s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)

print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', 'with', 'PERSON', 'and', 'PERSON', '.']

print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]))
# His friend PERSON is here with PERSON and PERSON .

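The spaCy documentation covers this. It uses a built-in pipeline component to do the job and has good support for multiprocessing. I believe this is the officially supported way to replace entities with their labels.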
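As a rough sketch of how this could be applied to a batch of texts, the example below streams documents through `nlp.pipe`. It assumes a blank English pipeline with an `entity_ruler` in place of the asker's trained NER model; the two patterns are made up for illustration, and the case-sensitive `TEXT` match is only a stand-in for what a trained model would learn. The helper name `replace_entities` is hypothetical as well.

```python
import spacy

# Assumption: a rule-based entity_ruler stands in for a trained NER model.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "FRUITS", "pattern": [{"TEXT": "apple"}]},   # lowercase only
    {"label": "PERSON", "pattern": "Nicolas J. Smith"},
])
# Merge multi-word entities into single tokens after they are detected
nlp.add_pipe("merge_entities")

def replace_entities(doc):
    """Return the text of doc with every entity replaced by its label."""
    return "".join((t.ent_type_ or t.text) + t.whitespace_ for t in doc)

texts = [
    "I am eating an apple while playing with my Apple Macbook.",
    "His friend Nicolas J. Smith is here.",
]

# nlp.pipe processes texts in batches; it also accepts n_process
# to spread the work across several processes.
results = [replace_entities(doc) for doc in nlp.pipe(texts)]
for line in results:
    print(line)
```

With a trained model, the `entity_ruler` lines would be dropped and `merge_entities` added after the `ner` component, as in the answer above.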
