Removing stop words from a spaCy Doc object

I am trying to figure out how to remove stop words from a spaCy Doc object while keeping the original parent object with all of its attributes.

import en_core_web_md
nlp = en_core_web_md.load()

sentence = "The frigate was decommissioned following Britain's declaration of peace with France in 1763,but returned to service in 1766 for patrol duties in the Caribbean"

tokens = nlp(sentence)
print("Parent type:",type(tokens))
print("Token type:",type(tokens[0]))
print("Sentence vector:",tokens.vector)
print("Word vector:",tokens[0].vector)

This returns:

Parent type: <class 'spacy.tokens.doc.Doc'>
Token type: <class 'spacy.tokens.token.Token'>
Sentence vector: [ 8.35970342e-02  1.38482109e-01  7.71872401e-02 -7.14236796e-02
...]
Word vector: [ 2.7204e-01 -6.2030e-02 -1.8840e-01  2.3225e-02 -1.8158e-02  6.7192e-03
...]

The typical solution for removing stop words is a list comprehension:

noStopWords = [t for t in tokens if not t.is_stop]
print("Parent type:",type(noStopWords))
print("Token type:",type(noStopWords[0]))
try:
    print("Sentence vector:",noStopWords.vector)
except AttributeError as e:
    print(e)
try:
    print("Word vector:",noStopWords[0].vector)
except AttributeError as e:
    print(e)

Because the parent object is now a list of Token objects rather than a Doc object, it no longer has the original attributes, so the code returns:

Parent type: <class 'list'>
Token type: <class 'spacy.tokens.token.Token'>
'list' object has no attribute 'vector'
Word vector: [ 9.4139e-01 -5.9546e-01  5.5007e-01  3.7544e-01  2.3021e-02 -4.4260e-01
...]

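So the rather ugly approach I found is to rebuild a string from the tokens and process it again. This is cumbersome because it does the work twice, and the nlp call is already slow.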

noStopWordsDoc = nlp(' '.join([t.text for t in noStopWords]))
print("Parent type:",type(noStopWordsDoc))
print("Token type:",type(noStopWordsDoc[0]))
try:
    print("Sentence vector:",noStopWordsDoc.vector)
except AttributeError as e:
    print(e)
try:
    print("Word vector:",noStopWordsDoc[0].vector)
except AttributeError as e:
    print(e)

This returns:

Parent type: <class 'spacy.tokens.doc.Doc'>
Token type: <class 'spacy.tokens.token.Token'>
Sentence vector: [ 9.78216752e-02  1.06186338e-01  1.66255698e-01 -9.38376933e-02
...]

Now, there must be a better way, right?

Answer: Removing stop words from a spaCy Doc object

To quote Ines Montani, one of spaCy's developers, directly:

One of the core principles of spaCy's Doc is that it should always represent the original input:

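spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.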
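
A minimal sketch of what "non-destructive" means in practice. It assumes a blank English pipeline (spacy.blank("en")) instead of the en_core_web_md model from the question, since only the tokenizer is needed and no model download is required. Joining each token's text_with_ws should give back the input exactly, so both comparisons are expected to print True.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because only the
# tokenizer is exercised (no vectors, tagger, or parser are needed).
nlp = spacy.blank("en")

text = "The frigate was decommissioned in 1763,but returned to service in 1766"
doc = nlp(text)

# Every token remembers its trailing whitespace, so the original string
# can be rebuilt character for character from the tokens.
rebuilt = "".join(token.text_with_ws for token in doc)

print(rebuilt == text)
print(doc.text == text)
```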

See this answer: Can a token be removed from a spaCy document during pipeline processing?
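
Since tokens cannot be dropped from an existing Doc, one possible workaround is to build a second Doc from the remaining words with the Doc constructor, which avoids running the pipeline a second time. This is only a sketch and not something the quoted answer proposes: the helper name without_stop_words is hypothetical, and spacy.blank("en") stands in for the en_core_web_md model from the question so that the example runs without a downloaded model.

```python
import spacy
from spacy.tokens import Doc


def without_stop_words(doc):
    """Build a new Doc holding only the non-stop-word tokens of `doc`.

    Hypothetical helper: the original Doc is left untouched.
    """
    kept = [t for t in doc if not t.is_stop]
    words = [t.text for t in kept]
    # Preserve whether each kept token was followed by whitespace.
    spaces = [bool(t.whitespace_) for t in kept]
    # Constructing a Doc directly shares the vocabulary but does not
    # run the tokenizer or any pipeline component again.
    return Doc(doc.vocab, words=words, spaces=spaces)


# Assumption: a blank pipeline replaces en_core_web_md for this sketch.
nlp = spacy.blank("en")
tokens = nlp("The frigate was decommissioned in 1763")
filtered = without_stop_words(tokens)

print("Parent type:", type(filtered))
print("Tokens:", [t.text for t in filtered])
```

With a model that ships word vectors loaded instead, filtered.vector should be the average of the remaining tokens' vectors, because those are looked up in the shared vocabulary. Annotations produced by pipeline components (tags, dependencies, entities) are not carried over to the new Doc, so this only fits cases where vocabulary-level attributes are enough.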
