I'm trying to figure out how to remove stop words from a spaCy Doc object while keeping the original parent object with all of its attributes.
import en_core_web_md
nlp = en_core_web_md.load()
sentence = "The frigate was decommissioned following Britain's declaration of peace with France in 1763, but returned to service in 1766 for patrol duties in the Caribbean"
tokens = nlp(sentence)
print("Parent type:",type(tokens))
print("Token type:",type(tokens[0]))
print("Sentence vector:",tokens.vector)
print("Word vector:",tokens[0].vector)
This returns:
Parent type: <class 'spacy.tokens.doc.Doc'>
Token type: <class 'spacy.tokens.token.Token'>
Sentence vector: [ 8.35970342e-02 1.38482109e-01 7.71872401e-02 -7.14236796e-02
...]
Word vector: [ 2.7204e-01 -6.2030e-02 -1.8840e-01 2.3225e-02 -1.8158e-02 6.7192e-03
...]
The typical solution for removing stop words is a list comprehension:
noStopWords = [t for t in tokens if not t.is_stop]
print("Parent type:",type(noStopWords))
print("Token type:",type(noStopWords[0]))
try:
    print("Sentence vector:", noStopWords.vector)
except AttributeError as e:
    print(e)
try:
    print("Word vector:", noStopWords[0].vector)
except AttributeError as e:
    print(e)
Since the parent object is now a list of Token objects rather than a Doc, it no longer has the original attributes, so the code returns:
Parent type: <class 'list'>
Token type: <class 'spacy.tokens.token.Token'>
'list' object has no attribute 'vector'
Word vector: [ 9.4139e-01 -5.9546e-01 5.5007e-01 3.7544e-01 2.3021e-02 -4.4260e-01
...]
So the rather ugly workaround I've found is to rebuild a string from the tokens and re-process it. This is wasteful, since it does the work twice, and the nlp call is already slow.
noStopWordsDoc = nlp(' '.join([t.text for t in noStopWords]))
print("Parent type:",type(noStopWordsDoc))
print("Token type:",type(noStopWordsDoc[0]))
try:
    print("Sentence vector:", noStopWordsDoc.vector)
except AttributeError as e:
    print(e)
try:
    print("Word vector:", noStopWordsDoc[0].vector)
except AttributeError as e:
    print(e)
Parent type: <class 'spacy.tokens.doc.Doc'>
Token type: <class 'spacy.tokens.token.Token'>
Sentence vector: [ 9.78216752e-02 1.06186338e-01 1.66255698e-01 -9.38376933e-02
...]
Surely there has to be a better way, right?
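One idea I've considered (assuming, as the spaCy docs state, that Doc.vector defaults to the average of the token vectors) is to compute a filtered sentence vector directly from the original Doc, skipping the second nlp pass entirely. A minimal sketch, using stand-in token objects so it runs without downloading the model:

```python
import numpy as np

class FakeToken:
    """Stand-in for spacy.tokens.Token, exposing only the two
    attributes used here: .vector and .is_stop."""
    def __init__(self, vector, is_stop):
        self.vector = vector
        self.is_stop = is_stop

def filtered_vector(tokens, fallback=None):
    """Average the vectors of non-stop-word tokens, mirroring how
    spaCy's Doc.vector averages all token vectors by default."""
    vecs = [t.vector for t in tokens if not t.is_stop]
    return np.mean(vecs, axis=0) if vecs else fallback

# "The" is a stop word; only the other two vectors are averaged.
doc = [
    FakeToken(np.array([1.0, 2.0]), is_stop=True),   # "The"
    FakeToken(np.array([3.0, 4.0]), is_stop=False),  # "frigate"
    FakeToken(np.array([5.0, 6.0]), is_stop=False),  # "decommissioned"
]
print(filtered_vector(doc))  # → [4. 5.]
```

On a real Doc you would call `filtered_vector(tokens)` directly, since Token exposes the same `.vector` and `.is_stop` attributes. This sidesteps the re-parse, though unlike a rebuilt Doc it only gives you the vector, not the other Doc attributes.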