I have a sentence and would like to get tokens like the ones shown below.
Sentence: "[x] works for [y] in [z]."
Tokens: ["[", "x", "]", "works", "for", "[", "y", "]", "in", "[", "z", "]", "."]
Expected: ["[x]", "works", "for", "[y]", "in", "[z]", "."]
How can I do this with a custom tokenizer function?
You can remove `\[` and `\]` from the tokenizer's prefix and suffix patterns so that the brackets are not split off from adjacent tokens:
import spacy

nlp = spacy.load('en_core_web_sm')

# Drop '\[' from the default prefix patterns so a leading '[' stays
# attached to the following token
prefixes = list(nlp.Defaults.prefixes)
prefixes.remove('\\[')
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

# Drop '\]' from the default suffix patterns so a trailing ']' stays
# attached to the preceding token
suffixes = list(nlp.Defaults.suffixes)
suffixes.remove('\\]')
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

doc = nlp("[x] works for [y] in [z].")
print([t.text for t in doc])
# ['[x]', 'works', 'for', '[y]', 'in', '[z]', '.']
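To see why removing those patterns changes the result, here is a minimal sketch (not spaCy's actual implementation) of how prefix/suffix splitting works: each whitespace-separated chunk is repeatedly stripped of any leading prefix match and trailing suffix match. The `tokenize` helper and its pattern lists are hypothetical, chosen only to illustrate the mechanism; spaCy's real default lists are much larger.

```python
import re

def tokenize(text, prefixes, suffixes):
    """Toy prefix/suffix tokenizer: split on whitespace, then peel
    prefix matches off the front and suffix matches off the end."""
    # Anchor the patterns the way spaCy's compile_*_regex helpers do
    prefix_re = re.compile("^(?:%s)" % "|".join(prefixes)) if prefixes else None
    suffix_re = re.compile("(?:%s)$" % "|".join(suffixes)) if suffixes else None
    tokens = []
    for chunk in text.split():
        start_toks, end_toks = [], []
        while chunk:
            m = prefix_re.match(chunk) if prefix_re else None
            if m and m.end() > 0:
                start_toks.append(chunk[:m.end()])   # peel off prefix
                chunk = chunk[m.end():]
                continue
            m = suffix_re.search(chunk) if suffix_re else None
            if m and m.start() < m.end():
                end_toks.insert(0, chunk[m.start():])  # peel off suffix
                chunk = chunk[:m.start()]
                continue
            break
        tokens.extend(start_toks + ([chunk] if chunk else []) + end_toks)
    return tokens

text = "[x] works for [y] in [z]."
# With bracket patterns present, the brackets are split off:
print(tokenize(text, [r"\["], [r"\]", r"\."]))
# ['[', 'x', ']', 'works', 'for', '[', 'y', ']', 'in', '[', 'z', ']', '.']
# With the bracket patterns removed, only the final '.' is split:
print(tokenize(text, [], [r"\."]))
# ['[x]', 'works', 'for', '[y]', 'in', '[z]', '.']
```

This mirrors the spaCy answer above: the patterns stay in charge of what gets detached, so deleting `\[` and `\]` from the lists leaves the brackets glued to their words.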
The relevant documentation is here:
https://spacy.io/usage/linguistic-features#native-tokenizer-additions