How to fix "TypeError: cannot use a string pattern on a bytes-like object"

I am trying to tokenize news articles whose text I extract from a URL. However, when I call sent_tokenize, I run into an error about using a string pattern on a bytes-like object.

This is new territory for me, so any help would be appreciated.

from urllib.request import urlopen
from bs4 import BeautifulSoup

def getTextWaPo(url):
    page = urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, "lxml")
    text = ' '.join(map(lambda p: p.text, soup.find_all('article')))
    return text.encode('ascii', errors='replace').replace(b"?", b" ")

text = getTextWaPo(articleURL)


from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation

sents = sent_tokenize(text)
sents

Here is the error I get:

TypeError                                 Traceback (most recent call last)
<ipython-input-21-2b4d24de1e83> in <module>
----> 1 sents=sent_tokenize(text)
      2 sents

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text,language)
    104     """
    105     tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
--> 106     return tokenizer.tokenize(text)
    107 
    108 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in tokenize(self,text,realign_boundaries)
   1275         Given a text,returns a list of the sentences in that text.
   1276         """
-> 1277         return list(self.sentences_from_text(text,realign_boundaries))
   1278 
   1279     def debug_decisions(self,text):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in sentences_from_text(self,realign_boundaries)
   1329         follows the period.
   1330         """
-> 1331         return [text[s:e] for s,e in self.span_tokenize(text,realign_boundaries)]
   1332 
   1333     def _slices_from_text(self,text):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in <listcomp>(.0)
   1329         follows the period.
   1330         """
-> 1331         return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in span_tokenize(self,realign_boundaries)
   1319         if realign_boundaries:
   1320             slices = self._realign_boundaries(text,slices)
-> 1321         for sl in slices:
   1322             yield (sl.start,sl.stop)
   1323 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _realign_boundaries(self,slices)
   1360         """
   1361         realign = 0
-> 1362         for sl1,sl2 in _pair_iter(slices):
   1363             sl1 = slice(sl1.start + realign,sl1.stop)
   1364             if not sl2:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _pair_iter(it)
    316     it = iter(it)
    317     try:
--> 318         prev = next(it)
    319     except StopIteration:
    320         return

~\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\punkt.py in _slices_from_text(self,text)
   1333     def _slices_from_text(self,text):
   1334         last_break = 0
-> 1335         for match in self._lang_vars.period_context_re().finditer(text):
   1336             context = match.group() + match.group('after_tok')
   1337             if self.text_contains_sentbreak(context):

TypeError: cannot use a string pattern on a bytes-like object
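For context, NLTK's punkt tokenizer compiles string regex patterns, so sent_tokenize only accepts str, while getTextWaPo above returns bytes because of the final .encode(...). A minimal sketch of the decode step (the helper name ensure_text is illustrative, not part of NLTK):

```python
def ensure_text(value, encoding='utf8'):
    # punkt applies a *string* regex to its input, so passing bytes
    # raises "cannot use a string pattern on a bytes-like object".
    # Decode bytes back to str before tokenizing.
    if isinstance(value, bytes):
        return value.decode(encoding, errors='replace')
    return value

raw = b"First sentence. Second sentence."
text = ensure_text(raw)
print(type(text).__name__)  # str
```

With the text decoded this way, sent_tokenize(ensure_text(getTextWaPo(articleURL))) should no longer raise the TypeError. Alternatively, the ASCII round-trip in getTextWaPo could simply be dropped so the function returns str directly.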