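I've received a dump of several thousand RTF files that I need to read. Most of them open fine as UTF-8. About 20% do not. Of that 20%, chardet claims roughly 90% are "Windows-1254". For the rest it just reports "None". So I am now trying to decode them and re-encode them as UTF-8. It fails in every case. I'll show the code and output first, then go into detail: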
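    import chardet

    # all_rtf_filenames: list of paths to the RTF files (built elsewhere)
    for filename in all_rtf_filenames:
        try:
            # First attempt: read the file as text with the default encoding
            with open(filename, 'r') as f:
                f.read()
        except UnicodeDecodeError:
            # Fall back to the raw bytes and let chardet guess the encoding
            with open(filename, 'rb') as f:
                bytes_data = f.read()
            _encoding = chardet.detect(bytes_data).get('encoding')
            if _encoding == "Windows-1254":
                try:
                    # Strict decode as cp1254, then re-encode as UTF-8
                    data = bytes_data.decode('cp1254').encode('utf8')
                except UnicodeDecodeError as e:
                    print("\n" + "-"*12 + "\nERROR CHANGING ENCODING")
                    print(e)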
Sample (partial) output:
------------
ERROR CHANGING ENCODING
'charmap' codec can't decode byte 0x8e in position 1254: character maps to <undefined>
------------
ERROR CHANGING ENCODING
'charmap' codec can't decode byte 0x90 in position 90744: character maps to <undefined>
------------
ERROR CHANGING ENCODING
'charmap' codec can't decode byte 0x8f in position 1610: character maps to <undefined>
------------
ERROR CHANGING ENCODING
'charmap' codec can't decode byte 0x9e in position 1454: character maps to <undefined>
------------
ERROR CHANGING ENCODING
'charmap' codec can't decode byte 0x8e in position 540: character maps to <undefined>
------------
ERROR CHANGING ENCODING
'charmap' codec can't decode byte 0x8d in position 834: character maps to <undefined>
------------
ERROR CHANGING ENCODING
'charmap' codec can't decode byte 0x9e in position 1366: character maps to <undefined>
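To check whether the bytes in these errors are simply unassigned in the code page, here is a minimal sketch that probes every byte value against Python's built-in `cp1254` codec. The helper name `undefined_bytes` is my own and not part of any library.

```python
def undefined_bytes(encoding):
    """Return the byte values that `encoding` cannot decode in strict mode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# Byte values with no character assigned in Python's cp1254 codec
cp1254_gaps = undefined_bytes("cp1254")
print([hex(b) for b in cp1254_gaps])
```

If every byte from the error messages shows up in that list, the strict decode is failing on unassigned code points, which suggests chardet's "Windows-1254" guess may not be the real encoding of these files.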
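Further notes:
- I have opened some of these files in LibreOffice. They all opened and showed clean English text.
- I don't care about preserving the special "Windows-1254" characters. Even dropping everything other than ASCII would be fine.
- I also tried the newer charset-normalizer library. Every attempt to detect the encoding returned "None".
- I looked at this article, which suggests that some of these bytes can creep in through bad encoding/decoding. But I'm not sure, because the article does not mention all of these bytes.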
Is there any way to open these files? If LibreOffice can do it, it must be possible.
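Since losing the non-ASCII characters is acceptable here, one option I can think of is a lossy decode using the `errors` argument of `bytes.decode`. This is only a sketch under that assumption. The function name `decode_lossy` and its `ascii_only` flag are hypothetical names I made up for illustration.

```python
def decode_lossy(raw, encoding="cp1254", ascii_only=False):
    """Decode bytes without raising on byte values the codec cannot map.

    Undecodable bytes become U+FFFD (the replacement character). With
    ascii_only=True, every non-ASCII character, including U+FFFD, is dropped.
    """
    text = raw.decode(encoding, errors="replace")
    if ascii_only:
        # Round-trip through ASCII, silently discarding anything outside it
        text = text.encode("ascii", errors="ignore").decode("ascii")
    return text


# 0x8e is one of the bytes from the error output above
sample = b"hello \x8e world"
print(decode_lossy(sample))
print(decode_lossy(sample, ascii_only=True))
```

For a real file, the bytes would come from `open(filename, 'rb').read()` as in the loop above. Whether the result is usable depends on where the odd bytes sit inside the RTF, which I have not verified.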