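I am working with a very large CSV file (nearly 6 GB) that definitely contains a lot of errors. For example, suppose I have the following CSV file/table: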
+------------+-------------+------------+
| ID | Date | String |
+------------+-------------+------------+
| 123456 | 09-20-2019 | ABCDEFG |
| 123abc456 | 10-30-2019 | HIJKLMN |
| 7891011 | jdqhouehwf | OPQRSTU |
| 1010101 | 03-15-2018 | 8473737 |
| 4823.00 | 02-11-2015 | VWXYZ |
| 2348813.0 | 01-23-2016 | BAZ |
+------------+-------------+------------+
or:
"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"
I would like a good way to troubleshoot and fix the file. Using pandas, I can read the file:
import pandas as pd
df = pd.read_csv(inputfile)
pandas always complains:
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False
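To see which values trigger the warning without printing the whole table, something like the following might work. It is only a sketch: `find_bad_values` is a hypothetical helper name, and it assumes that reading every column as text (`dtype=str`) in chunks is acceptable, so pandas never has to guess a column type.

```python
import pandas as pd


def find_bad_values(source, column, chunksize=100_000, limit=20):
    """Return up to `limit` (row_label, value) pairs from `column` that are not numeric.

    `source` is a path or file-like object. This helper is hypothetical,
    not part of pandas.
    """
    bad = []
    # dtype=str keeps every cell as text, so pandas does not infer mixed types
    reader = pd.read_csv(source, dtype=str, chunksize=chunksize)
    for chunk in reader:
        # Values that cannot be parsed as numbers become NaN
        numeric = pd.to_numeric(chunk[column], errors="coerce")
        mask = numeric.isna()
        for row_label, value in chunk.loc[mask, column].items():
            bad.append((row_label, value))
            if len(bad) >= limit:
                return bad
    return bad
```

Calling `find_bad_values(inputfile, "ID")` would then return a short list of offending rows to look at, rather than the full 6 GB table. Empty cells are reported as well, since they also fail to parse as numbers.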
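So I would like to clean up each column. But since the file is so large, I can't just print the whole table with a mask applied and expect to be able to read the output. I want a simple way to take a column and check whether it conforms to a type. Also, if possible, I would like a way to delete bad rows and/or convert rows to the correct format. When all is said and done, I would like the file to look like this (not including the embedded comments):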
"ID","ABCDEFG"
# 123abc456,"HIJKLMN" was deleted because the ID wasn't a number
# 7891011,"OPQRSTU" was deleted because the data was not a date
1010101,"8473737" # The last number could be converted to string
4823,"VWXYZ" # The first number could be converted to integer
2348813,"BAZ" # The ID number could be converted to int
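One way the desired cleanup could be approached is sketched below. It rests on several assumptions: the column names `ID`, `Date`, and `String` from the sample, a date format of `%m-%d-%Y`, and the rule that a row is kept only if its ID is a whole number and its date parses. `clean_chunk` and `clean_csv` are hypothetical helper names, not pandas functions.

```python
import csv

import pandas as pd


def clean_chunk(chunk):
    """Keep rows with a whole-number ID and a valid date; return ID and String only."""
    # Unparseable IDs become NaN, unparseable dates become NaT
    ids = pd.to_numeric(chunk["ID"], errors="coerce")
    dates = pd.to_datetime(chunk["Date"], format="%m-%d-%Y", errors="coerce")
    # "4823.00" passes the whole-number check, "123abc456" does not
    keep = ids.notna() & (ids % 1 == 0) & dates.notna()
    return pd.DataFrame({
        "ID": ids[keep].astype("int64"),
        "String": chunk.loc[keep, "String"],
    })


def clean_csv(source, target, chunksize=100_000):
    """Stream `source` in chunks and write the cleaned rows to the open file `target`."""
    first = True
    # dtype=str avoids type guessing, so 8473737 in the String column stays text
    for chunk in pd.read_csv(source, dtype=str, chunksize=chunksize):
        clean_chunk(chunk).to_csv(
            target,
            header=first,  # write the header only once
            index=False,
            quoting=csv.QUOTE_NONNUMERIC,  # quote text fields, leave integers bare
        )
        first = False
```

A call might look like `with open("clean.csv", "w", newline="") as out: clean_csv(inputfile, out)`, where `clean.csv` is a placeholder name. Because only one chunk is held in memory at a time, the 6 GB file never has to be loaded in full. Note that IDs pass through floating point here, so extremely large IDs (beyond 2**53) could lose precision.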