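pyarrow read_csv with a different number of columns per row

2024-05-17 • Q&A

My CSV file contains 14 million rows and has a variable number of columns. The first 27 columns are always present, and a row can have up to 16 additional columns, for a total of 43.

Using vanilla pandas, I have found a workaround: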

import numpy as np
import pandas as pd

# First pass: find the widest row without loading the whole file into memory.
largest_column_count = 0
with open(data_file, 'r') as temp_f:
    for line in temp_f:
        # Note: "+ 1" gives one column more than the number of comma-separated fields.
        column_count = len(line.split(',')) + 1
        largest_column_count = max(largest_column_count, column_count)

# Second pass: read every row against the full set of column names.
column_names = list(range(largest_column_count))
all_columns_df = pd.read_csv(data_file, header=None, delimiter=',', names=column_names, dtype='category').replace(np.nan, '', regex=True)

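This creates a table containing all my data, with blank cells where data is unavailable. It works well for smaller files. With the full file, my memory usage goes through the roof.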
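Before committing to 43 columns for every row, it may help to see how the row widths are actually distributed. The following is a minimal sketch using only Python's standard-library csv module (aliased as std_csv so it does not clash with pyarrow's csv); row_width_counts is a hypothetical helper name, and the sample data is made up. It streams rows one at a time, so the file is never held in memory as a whole.

```python
import csv as std_csv  # standard library, aliased to avoid clashing with pyarrow's csv
import io
from collections import Counter


def row_width_counts(lines):
    """Return a Counter mapping field count -> number of rows with that count."""
    counts = Counter()
    # csv.reader consumes the input lazily, one row at a time.
    for row in std_csv.reader(lines):
        counts[len(row)] += 1
    return counts


# Small in-memory demonstration with made-up data.
sample = io.StringIO("a,b,c\nd,e\nf,g,h\n")
counts = row_width_counts(sample)
widest = max(counts)  # the largest field count seen
```

With a real file, the same function could be called as row_width_counts(f) inside "with open(data_file, newline='') as f:". Unlike splitting on ',', csv.reader also copes with quoted fields that contain commas.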

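I have been reading about Apache Arrow, and after a few attempts at loading structured CSV files (same number of columns in every row), I was impressed. I tried loading my data file using the same concept as above: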

from pyarrow import csv

fixed_column_names = [str(i) for i in range(27)]
extra_column_names = [str(i) for i in range(len(fixed_column_names), largest_column_count)]

# Build a new list so fixed_column_names is not modified in place.
total_columns = fixed_column_names + extra_column_names

read_options = csv.ReadOptions(column_names=total_columns)
convert_options = csv.ConvertOptions(include_columns=total_columns, include_missing_columns=True, strings_can_be_null=True)

table = csv.read_csv(edr_filename, read_options=read_options, convert_options=convert_options)

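But I get the following error:

Exception: CSV parse error: Expected 43 columns, got 32

I need to use the csv module provided by pyarrow; otherwise I would not be able to create the pyarrow table and then convert it to pandas: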

from pyarrow import csv

Has anyone run into the same problem and can help me?

Edit:

Fixed the second code block.
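The error suggests that the pyarrow parser expects every row to have as many fields as there are column names. One possible workaround, sketched here under assumptions and not a confirmed solution, is to pad short rows to a fixed width before handing the data to pyarrow. The helper name pad_csv_rows, the fill value, and the sample data are all hypothetical; the sketch uses only Python's standard-library csv module (aliased as std_csv to keep it apart from pyarrow's csv).

```python
import csv as std_csv  # standard library, aliased to avoid clashing with pyarrow's csv
import io


def pad_csv_rows(src, dst, width, fill=""):
    """Copy CSV rows from src to dst, padding short rows to `width` fields.

    src and dst are text file objects. Rows are streamed one at a time,
    so the whole file is never held in memory.
    """
    reader = std_csv.reader(src)
    writer = std_csv.writer(dst, lineterminator="\n")
    for row in reader:
        if len(row) > width:
            raise ValueError(f"row has {len(row)} fields, expected at most {width}")
        # Append as many fill values as needed to reach the target width.
        writer.writerow(row + [fill] * (width - len(row)))


# Small in-memory demonstration with made-up data.
source = io.StringIO("a,b,c\nd,e\nf\n")
target = io.StringIO()
pad_csv_rows(source, target, width=3)
padded = target.getvalue()
```

For the real file, the padded output could be written to a temporary file and then passed to pyarrow's csv.read_csv with the same column_names. This costs an extra pass over 14 million rows, so whether it beats the pandas approach overall would need measuring.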
