In Python with Pandas, is it possible to read 4B rows in chunks and filter each chunk against a 30M-row dataframe already in memory?

2024-05-19 • Q&A

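I have a 4B-row table in Oracle and a 30M-row CSV. The two tables share 2 columns, and I want to filter the larger table using the smaller one on those columns. Due to security restrictions, I can't load the 30M-row CSV into Oracle and run a single join, which would be ideal. I have also tried doing this with SAS Enterprise Guide, but it seems to choke on the large join and fails to return before the connection to the Oracle table times out.

Python seems like a possible solution, but the 4B-row table won't fit in memory, even when reduced to the 6 columns I need (6 strings of under 25 characters each). Ideally, I'd like to do something like the following: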

import pandas as pd

csv_df = pd.read_csv(file_path)
result_df = pd.DataFrame()  # start with an empty dataframe
df_chunks = pd.read_sql(sql_query, con, chunksize=10000000)
for df_chunk in df_chunks:
    # each chunk arrives as a dataframe
    # join df_chunk to csv_df to get a filtered result
    # concatenate filtered result to result_df
    ...

The dataframe result_df would then be the collection of all filtered rows from the 4B-row Oracle table.

Thanks for any help!

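import pandas as pd

csv_df = pd.read_csv(file_path)
df_chunks = pd.read_sql(sql_query, con, chunksize=10000000)

chunk_list = []
for df_chunk in df_chunks:
    # the default inner merge keeps only rows whose key also appears in csv_df
    result = pd.merge(df_chunk, csv_df, on=['XXX'])
    chunk_list.append(result)

# concatenate once at the end rather than growing a dataframe inside the loop
result_df = pd.concat(chunk_list)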
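The chunk-and-merge idea can be tried end to end on toy data before pointing it at the real table. The following is only a minimal sketch under stated assumptions: an in-memory SQLite database stands in for the Oracle connection, and the names `filter_in_chunks`, `big_table`, `key_a`, `key_b`, and `payload` are hypothetical and do not come from the question. It de-duplicates the key columns first, since repeated key pairs in the CSV would otherwise multiply matching rows in the merge.

```python
import sqlite3

import pandas as pd


def filter_in_chunks(sql_query, con, keys_df, key_cols, chunksize):
    """Stream sql_query in chunks and keep rows whose key_cols appear in keys_df."""
    # Keep only the key columns and drop duplicates so the merge cannot multiply rows.
    keys = keys_df[key_cols].drop_duplicates()
    kept = []
    for chunk in pd.read_sql(sql_query, con, chunksize=chunksize):
        # Inner merge acts as a filter: only rows with a matching key pair survive.
        kept.append(chunk.merge(keys, on=key_cols, how="inner"))
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


# Toy stand-in for the large Oracle table (assumption: SQLite instead of Oracle).
con = sqlite3.connect(":memory:")
big = pd.DataFrame({
    "key_a": ["a", "b", "c", "d", "e", "f"],
    "key_b": ["1", "2", "3", "4", "5", "6"],
    "payload": ["p1", "p2", "p3", "p4", "p5", "p6"],
})
big.to_sql("big_table", con, index=False)

# Toy stand-in for the CSV: one duplicated key pair and one pair with no match.
keys_df = pd.DataFrame({
    "key_a": ["b", "e", "e", "z"],
    "key_b": ["2", "5", "5", "9"],
})

result_df = filter_in_chunks(
    "SELECT key_a, key_b, payload FROM big_table",
    con,
    keys_df,
    ["key_a", "key_b"],
    chunksize=2,
)
```

Only one chunk plus the key table is held in memory at a time, but the accumulated matches still have to fit. If they might not, writing each filtered chunk to disk inside the loop instead of appending it to a list is one alternative.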


Answered by always1988
