根据许多条件合并两个 CSV 文件

我有两个 CSV 文件。它们具有相同的列,但同一列中的每一行都不是唯一的,如下所示:

gpo_full.csv:
    Date           hearing_sub_type   topic      Specific_Date
    January,1997   Oversight          weather    January 12,1997
    June,2000      General            life       June 5,2000
    January,1997   General            forest     January 1,1997
    April,2001     Oversight          people     NaN 
    June,2000      Oversight          depressed  June 6,1997   General            weather    January 1,2000      Oversight          depressed  June 5,2000

CAP_cols.csv:
    majortopic   id     Chamber   topic           Date           Specific_Date
    21           79846  1         many forest     January,1997   January 1,1997
    4            79847  2         emotion         June,2000      June 6,2000
    13           79848  1         NaN             May,2001       NaN
    7            79849  2         good life       June,2000      June 5,2000
    21           79850  1         good weather    January,1997
    25           79851  1         rain & cloudy   January,1997   January 12,1997
    6            79852  2         sad & depressed June,2000

我想使用三个条件来匹配这些数据:Specific_Date、Date 和 topic。
首先,我想使用“日期”列对这些数据进行分组。接下来,我尝试使用“Specific_Date”列来缩小范围,因为此列中丢失了一些数据。最后,我想通过词嵌入等类似词使用“主题”列,以确保 gpo_full 中的哪些行可以与 CAP_cols 中的唯一行对应。
我尝试使用“日期”列对数据进行分组并将它们合并到 JSON 文件中。但是,我陷入了按特定日期和主题缩小范围的下一步工作中。
我对这个输出的想法是:

{
"Date": "January,1997","Specific_Date": "January 12,1997"
"Topic": {"GPO": "weather","CAP": "rain & cloudy"}
"GPO": {
    "hearing_sub_type": "Oversight",and other columns
}
"CAP": {
    "majortopic": "25","id": "79851","Chamber": "1"
}
},{
"Date": "January,"Specific_Date": "January 1,1997"
"Topic": {"GPO": "forest","CAP": "many forest"}
"GPO": {
    "hearing_sub_type": "General",and other columns
}
"CAP": {
    "majortopic": "21","id": "79846","Chamber": "1"
}
and similar for others}

我想了三天,还是没有头绪。实现这一目标的任何想法都会非常有帮助!非常感谢!

lixb07072014 回答:根据许多条件合并两个 CSV 文件

主题匹配存在一些问题,因此您需要扩展我使用的 match_topic() 方法,但我添加了一些逻辑以查看最后不匹配的内容。

results 变量包含一个 dict 列表,您可以轻松地将其保存为 JSON 文件。

检查内联注释以了解我使用的逻辑的推理。

旁注:

如果我是你,我会稍微重构一下 JSON。将 topic 作为键/值对放在 GPOCAP 键下对我来说比使用带有单独 TopicGPO 键的 CAP 键更有意义import csv from pprint import pprint import json # load gpo_full.csv into a list of dict using # csv.DictReader & list comprehension with open("path/to/file/gpo_full.csv") as infile: gpo_full = [item for item in csv.DictReader(infile)] # do the same for CAP_cols.csv with open("path/to/file/CAP_cols.csv") as infile: cap_cols = [item for item in csv.DictReader(infile)] def match_topic(gpo_topic: str,cap_topic: str) -> bool: """We need a function as some of the mapping is not simple Args: gpo_topic (str): gpo topic cap_topic (str): CAP topic Returns: bool: True if topics match """ # this one is simple if gpo_topic in cap_topic: return True # you need to repeat the below conditional check # for each custom topic matching elif gpo_topic == "weather" and cap_topic == "rain & cloudy": return True # example secondary topic matching elif gpo_topic == "foo" and cap_topic == "bar": return True # finally return false for no matches return False # we need this later gpo_length = len(gpo_full) results = [] cap_left_over = [] # do the actual mapping # this could've been done above,but I separated it intentionally for cap in cap_cols: found = False # first find the corresponding gpo for index,gpo in enumerate(gpo_full): if ( gpo["Specific_Date"] == cap["Specific_Date"] # check by date and match_topic(gpo["topic"],cap["topic"]) # check if topics match ): results.append({ "Date": gpo["Date"],"Specific_Date": gpo["Specific_Date"],"Topic": { "GPO": gpo["topic"],"CAP": cap["topic"] },"GPO": { "hearing_sub_type": gpo["hearing_sub_type"] },"CAP": { "majortopic": cap["majortopic"],"id": cap["id"],"Chamber": cap["Chamber"] } }) # pop & break to remove the gpo item # this is so you're left over with a list of # gpo items that didn't match # it also speeds up further matches gpo_full.pop(index) found = True break # this is to check if there's stuff left over if not found: cap_left_over.append(cap) with open('path/to/file/combined_json.json','w') as outfile: json.dump(results,outfile,indent=4) pprint(results) print(f'\nLength:\n Results: {len(results)}\n CAP: {len(cap)}\n GPO: {gpo_length}') print('\nLeftover GPO:') pprint(gpo_full) print('\nLeftover CAP:') pprint(cap_left_over) 键/值对...

pprint(results)

输出
我已从输出中删除了 Length: Results: 5 CAP: 6 GPO: 7 Leftover GPO: [{'Date': 'April,2001','Specific_Date': 'NaN ','hearing_sub_type': 'Oversight','topic': 'people'},{'Date': 'June,2000','Specific_Date': 'June 6,'topic': 'depressed'}] Leftover CAP: [{'Chamber': '2','Date': 'June,'id': '79847','majortopic': '4','topic': 'emotion'},{'Chamber': '1','Date': 'May,'Specific_Date': 'NaN','id': '79848','majortopic': '13','topic': 'NaN'}] ,请进一步查看 JSON

Date,hearing_sub_type,topic,Specific_Date
"January,1997",Oversight,weather,"January 12,1997"
"June,2000",General,life,"June 5,2000"
"January,forest,"January 1,1997"
"April,2001",people,NaN 
"June,depressed,"June 6,2000"

path/to/file/gpo_full.csv

majortopic,id,Chamber,Date,Specific_Date
21,79846,1,many forest,"January,1997"
4,79847,2,emotion,"June,2000"
13,79848,NaN,"May,"NaN"
7,79849,good life,2000"
21,79850,good weather,1997"
25,79851,rain & cloudy,1997"
6,79852,sad & depressed,2000"

path/to/file/CAP_cols.csv

[
    {
        "Date": "January,"Specific_Date": "January 1,"Topic": {
            "GPO": "forest","CAP": "many forest"
        },"GPO": {
            "hearing_sub_type": "General"
        },"CAP": {
            "majortopic": "21","id": "79846","Chamber": "1"
        }
    },{
        "Date": "June,"Specific_Date": "June 5,"Topic": {
            "GPO": "life","CAP": "good life"
        },"CAP": {
            "majortopic": "7","id": "79849","Chamber": "2"
        }
    },{
        "Date": "January,"Topic": {
            "GPO": "weather","CAP": "good weather"
        },"id": "79850","Specific_Date": "January 12,"CAP": "rain & cloudy"
        },"GPO": {
            "hearing_sub_type": "Oversight"
        },"CAP": {
            "majortopic": "25","id": "79851","Topic": {
            "GPO": "depressed","CAP": "sad & depressed"
        },"CAP": {
            "majortopic": "6","id": "79852","Chamber": "2"
        }
    }
]

path/to/file/combined_json.json

subset
本文链接:https://www.f2er.com/71539.html

大家都在问