Converting a JSON file to a Pandas DataFrame

I want to convert JSON to a Pandas DataFrame.

My JSON looks like this:

{ 
   "country1":{ 
      "AdUnit1":{ 
         "floor_price1":{ 
            "feature1":1111,"feature2":1112
         },"floor_price2":{ 
            "feature1":1121
         }
      },"AdUnit2":{ 
         "floor_price1":{ 
            "feature1":1211
         },"floor_price2":{ 
            "feature1":1221
         }
      }
   },"country2":{ 
      "AdUnit1":{ 
         "floor_price1":{ 
            "feature1":2111,"feature2":2112
         }
      }
   }
}

I read the file from GCP using the following code:

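# Assumed imports for the Datalab API used below
from google.datalab import Context
import google.datalab.storage as storage

project = Context.default().project_id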
sample_bucket_name = 'my_bucket'
sample_bucket_path = 'gs://' + sample_bucket_name
print('Object: ' + sample_bucket_path + '/json_output.json')

sample_bucket = storage.Bucket(sample_bucket_name)
sample_bucket.create()
sample_bucket.exists()

sample_object = sample_bucket.object('json_output.json')
list(sample_bucket.objects())
json = sample_object.read_stream()

My goal is to obtain a Pandas DataFrame like the one shown below:

[Image: expected DataFrame]

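I tried using json_normalize, but without success.

jhfuuu answered: Converting a JSON file to a Pandas DataFrame

Nested JSON is always tricky to handle correctly.

A few months ago I came up with a way to give a "generic answer" using the nicely written flatten_json_iterative_solution from here: on each pass it unpacks one level of the given JSON, and it repeats until nothing nested is left.

The result can then simply be converted to a pandas.Series and then to a pandas.DataFrame, as follows: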

df = pd.Series(flatten_json_iterative_solution(dict(json_))).to_frame().reset_index()

Intermediate Dataframe result

A simple transformation then splits the index into the column names you asked for:

df[["index","col1","col2","col3","col4"]] = df['index'].apply(lambda x: pd.Series(x.split('_')))

Final result
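
To make the idea of unpacking one level per pass concrete, here is a minimal sketch. It is not the flatten_json_iterative_solution function referenced above; flatten_iteratively, the trimmed sample data, and the column names are all assumptions made for illustration. Because keys such as floor_price1 contain an underscore themselves, splitting on '_' would break them into two parts, so this sketch joins and splits on '.' instead.

```python
import pandas as pd

def flatten_iteratively(nested, sep="."):
    """Unpack one nesting level per pass until no dict values remain."""
    current = dict(nested)
    while any(isinstance(v, dict) for v in current.values()):
        unpacked = {}
        for key, value in current.items():
            if isinstance(value, dict):
                # Promote the children, prefixing each with the parent key
                for child_key, child_value in value.items():
                    unpacked[f"{key}{sep}{child_key}"] = child_value
            else:
                unpacked[key] = value
        current = unpacked
    return current

# Trimmed version of the JSON from the question
data = {
    "country1": {
        "AdUnit1": {
            "floor_price1": {"feature1": 1111, "feature2": 1112},
            "floor_price2": {"feature1": 1121},
        },
    },
    "country2": {"AdUnit1": {"floor_price1": {"feature1": 2111}}},
}

flat = flatten_iteratively(data)
df = pd.Series(flat).rename_axis("path").reset_index(name="value")

# Hypothetical column names; '.' does not occur inside any key
levels = ["country", "ad_unit", "floor_price", "feature"]
df[levels] = df["path"].str.split(".", expand=True)
df = df[levels + ["value"]]
print(df)
```

If the real keys could contain '.', another separator that never appears in them would be needed.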

,

You can use this:

def flatten_dict(d):
    """ Returns list of lists from given dictionary """
    l = []
    for k,v in sorted(d.items()):
        if isinstance(v,dict):
            flatten_v = flatten_dict(v)
            for my_l in reversed(flatten_v):
                my_l.insert(0,k)

            l.extend(flatten_v)

        elif isinstance(v,list):
            for l_val in v:
                l.append([k,l_val])

        else:
            l.append([k,v])

    return l

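This function takes a dictionary (possibly nested, with values that may also be lists) and flattens it into a list of lists.

Then you can simply do: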

df = pd.DataFrame(flatten_dict(my_dict))

Here my_dict is your JSON object. With your example, print(df) gives:

          0        1             2         3     4
0  country1  AdUnit1  floor_price1  feature1  1111
1  country1  AdUnit1  floor_price1  feature2  1112
2  country1  AdUnit1  floor_price2  feature1  1121
3  country1  AdUnit2  floor_price1  feature1  1211
4  country1  AdUnit2  floor_price2  feature1  1221
5  country2  AdUnit1  floor_price1  feature1  2111
6  country2  AdUnit1  floor_price1  feature2  2112

You can name the columns and the index when creating the DataFrame.
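
As a small illustration of naming the columns, the sketch below builds the DataFrame from rows in the shape that flatten_dict returns. The column names are hypothetical, and only the first three rows of the output above are reproduced.

```python
import pandas as pd

# Rows in the shape returned by flatten_dict (first three rows only)
rows = [
    ["country1", "AdUnit1", "floor_price1", "feature1", 1111],
    ["country1", "AdUnit1", "floor_price1", "feature2", 1112],
    ["country1", "AdUnit1", "floor_price2", "feature1", 1121],
]

# Hypothetical column names, passed when the DataFrame is created
columns = ["country", "ad_unit", "floor_price", "feature", "value"]
df = pd.DataFrame(rows, columns=columns)

# Optionally move the four key levels into a MultiIndex
indexed = df.set_index(columns[:-1])
print(indexed)
```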

,

You can try this approach:

 import json

 import pandas as pd
 from google.cloud import storage

 storage_client = storage.Client()
 bucket = storage_client.get_bucket('test-mvladoi')
 blob = bucket.blob('file')
 read_output = blob.download_as_string()
 data = json.loads(read_output)

 # One row; each column name is a dotted path such as
 # 'country1.AdUnit1.floor_price1.feature1' (pd.json_normalize needs pandas 1.0+)
 data_norm = pd.json_normalize(data,max_level=5)
 df = pd.DataFrame(columns=['col1','col2','col3','col4','col5'])
 i = 0

 for col in data_norm.columns:
     # country, ad unit, floor price, feature
     a,c,d,e = col.split('.')
     df.loc[i]  = [a,c,d,e,data_norm[col][0]]
     i = i + 1

 print(df)
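
The row-by-row loop above can also be written without explicit iteration. The following is only a sketch under stated assumptions: the data is a trimmed copy of the JSON from the question rather than a download from a bucket, the column names are hypothetical, every leaf is assumed to sit exactly four levels deep, and pd.json_normalize as a top-level function assumes pandas 1.0 or later.

```python
import pandas as pd

# Trimmed copy of the JSON from the question (no bucket access needed here)
data = {
    "country1": {
        "AdUnit1": {
            "floor_price1": {"feature1": 1111, "feature2": 1112},
        },
    },
    "country2": {
        "AdUnit1": {
            "floor_price1": {"feature1": 2111, "feature2": 2112},
        },
    },
}

# json_normalize yields a single row; take it as a Series indexed by dotted path
flat = pd.json_normalize(data).iloc[0]
df = flat.rename_axis("path").reset_index(name="value")

# Split every path into its four levels in one vectorized step
levels = ["country", "ad_unit", "floor_price", "feature"]
df[levels] = df["path"].str.split(".", expand=True)
df = df[levels + ["value"]]
print(df)
```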
,

Not the best approach, but it works. Also, you should adapt the flatten function, which was simply taken from this answer.

test = { 
   "country1":{ 
      "AdUnit1":{ 
         "floor_price1":{ 
            "feature1":1111,"feature2":1112
         },"floor_price2":{ 
            "feature1":1121
         }
      },"AdUnit2":{ 
         "floor_price1":{ 
            "feature1":1211
         },"floor_price2":{ 
            "feature1":1221
         }
      }
   },"country2":{ 
      "AdUnit1":{ 
         "floor_price1":{ 
            "feature1":2111,"feature2":2112
         }
      }
   }
}

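from collections import defaultdict
from collections.abc import MutableMapping

import pandas as pd

def flatten(d,parent_key='',sep='.'):
    """Flatten nested mappings into one dict whose keys are joined by sep."""
    items = []
    for k,v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        # collections.MutableMapping was removed in Python 3.10; use collections.abc
        if isinstance(v,MutableMapping):
            items.extend(flatten(v,new_key,sep=sep).items())
        else:
            items.append((new_key,v))
    return dict(items)

results = defaultdict(list)
# Placeholder names: four key levels plus the value (zip drops anything beyond these)
colnames = ["col1","col2","col3","col4","col5"]
# '.' is the separator because keys such as floor_price1 contain '_' themselves
for key,value in flatten(test).items():
    elements = key.split(".")
    elements.append(value)
    for colname,element in zip(colnames,elements):
        results[colname].append(element)

df = pd.DataFrame(results)
print(df)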
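
A way to sidestep the separator question entirely is to keep each key path as a tuple instead of a joined string. The sketch below is an illustration only: iter_leaves, the sample data, and the level names are assumptions, and it relies on every leaf sitting at the same depth so that pandas can build a regular MultiIndex from the tuple keys.

```python
import pandas as pd

def iter_leaves(d, path=()):
    """Yield (path_tuple, value) for every leaf of a nested dict."""
    for key, value in d.items():
        if isinstance(value, dict):
            yield from iter_leaves(value, path + (key,))
        else:
            yield path + (key,), value

# Trimmed copy of the test dictionary above
sample = {
    "country1": {
        "AdUnit1": {
            "floor_price1": {"feature1": 1111, "feature2": 1112},
            "floor_price2": {"feature1": 1121},
        },
    },
}

# A dict keyed by equal-length tuples becomes a Series with a MultiIndex
series = pd.Series(dict(iter_leaves(sample)))
series.index.names = ["country", "ad_unit", "floor_price", "feature"]

# Turn the index levels back into ordinary columns
df = series.reset_index(name="value")
print(df)
```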