How do I use Pandas groupby() to aggregate a column of comma-separated strings into a list?

I have data like this:

NAME    ETHNICITY_RECAT TOTAL_LENGTH    3LETTER_SUBSTRINGS
joseph  fr              14              jos,ose,sep,eph
ann     en              16              ann
anne    ir              14              ann,nne
tom     en              18              tom
tommy   fr              16              tom,omm,mmy
ann     ir              19              ann
... more rows

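Each 3LETTER_SUBSTRINGS value is a string holding all 3-letter substrings of the NAME variable. I want to aggregate these into one list per group, with every comma-separated item from every row appended as its own list item, like this: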

ETHNICITY_RECAT TOTAL_LENGTH            3LETTER_SUBSTRINGS
                min max mean            <lambda>
fr              2   26  13.22           [jos,eph,tom,oom,mmy,...]
en              3   24  11.92           [ann,...]
ir              4   23  12.03           [ann,nne,ann,...]

I "accomplished" this with the following code:

aggregations = {
    'TOTAL_LENGTH': [min, max, 'mean'],
    '3LETTER_SUBSTRINGS': lambda x: list(x),
}

self.df_agg = self.df.groupby('ETHNICITY_RECAT',as_index=False).agg(aggregations)

The problem is that a whole string such as "ann,anne" is treated as a single item in the final list, instead of each substring being its own item, e.g. "ann", "anne".

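I want to see which substrings occur most often, but right now, when I run the following code, I get the frequency of the whole strings (rather than of the individual 3-letter substrings):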

from collections import Counter 
x = self.df_agg_eth[self.df_agg_eth['ETHNICITY_RECAT']=='en']['3LETTER_SUBSTRINGS']['<lambda>']
x_list = x[0]
c = Counter(x_list)

I get:

[('jos,eph',19),('ann,nee',5),...]

instead of what I want:

[('jos',...),('ose',...),('sep',23),('eph',...),('ann',15),('nee',...),...]

I tried:

'3LETTER_SUBSTRINGS': lambda x: list(i) for i in x.split(','),

but it reports invalid syntax.

footballfl1987 answered: How do I use Pandas groupby() to aggregate a column of comma-separated strings into a list?

The first thing to do is convert the strings to lists; then aggregate them with groupby and agg:

df['3LETTER_SUBSTRINGS'] = df['3LETTER_SUBSTRINGS'].str.split(',')

df.groupby('ETHNICITY_RECAT').agg({'TOTAL_LENGTH':['min','max','mean'],'3LETTER_SUBSTRINGS':'sum'})

Output:

                TOTAL_LENGTH                             3LETTER_SUBSTRINGS
                         min max  mean                                  sum
ETHNICITY_RECAT                                                            
en                        16  18  17.0                           [ann,tom]
fr                        14  16  15.0  [jos,ose,sep,eph,tom,omm,mmy]
ir                        14  19  16.5                      [ann,nne,ann]
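If only the per-group frequencies are needed, a different route is to skip the list aggregation and give each substring its own row. The following is a minimal sketch of that idea, not part of the answer above; it assumes pandas 0.25 or later (for DataFrame.explode) and uses the six sample rows from the question. The column name SUBSTRING is made up for illustration.

```python
import pandas as pd

# Sample rows copied from the table in the question
df = pd.DataFrame({
    "NAME": ["joseph", "ann", "anne", "tom", "tommy", "ann"],
    "ETHNICITY_RECAT": ["fr", "en", "ir", "en", "fr", "ir"],
    "TOTAL_LENGTH": [14, 16, 14, 18, 16, 19],
    "3LETTER_SUBSTRINGS": ["jos,ose,sep,eph", "ann", "ann,nne",
                           "tom", "tom,omm,mmy", "ann"],
})

# Split each string into a list, then give every substring its own row.
# SUBSTRING is a hypothetical helper column, not present in the original data.
exploded = df.assign(
    SUBSTRING=df["3LETTER_SUBSTRINGS"].str.split(",")
).explode("SUBSTRING")

# Count how often each substring occurs within each group
counts = exploded.groupby("ETHNICITY_RECAT")["SUBSTRING"].value_counts()
print(counts)
```

The result is a Series indexed by (ETHNICITY_RECAT, SUBSTRING), so a single count can be looked up with counts.loc[("ir", "ann")].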

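I think most of your code is fine; you just misread the error: it has nothing to do with string conversion. Each cell of the 3LETTER_SUBSTRING column holds a list/tuple. When you use lambda x: list(x), you create a list of tuples. So there is nothing like split(",") to do, and no need to cast to a string and back to a list...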

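Instead, you only need to un-nest the lists when building the new list. Here is a small reproducible example (note that I focused on your tuple/aggregation problem, since I am sure you will quickly work out the rest of the code):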

import pandas as pd
# Create some data
names = [("joseph","fr"),("ann","en"),("anne","ir"),("tom","en"),("tommy","fr")]
df = pd.DataFrame(names,columns=["NAMES","ethnicity"])
df["3LETTER_SUBSTRING"] = df["NAMES"].apply(lambda name: [name[i:i+3] for i in range(len(name) - 2)])
print(df)
# Aggregate the 3LETTER per ethnicity,and unnest the result in a new table for each ethnicity:
df.groupby('ethnicity').agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
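The nested comprehension in the lambda reads left to right: for each list y in the group x, for each item z in y, keep z. As a rough illustration, the sketch below runs the same flattening on a hypothetical list of lists standing in for one group, next to an equivalent spelling with itertools.chain.from_iterable from the standard library.

```python
from itertools import chain

# Hypothetical group: one list of substrings per row
group = [["ann"], ["tom"], ["jos", "ose"]]

# Same pattern as the lambda: outer loop over rows, inner loop over items
flat_comprehension = [z for y in group for z in y]

# Equivalent flattening with the standard library
flat_chain = list(chain.from_iterable(group))

print(flat_comprehension)
print(flat_chain)
```

Both produce a single flat list in row order, which is what Counter needs as input.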

Using the Counter you specified, I get:

dfg = df.groupby('ethnicity',as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
from collections import Counter
print(Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0]))
# Counter({'ann': 1,'tom': 1})

To get it as a list of tuples, just use the built-in dictionary methods, for example dict.items().
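Since the question asks for the highest frequencies, Counter.most_common may be more convenient than items(): it returns the (item, count) tuples sorted by count. A minimal sketch, using a hypothetical flattened list in place of one group's aggregated substrings:

```python
from collections import Counter

# Hypothetical flattened list, standing in for one group's aggregated substrings
substrings = ["ann", "nne", "ann", "tom"]

c = Counter(substrings)

# (substring, count) pairs in first-seen order
pairs = list(c.items())
print(pairs)

# The single most frequent substring together with its count
top = c.most_common(1)
print(top)
```

Calling most_common() with no argument returns every pair, ordered from most to least frequent.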


Update: using pre-formatted strings as in the question:

import pandas as pd
# Create some data
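# Names and ethnicities follow the sample table in the question
names = [("joseph","fr","jos,eph"),("ann","en","ann"),("anne","ir","ann,nne"),("tom","en","tom"),("tommy","fr","tom,mmy"),("ann","ir","ann")]
df = pd.DataFrame(names,columns=["NAMES","ethnicity","3LETTER_SUBSTRING"])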
def transform_3_letter_to_table(x):
    """
    Update this function with regard to your data format
    """
    return x.split(",")
df["3LETTER_SUBSTRING"] = df["3LETTER_SUBSTRING"].apply(transform_3_letter_to_table)
print(df)
# Applying aggregation
dfg = df.groupby('ethnicity',as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
print(dfg)
# test on some data
from collections import Counter
c = Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0])
print(c)
print(list(c.items()))
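To keep the TOTAL_LENGTH statistics from the question next to the flattened list, the two steps can be combined in one agg call. The sketch below is one possible way, assuming pandas 0.25 or later for named aggregation; the output column names (length_min, length_max, length_mean, substrings) and the flatten helper are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the table in the question
df = pd.DataFrame({
    "NAME": ["joseph", "ann", "anne", "tom", "tommy", "ann"],
    "ETHNICITY_RECAT": ["fr", "en", "ir", "en", "fr", "ir"],
    "TOTAL_LENGTH": [14, 16, 14, 18, 16, 19],
    "3LETTER_SUBSTRINGS": ["jos,ose,sep,eph", "ann", "ann,nne",
                           "tom", "tom,omm,mmy", "ann"],
})

# Turn each comma-separated string into a list of substrings
df["3LETTER_SUBSTRINGS"] = df["3LETTER_SUBSTRINGS"].str.split(",")


def flatten(lists):
    """Concatenate the per-row lists of one group into a single flat list."""
    return [item for sub in lists for item in sub]


# Named aggregation gives flat, readable column names instead of '<lambda>'
summary = df.groupby("ETHNICITY_RECAT").agg(
    length_min=("TOTAL_LENGTH", "min"),
    length_max=("TOTAL_LENGTH", "max"),
    length_mean=("TOTAL_LENGTH", "mean"),
    substrings=("3LETTER_SUBSTRINGS", flatten),
)
print(summary)
```

With flat column names, the list for one group can be passed to Counter directly, e.g. Counter(summary.loc["en", "substrings"]).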