How can I use Pandas groupby() to aggregate a column of comma-separated strings into a list?

2024-05-17 • Q&A

I have data like the following:

NAME    ETHNICITY_RECAT TOTAL_LENGTH    3LETTER_SUBSTRINGS
joseph  fr              14              jos,ose,sep,eph
ann     en              16              ann
anne    ir              14              ann,nne
tom     en              18              tom
tommy   fr              16              tom,omm,mmy
ann     ir              19              ann
... more rows

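The 3LETTER_SUBSTRINGS value is a string that captures all 3-letter substrings of the NAME variable. I want to aggregate it into one list per group, where each comma-separated item from every row is appended as its own list item. Like this: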

ETHNICITY_RECAT TOTAL_LENGTH            3LETTER_SUBSTRINGS
                min max mean            <lambda>
fr              2   26  13.22           [jos,eph,tom,oom,mmy,...]
en              3   24  11.92           [ann,...]
ir              4   23  12.03           [ann,nne,ann,...]

I use the following code to "accomplish" this:

aggregations = {
    'TOTAL_LENGTH': [min, max, 'mean'],
    '3LETTER_SUBSTRINGS': lambda x: list(x),
}

self.df_agg = self.df.groupby('ETHNICITY_RECAT', as_index=False).agg(aggregations)

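The problem is that a whole string such as "ann,anne" is treated as a single list item in the final list, instead of each substring being treated as its own list item, e.g. "ann", "anne".

I want to see which substrings occur most frequently, but right now, when I run the following code, I get the frequency of the whole strings (rather than of the individual 3-letter substrings):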

from collections import Counter 
x = self.df_agg_eth[self.df_agg_eth['ETHNICITY_RECAT']=='en']['3LETTER_SUBSTRINGS']['<lambda>']
x_list = x[0]
c = Counter(x_list)

I get:

[('jos,eph',19),('ann,nee',5),...]

instead of what I want:

[('jos',...),('ose',...),('sep',23),('eph',...),('ann',15),('nee',...),...]

I tried:

'3LETTER_SUBSTRINGS': lambda x: list(i) for i in x.split(','),

but it reports invalid syntax.

footballfl1987 answered: How can I use Pandas groupby() to aggregate a column of comma-separated strings into a list?

The first thing to do is convert each string into a list, and then use 'sum' as the aggregation for that column in groupby:

df['3LETTER_SUBSTRINGS'] = df['3LETTER_SUBSTRINGS'].str.split(',')

df.groupby('ETHNICITY_RECAT').agg({'TOTAL_LENGTH':['min','max','mean'],'3LETTER_SUBSTRINGS':'sum'})

Output:

                TOTAL_LENGTH                             3LETTER_SUBSTRINGS
                         min max  mean                                  sum
ETHNICITY_RECAT                                                            
en                        16  18  17.0                           [ann,tom]
fr                        14  16  15.0  [jos,ose,sep,eph,tom,omm,mmy]
ir                        14  19  16.5                      [ann,nne,ann]
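As a rough end-to-end sketch of the same split-then-aggregate idea on the sample rows from the question: the helper name `split_and_flatten` and the use of `itertools.chain` are my own choices for illustration, not part of the answer above, which concatenates the lists with 'sum' instead. It also hints at why the attempted lambda could not work as written: the aggregation function receives a whole group (a Series), so the split has to be applied to each string inside it.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "NAME": ["joseph", "ann", "anne", "tom", "tommy", "ann"],
    "ETHNICITY_RECAT": ["fr", "en", "ir", "en", "fr", "ir"],
    "TOTAL_LENGTH": [14, 16, 14, 18, 16, 19],
    "3LETTER_SUBSTRINGS": [
        "jos,ose,sep,eph", "ann", "ann,nne", "tom", "tom,omm,mmy", "ann",
    ],
})


def split_and_flatten(strings):
    """Split every comma-separated string of one group into a single flat list.

    Hypothetical helper: `strings` is the group's Series of raw strings.
    """
    return list(chain.from_iterable(s.split(",") for s in strings))


# One flat list of substrings per ethnicity.
flat = df.groupby("ETHNICITY_RECAT")["3LETTER_SUBSTRINGS"].agg(split_and_flatten)

# One Counter per ethnicity, keyed by the group label.
counts = {eth: Counter(items) for eth, items in flat.items()}

print(flat)
print(counts["ir"].most_common(1))
```

Splitting inside the aggregation function leaves the original column untouched, which may or may not be what you want; converting the column up front, as in the answer, is equally reasonable.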

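I think most of your code is fine; you just misread the error: it has nothing to do with string conversion. Each cell of the 3LETTER_SUBSTRING column holds a list/tuple. When you use lambda x:list(x), you create a list of tuples. So there is nothing like split(",") to do, and no need to cast to a string and back to a list...

Instead, you only need to unnest (flatten) the lists when you build the new list. Here is a small reproducible example (note that I focus on your tuple/aggregation problem, since I trust you will quickly work out the rest of the code):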

import pandas as pd
# Create some data
names = [("joseph","fr"),("ann","en"),("anne","ir"),("tom","en"),("tommy","fr")]
df = pd.DataFrame(names,columns=["NAMES","ethnicity"])
df["3LETTER_SUBSTRING"] = df["NAMES"].apply(lambda name: [name[i:i+3] for i in range(len(name) - 2)])
print(df)
# Aggregate the 3LETTER per ethnicity,and unnest the result in a new table for each ethnicity:
df.groupby('ethnicity').agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
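The comprehension `[z for y in x for z in y]` can be hard to read at first, so here is a plain-Python sketch of what it does to one group. The `group` list is a made-up stand-in for the cells of a single ethnicity, not data taken from the answer.

```python
# Hypothetical cells of one group: each cell is already a list of substrings.
group = [["ann"], ["ann", "nne"], ["ann"]]

# What `lambda x: list(x)` produces: the cells stay nested.
nested = list(group)

# What `lambda x: [z for y in x for z in y]` produces: one flat list.
flat = [z for y in group for z in y]

# The same comprehension written as explicit loops, read left to right.
flat_loop = []
for y in group:      # y is one cell (a list of substrings)
    for z in y:      # z is one substring inside that cell
        flat_loop.append(z)

print(nested)
print(flat)
```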

Using the Counter you specified, I get:

dfg = df.groupby('ethnicity',as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
from collections import Counter
print(Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0]))
# Counter({'ann': 1,'tom': 1})

To get it as a list of tuples, just use the built-in dictionary functionality, e.g. dict.items().
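Since the question asks for the most frequent substrings, `Counter.most_common()` may be closer to what is wanted than `items()`: it returns the same (item, count) tuples, but sorted by count. A minimal sketch, with a made-up `substrings` list standing in for one group's flattened list:

```python
from collections import Counter

# Hypothetical flattened list for one group.
substrings = ["ann", "nne", "ann", "tom"]
c = Counter(substrings)

pairs = list(c.items())    # (item, count) tuples in first-seen order
ranked = c.most_common()   # same tuples, highest count first
top = c.most_common(1)     # only the single most frequent item

print(pairs)
print(ranked)
print(top)
```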

Update: using the pre-formatted comma-separated strings, as in the question:

import pandas as pd
# Create some data
names = [("joseph","fr","jos,eph"),("ann","en","ann"),("anne","ir","ann,nne"),("tom","en","tom"),("tommy","fr","tom,mmy"),("ann","ir","ann")]
df = pd.DataFrame(names,columns=["NAMES","ethnicity","3LETTER_SUBSTRING"])
def transform_3_letter_to_table(x):
    """
    Update this function with regard to your data format
    """
    return x.split(",")
df["3LETTER_SUBSTRING"] = df["3LETTER_SUBSTRING"].apply(transform_3_letter_to_table)
print(df)
# Applying aggregation
dfg = df.groupby('ethnicity',as_index=False).agg({
    "3LETTER_SUBSTRING": lambda x:[z for y in x for z in y]
})
print(dfg)
# test on some data
from collections import Counter
c = Counter(dfg[dfg["ethnicity"] == "en"]["3LETTER_SUBSTRING"][0])
print(c)
print(list(c.items()))

aggregation pandas pandas-groupby python-3.x

Article link: https://www.f2er.com/3143061.html
