Using aggregate with a custom function (using values based on another column)

2024-05-06 • Q&A

I have a table like this.

benchmark   technqiue     stat                        value
perlbench   compression   encoding_Zero               10
perlbench   compression   encoding_Repeated_Values    20
perlbench   compression   encoding_Base8_1            30
perlbench   compression   encoding_Base8_2            40
perlbench   compression   encoding_Base8_4            50
perlbench   compression   encoding_Base4_1            60
perlbench   compression   encoding_Base4_2            70
perlbench   compression   encoding_Base2_1            80
perlbench   compression   encoding_Uncompressed       90

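There are other combinations of benchmark and technique, but I am keeping the example simple.

For each combination of benchmark and technique, I want to take the value of each encoding, multiply each one by a different number, and sum the results. Then I want to add a new row holding that sum under a new stat name.

The function would be something like: compressed_size = (10 * 1 + 20 * 8 + 30 * 16 + ... + 90 * 64)

I have seen other questions about using aggregate with a custom function, but I am not sure how to pick the right multiplier for each value based on its stat type.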

If I understand you correctly, you need to apply a different multiplier to each stat in the table?

That sounds like a use case for case_when:

library(dplyr)

df_summary <- df %>%
  mutate(
    stat_multiplier = case_when(
      stat == 'encoding_Zero' ~ 1,
      stat == 'encoding_Repeated_Values' ~ 8,
      stat == 'encoding_Base8_1' ~ 16,
      [...],
      stat == 'encoding_Uncompressed' ~ 64,
      TRUE ~ 1 # if none of the above is true, this keeps the value as-is instead of returning NA
    )
  ) %>%
  # the column is spelled `technqiue` in the data shown in this post
  group_by(benchmark, technqiue) %>%
  summarise(
    compressed_size = sum(value * stat_multiplier, na.rm = TRUE)
  )
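The same per-stat lookup can also be sketched in base R with a named vector instead of case_when. This is only an illustration under stated assumptions: the post gives just the multipliers 1, 8, 16 and 64, so the values 24 through 56 below assume a constant step of 8, and the names `multipliers` and `weighted` are made up for this sketch.

```r
# Example data mirroring the table in the question
# (the column name `technqiue` is spelled as in the original data).
df <- data.frame(
  benchmark = "perlbench",
  technqiue = "compression",
  stat = c("encoding_Zero", "encoding_Repeated_Values", "encoding_Base8_1",
           "encoding_Base8_2", "encoding_Base8_4", "encoding_Base4_1",
           "encoding_Base4_2", "encoding_Base2_1", "encoding_Uncompressed"),
  value = seq(10, 90, by = 10)
)

# ASSUMPTION: only 1, 8, 16 and 64 appear in the post; the rest assume a step of 8.
multipliers <- c(
  encoding_Zero = 1, encoding_Repeated_Values = 8, encoding_Base8_1 = 16,
  encoding_Base8_2 = 24, encoding_Base8_4 = 32, encoding_Base4_1 = 40,
  encoding_Base4_2 = 48, encoding_Base2_1 = 56, encoding_Uncompressed = 64
)

# Look up each row's multiplier by stat name, so the result does not depend on row order.
df$weighted <- df$value * unname(multipliers[as.character(df$stat)])

# Sum the weighted values within each benchmark/technqiue group.
result <- aggregate(weighted ~ benchmark + technqiue, df, sum)
names(result)[3] <- "compressed_size"
result
```

Because the multiplier is matched on the stat name, a stat that is missing from `multipliers` yields NA for that row rather than silently picking up a neighbour's weight.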

We can write a function that builds an increasing sequence of multipliers and pass it to aggregate:

apply_fun <- function(x) {
   sum(x * c(1,seq_along(x[-1]) * length(x[-1])))
}

aggregate(value~benchmark + technqiue,df,apply_fun)

#  benchmark   technqiue value
#1 perlbench compression 19210
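To see where 19210 comes from, it may help to print the multipliers that the expression inside apply_fun generates for a nine-value group. The sketch below reuses that expression as written; `x` and `mult` are just illustrative names.

```r
# The nine values of one group, in the row order shown in the question
x <- seq(10, 90, by = 10)

# Same expression as inside apply_fun: 1 for the first value, then
# position * (group size - 1) for the remaining values.
mult <- c(1, seq_along(x[-1]) * length(x[-1]))
mult
# 1  8 16 24 32 40 48 56 64

sum(x * mult)
# 19210
```

Two properties follow from this: the weights are assigned by row position rather than by stat name, and the step between them equals the group size minus one, so it is 8 only when a group has exactly nine rows.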

The same function can also be used with dplyr or data.table:

library(dplyr)
df %>%
  group_by(benchmark, technqiue) %>%
  summarise(total = apply_fun(value))

library(data.table)
# .() in j returns a named column; a bare (total = ...) would come back as V1
setDT(df)[, .(total = apply_fun(value)), .(benchmark, technqiue)]
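The question also asks for the total to be added back as a new row with its own stat name. One possible way to do that in base R is sketched below, assuming the stat name `compressed_size` from the question's formula; `totals` and `df_with_total` are hypothetical names.

```r
# Example data mirroring the table in the question
df <- data.frame(
  benchmark = "perlbench",
  technqiue = "compression",
  stat = c("encoding_Zero", "encoding_Repeated_Values", "encoding_Base8_1",
           "encoding_Base8_2", "encoding_Base8_4", "encoding_Base4_1",
           "encoding_Base4_2", "encoding_Base2_1", "encoding_Uncompressed"),
  value = seq(10, 90, by = 10)
)

# Positional weighting function from the answer above
apply_fun <- function(x) {
  sum(x * c(1, seq_along(x[-1]) * length(x[-1])))
}

# One total per benchmark/technqiue group
totals <- aggregate(value ~ benchmark + technqiue, df, apply_fun)

# Label the totals with the new stat name (assumed from the question)
totals$stat <- "compressed_size"

# Reorder the columns to match df, then append the totals as new rows
df_with_total <- rbind(df, totals[, names(df)])
tail(df_with_total, 2)
```

With more than one benchmark/technqiue combination, `totals` holds one row per group, so each group gets its own `compressed_size` row.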

Data

df <- structure(list(
  benchmark = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
    .Label = "perlbench", class = "factor"),
  technqiue = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
    .Label = "compression", class = "factor"),
  stat = structure(c(9L, 7L, 4L, 5L, 6L, 2L, 3L, 1L, 8L),
    .Label = c("encoding_Base2_1", "encoding_Base4_1", "encoding_Base4_2",
               "encoding_Base8_1", "encoding_Base8_2", "encoding_Base8_4",
               "encoding_Repeated_Values", "encoding_Uncompressed", "encoding_Zero"),
    class = "factor"),
  value = c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L, 90L)),
  class = "data.frame", row.names = c(NA, -9L))


zhongyi9927 answered: Using aggregate with a custom function (using values based on another column)
