What is the correct way to use dplyr's slice_sample() inside my apply function?

2024-05-17 • Q&A

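In the code below, I simulate dice rolls at increasing sample sizes and compute the mean roll for each sample size. My lapply call works, but I'm not comfortable with it, because I know sample_n has been superseded by slice_sample in dplyr. I'd like to improve the code by using the current dplyr function instead of sample_n() inside lapply. I suspect I may have other syntax mistakes in the lapply call as well. The code is as follows: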

library(dplyr)

#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a die roll
dice_probs <- rep(1/6, 6) #the probability of each outcome per roll
dice_df <- data.frame(dice,dice_probs)

#Simulate dice rolls for each of these sample sizes and record the average of the rolls

sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size


output <- lapply(X=sample_sizes,FUN = function(var){ 
       obs = sample_n(dice_df,var,replace=TRUE) 
       sample_mean = mean(obs$dice)
       new.df <- data.frame(sample_mean,var)
      return(new.df)
            })

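The last step is to compute the difference from the expected value of 3.5. I want a column showing the difference between 3.5 and the sample mean. We should see the difference shrink as the sample size grows.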

output <- output %>%
      mutate(difference = across(sample_mean,~3.5 - .x))

When I run this, it throws the following error:

Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "list"

I also tried sapply, but got a similar error: no applicable method for 'mutate' applied to an object of class "c('matrix','array','list')"

In case it helps, here is my failed attempt at using slice_sample:

output <- lapply(X=sample_sizes,FUN = function(...){ 
       obs = slice_sample(dice_df,...,.preserve=TRUE) 
       sample_mean = mean(obs$dice)
       new.df <- data.frame(sample_mean,...)
      return(new.df)
            })

I get this error: Error: '...' used in an incorrect context

The n argument of slice_sample corresponds to the size argument of sample_n.
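As a rough illustration of that correspondence, here is a minimal sketch. It assumes dplyr 1.0.0 or later is installed; the sample size of 5 and the names old_way and new_way are made up for the example.

```r
library(dplyr)

dice_df <- data.frame(dice = 1:6)

# Superseded interface: the sample size is passed as `size`
old_way <- sample_n(dice_df, size = 5, replace = TRUE)

# Current interface: the sample size is passed as `n`
new_way <- slice_sample(dice_df, n = 5, replace = TRUE)

# Both calls return a data frame with 5 sampled rows
nrow(old_way)
nrow(new_way)
```

Inside lapply, this means passing the function's own argument by name, for example n = var, rather than forwarding `...`.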

To compute the differences for your output list, we can use purrr::map rather than dplyr::across.

library(dplyr)
library(purrr)

set.seed(123)
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a die roll
dice_probs <- rep(1/6, 6) #the probability of each outcome per roll
dice_df <- data.frame(dice,dice_probs)

#Simulate dice rolls for each of these sample sizes and record the average of the rolls

sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size

output <- lapply(X=sample_sizes,FUN = function(var){
  obs = slice_sample(dice_df,n  = var,replace=TRUE)
  sample_mean = mean(obs$dice)
  new.df <- data.frame(sample_mean,var)
  return(new.df)
})

output %>%
  map(~ 3.5 - .x$sample_mean)
#> [[1]]
#> [1] -0.5
#> 
#> [[2]]
#> [1] 0.42
#> 
#> [[3]]
#> [1] -0.04
#> 
#> [[4]]
#> [1] -0.34
#> 
#> [[5]]
#> [1] 0.025
#> 
#> [[6]]
#> [1] 0.0317
#> 
#> [[7]]
#> [1] 0.00416
#> 
#> [[8]]
#> [1] -2.6e-05
#> 
#> [[9]]
#> [1] -4.405e-05

Created on 2021-08-02 by the reprex package (v0.3.0)

Alternatively, we can use purrr::map_df and add a diff column to each one-row tibble, as Martin Gal suggested in the comments:

output %>%
  map_df(~ tibble(.x,diff = 3.5 - .x$sample_mean))

#> # A tibble: 9 x 3
#>   sample_mean       var       diff
#>         <dbl>     <dbl>      <dbl>
#> 1        2.6         10  0.9      
#> 2        3.28        25  0.220    
#> 3        3.66        50 -0.160    
#> 4        3.5        100  0        
#> 5        3.53      1000 -0.0270   
#> 6        3.50     10000 -0.00180  
#> 7        3.50    100000 -0.00444  
#> 8        3.50   1000000 -0.000226 
#> 9        3.50 100000000 -0.0000669

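The output is simply a list of one-row data.frame elements. We can bind them together with bind_rows and do the subtraction once, rather than repeating it for each element: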

library(dplyr)
bind_rows(output) %>% 
    mutate(difference = 3.5 - sample_mean )
  sample_mean       var  difference
1    3.500000        10  0.00000000
2    2.800000        25  0.70000000
3    3.440000        50  0.06000000
4    3.510000       100 -0.01000000
5    3.495000      1000  0.00500000
6    3.502200     10000 -0.00220000
7    3.502410    100000 -0.00241000
8    3.498094   1000000  0.00190600
9    3.500183 100000000 -0.00018332
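To see why binding first avoids the original error, here is a small sketch. The list output_demo is a hypothetical stand-in for output, with made-up sample means; it is not the result of the simulation.

```r
library(dplyr)

# Hypothetical stand-in for `output`: a list of one-row data frames
output_demo <- list(
  data.frame(sample_mean = 3.2, var = 10),
  data.frame(sample_mean = 3.6, var = 25)
)

class(output_demo)   # "list": mutate() has no method for a plain list

# bind_rows() stacks the list elements into a single data frame
combined <- bind_rows(output_demo)
class(combined)      # "data.frame": mutate() can work with this

result <- combined %>%
  mutate(difference = 3.5 - sample_mean)
result
```

Because mutate() dispatches on the class of its first argument, the fix is to change what is passed in, not the mutate() call itself.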

Here is a base R approach:

transform(do.call(rbind,output),difference = 3.5 - sample_mean)

#  sample_mean       var difference
#1        3.80        10  -0.300000
#2        3.44        25   0.060000
#3        3.78        50  -0.280000
#4        3.30       100   0.200000
#5        3.52      1000  -0.015000
#6        3.50     10000  -0.004200
#7        3.50    100000  -0.004370
#8        3.50   1000000   0.002696
#9        3.50 100000000   0.000356

If you only need the difference values, you can do this:

3.5 - sapply(output,`[[`,'sample_mean')
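A slightly stricter variant of the same idea, sketched with vapply, which checks that every element yields exactly one number. The list output_demo and its values are hypothetical, used only so the snippet runs on its own.

```r
# Hypothetical stand-in for `output`: a list of one-row data frames
output_demo <- list(
  data.frame(sample_mean = 3.2, var = 10),
  data.frame(sample_mean = 3.6, var = 25)
)

# vapply() errors if any element does not return a single numeric value
means <- vapply(output_demo, function(df) df$sample_mean, numeric(1))

# Vectorised subtraction gives one difference per sample size
difference <- 3.5 - means
difference
```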
