R dplyr nested dummy coding

2024-05-18 • Q&A

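I need to recode a data set of test responses for use in another application (a program called BLIMP that imputes missing values). Specifically, I need to represent the assignment of test items to subscales with dummy codes.

Here I create a data frame that holds the responses of two persons to a 10-item test in a nested format. These data are a simplified version of the actual input table.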

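library(tidyverse)
# Two persons, 10 items each; items 1-5 form subscale 1, items 6-10 form subscale 2.
df <- tibble(
  person = rep(101:102, each = 10),
  item = as.factor(rep(1:10, 2)),
  response = sample(1:4, 20, replace = TRUE),
  scale = as.factor(rep(rep(1:2, each = 5), 2))
) %>% mutate(
  # Flag the last row of each subscale block with 1; all other rows stay NA.
  scale_last = case_when(
    as.integer(scale) != lead(as.integer(scale)) | is.na(lead(as.integer(scale))) ~ 1,
    TRUE ~ NA_real_
  )
)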

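The columns of df are:

person: ID numbers of the persons (10 rows per person)
item: test items 1-10 for each person. Note how the items are nested within each person.
response: the score on each item
scale: the test has two subscales. Items 1-5 are assigned to subscale 1, and items 6-10 are assigned to subscale 2.
scale_last: a code of 1 in this column marks the item as the last item of its assigned subscale. This feature becomes important below.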
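As a cross-check on the scale_last definition, here is a minimal sketch that derives the same flag by grouping on person and subscale instead of comparing each row with the next one. The toy table and its response values are hypothetical, and the sketch assumes the rows are already in item order within each person and subscale.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with the same layout as df (responses are made up).
toy <- tibble(
  person = rep(101:102, each = 10),
  item = factor(rep(1:10, times = 2)),
  response = rep(1:4, length.out = 20),
  scale = factor(rep(rep(1:2, each = 5), times = 2))
)

# Flag the last row of every person-by-subscale block with 1, NA elsewhere.
toy_flagged <- toy %>%
  group_by(person, scale) %>%
  mutate(scale_last = if_else(row_number() == n(), 1, NA_real_)) %>%
  ungroup()

# Row positions of the flagged items.
flagged_rows <- which(toy_flagged$scale_last == 1)
```

With this layout the flagged rows are 5, 10, 15 and 20, the same rows that carry a 1 in the question's output.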

I then use the recipes package to create dummy codes for the items.

library(recipes)
dum <- df %>%
  recipe(~ .) %>%
  step_dummy(item, one_hot = TRUE) %>%
  prep(training = df) %>%
  bake(new_data = df)
print(dum, width = Inf)

#   person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5 item_X6 item_X7
#    <int>    <int> <fct>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
# 1    101        2 1             NA       1       0       0       0       0       0       0
# 2    101        3 1             NA       0       1       0       0       0       0       0
# 3    101        3 1             NA       0       0       1       0       0       0       0
# 4    101        1 1             NA       0       0       0       1       0       0       0
# 5    101        1 1              1       0       0       0       0       1       0       0
# 6    101        1 2             NA       0       0       0       0       0       1       0
# 7    101        3 2             NA       0       0       0       0       0       0       1
# 8    101        4 2             NA       0       0       0       0       0       0       0
# 9    101        2 2             NA       0       0       0       0       0       0       0
#10    101        4 2              1       0       0       0       0       0       0       0
#11    102        2 1             NA       1       0       0       0       0       0       0
#12    102        1 1             NA       0       1       0       0       0       0       0
#13    102        2 1             NA       0       0       1       0       0       0       0
#14    102        3 1             NA       0       0       0       1       0       0       0
#15    102        2 1              1       0       0       0       0       1       0       0
#16    102        1 2             NA       0       0       0       0       0       1       0
#17    102        4 2             NA       0       0       0       0       0       0       1
#18    102        2 2             NA       0       0       0       0       0       0       0
#19    102        4 2             NA       0       0       0       0       0       0       0
#20    102        3 2              1       0       0       0       0       0       0       0
#   item_X8 item_X9 item_X10
#     <dbl>   <dbl>    <dbl>
# 1       0       0        0
# 2       0       0        0
# 3       0       0        0
# 4       0       0        0
# 5       0       0        0
# 6       0       0        0
# 7       0       0        0
# 8       1       0        0
# 9       0       1        0
#10       0       0        1
#11       0       0        0
#12       0       0        0
#13       0       0        0
#14       0       0        0
#15       0       0        0
#16       0       0        0
#17       0       0        0
#18       1       0        0
#19       0       1        0
#20       0       0        1
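For readers who want to see the one-hot layout without recipes, base R's model.matrix() can build the same kind of indicator columns. This is only a sketch on a hypothetical three-level factor, not the code from the question; note that the column names come out as item1, item2, item3 rather than with the item_X prefix shown above.

```r
# Hypothetical toy factor: four rows, three item levels.
item <- factor(c(1, 2, 3, 1))

# Dropping the intercept (- 1) keeps one indicator column per level,
# so every row has exactly one 1.
one_hot <- model.matrix(~ item - 1)

one_hot_dim <- dim(one_hot)
row_totals <- rowSums(one_hot)
```

Each row total is 1, which is the property the later recode relies on: a row has a single nonzero item column.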

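The output shows the item dummy codes in the columns with the item_ prefix. For downstream processing, I need a further recode. Within each subscale, the items must be dummy coded relative to the last item of that subscale. This is where the scale_last variable comes in; it identifies the rows of the output that need to be recoded.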

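For example, the first of these rows is row 5, the row for the last item (item 5) of subscale 1 for person 101. In this row, the value in the item_X5 column needs to be recoded from 1 to 0. In the next row to be recoded (row 10), the value in item_X10 needs to be recoded from 1 to 0. And so on.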

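I am struggling to find the right combination of dplyr verbs to accomplish this. What trips me up is the need to isolate specific cells within the specific rows to be recoded.

Thanks in advance for any help!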

library(dplyr)
dum %>%
  mutate_at(vars(starts_with("item")), ~ replace(., scale_last == 1, 0))

# A tibble: 20 x 14
#   person response scale scale_last item_X1 item_X2 item_X3 item_X4 item_X5
#    <int>    <int> <fct>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
# 1    101        2 1             NA       1       0       0       0       0
# 2    101        3 1             NA       0       1       0       0       0
# 3    101        1 1             NA       0       0       1       0       0
# 4    101        1 1             NA       0       0       0       1       0
# 5    101        3 1              1       0       0       0       0       0
# 6    101        4 2             NA       0       0       0       0       0
# 7    101        4 2             NA       0       0       0       0       0
# 8    101        3 2             NA       0       0       0       0       0
# 9    101        2 2             NA       0       0       0       0       0
#10    101        4 2              1       0       0       0       0       0
#11    102        2 1             NA       1       0       0       0       0
#12    102        1 1             NA       0       1       0       0       0
#13    102        4 1             NA       0       0       1       0       0
#14    102        4 1             NA       0       0       0       1       0
#15    102        4 1              1       0       0       0       0       0
#16    102        3 2             NA       0       0       0       0       0
#17    102        4 2             NA       0       0       0       0       0
#18    102        1 2             NA       0       0       0       0       0
#19    102        4 2             NA       0       0       0       0       0
#20    102        4 2              1       0       0       0       0       0
# … with 5 more variables: item_X6 <dbl>, item_X7 <dbl>, item_X8 <dbl>,
#   item_X9 <dbl>, item_X10 <dbl>
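The answer works because each row has exactly one 1 among its item_ columns, so zeroing every item_ column in a flagged row changes only the one cell that needed recoding. Since mutate_at() is superseded by across() as of dplyr 1.0.0, the same idea can be sketched in the newer style. The toy table below is hypothetical (one person, four items, two subscales of two items each) and stands in for dum; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for dum: one-hot item columns plus the scale_last flag.
toy <- tibble(
  scale_last = c(NA, 1, NA, 1),
  item_X1 = c(1, 0, 0, 0),
  item_X2 = c(0, 1, 0, 0),
  item_X3 = c(0, 0, 1, 0),
  item_X4 = c(0, 0, 0, 1)
)

# %in% treats NA as "not flagged", so only rows with scale_last == 1 are zeroed.
recoded <- toy %>%
  mutate(across(starts_with("item_"), ~ if_else(scale_last %in% 1, 0, .x)))
```

Using scale_last %in% 1 rather than scale_last == 1 avoids passing NA conditions to if_else(), which would otherwise turn the unflagged rows into NA.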


thinkphilo answered: R dplyr nested dummy coding
