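Problem introduction
Hello,
I am building the data plan for my lab, which will begin a blinded clinical trial in January. Part of this task is to set up some data-processing pipelines so that once all the data are collected we can run the code quickly.
One of the outcome measures we are using is a behavioral test. Someone developed a JavaScript program that scores the test automatically; however, the output looks like 5 tables stacked on top of one another. With help from some Stack Overflow users, I was able to develop a pipeline that restructures a single txt file into a data frame that can then be analyzed. The problem I have now is how to process all the files at once.
My idea is to load all the files into a list and then operate on each element of the list with list.map or lapply. However, I have run into two problems, outlined below.
First, here are the code and data for processing a single data frame.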
input <- c("Cognitive Screen","Subtest/Section\t\t\tScore\tT-Score","1. Line Bisection\t\t9\t53","2. Semantic Memory\t\t8\t51","3. Word Fluency\t\t\t1\t56*","4. Recognition Memory\t\t40\t59","5. Gesture Object Use\t\t2\t68","6. Arithmetic\t\t\t5\t49","Cognitive TOTAL\t\t\t65","","Language Battery","Part 1: Language Comprehension","Spoken Language\t\t\tScore\tT-Score","7. Spoken Words\t\t\t17\t45*","9. Spoken Sentences\t\t25\t53*","11. Spoken Paragraphs\t\t4\t60","Spoken Language TOTAL\t\t46\t49*","Written Language\t\tScore\tT-Score","8. Written Words\t\t14\t45*","10. Written Sentences\t\t21\t48*","Written Language TOTAL\t\t35\t46*","Part 2: Expressive Language","Repetition\t\t\tScore\tT-Score","12. Words\t\t\t24\t55*","13. Complex Words\t\t8\t52*","14. Nonwords\t\t\t10\t58","15. Digit Strings\t\t8\t55","16. Sentences\t\t\t12\t63","Repetition TOTAL\t\t62\t57*","17. Naming Objects\t\t30\t55*","18. Naming actions\t\t36\t63","3. Word Fluency\t\t\t12\t56*","Naming TOTAL\t\t\t56\t57*","Spoken Picture Description\tScore\tT-Score","19. Spoken Picture Description\t\t","Reading Aloud\t\t\tScore\tT-Score","20. Words\t\t\t25\t50*","21. Complex Words\t\t8\t51*","22. Function Words\t\t3\t62","23. Nonwords\t\t\t6\t51*","Reading TOTAL\t\t\t42\t50*","Writing\t\t\t\tScore\tT-Score","24. Writing: Copying\t\t26\t52","25. Writing Picture Names\t14\t53*","26. Writing to Dictation\t28\t68","Writing TOTAL\t\t\t68\t58*","Written Picture Description\tScore\tT-Score","27. Written Picture Description\t\t")
Once the input has been created, here is the code I use to build the data frame (I know the data frame is all character; I will fix that later):
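# load the packages used below (readr, stringr, tibble, dplyr, tidyr)
library(tidyverse)
# when reading from a file instead of using the vector above
input <- read_lines('Example_data')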
# do the match and keep only the second column
header <- as_tibble(str_match(input,"^(.*?)\\s+Score.*")[,2,drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
mutate(row = row_number()) %>%
fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input,"^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
# name the columns explicitly so the later V3/V4/V5 references are reliable
colnames(scores) <- paste0("V",1:5)
scores <- as_tibble(scores) %>%
  mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores$V1),-1]
# merge the header with the scores to give each section
data <- left_join(scores,header,by = 'row')
#create correct header in new dataframe
data2 <- data.frame(domain = as.vector(str_replace(data$title,"Subtest/Section","cognition")),subtest = data$V3,score = data$V4,t.score = data$V5)
head(data2)
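For the type fix mentioned above, a minimal sketch follows. It assumes that the asterisk on some T-scores is only a flag that should be kept in its own column; what the asterisk actually means is not stated here, so the column name `starred` is purely illustrative, and the two-row `data2` is made-up stand-in data.

```r
library(dplyr)
library(stringr)

# Made-up stand-in for data2 with all columns stored as character
data2 <- data.frame(
  domain = c("cognition", "cognition"),
  subtest = c("Line Bisection", "Word Fluency"),
  score = c("9", "1"),
  t.score = c("53", "56*"),
  stringsAsFactors = FALSE
)

data3 <- data2 %>%
  mutate(
    # record the asterisk before it is stripped (hypothetical column name)
    starred = str_detect(t.score, fixed("*")),
    score = as.numeric(score),
    t.score = as.numeric(str_remove(t.score, fixed("*")))
  )

str(data3)
```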
OK, now on to handling multiple data files. My plan is to put all the txt files in one folder and then list all the files, like this:
# library(rlist)
# setwd("C:/Users/Brahma/Desktop/CAT TEXT FILES/Data")
# temp = list.files(pattern = "*Example")
# myfiles = lapply(temp,readLines)
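A sketch of the loading step that also keeps track of which file each element came from. The folder and file names are placeholders written to a temporary directory so the example runs on its own. Note that the `pattern` argument of `list.files` is a regular expression rather than a shell glob, so the example uses an explicit regex.

```r
# Assumption: all files sit in one folder; a temporary folder with two
# made-up files stands in for it here
data_dir <- file.path(tempdir(), "cat_example")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("Cognitive Screen", "1. Line Bisection\t\t9\t53"),
           file.path(data_dir, "Example_01.txt"))
writeLines(c("Cognitive Screen", "1. Line Bisection\t\t7\t48"),
           file.path(data_dir, "Example_02.txt"))

# full.names = TRUE returns complete paths, so setwd() is not needed
paths <- list.files(data_dir, pattern = "^Example.*\\.txt$", full.names = TRUE)

# Read every file into a list of character vectors, one element per file,
# and name each element after its file so results can be traced back later
myfiles <- lapply(paths, readLines)
names(myfiles) <- basename(paths)

names(myfiles)
```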
Reproducible example files:
myfiles <- list(c("Cognitive Screen","27. Written Picture Description\t\t"),c("Cognitive Screen","27. Written Picture Description\t\t"))
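This is where the trouble starts
I have tried using lapply and list.map from the rlist package. First, lapply does not seem to like the pipe operator, so I tried going step by step. I also tried writing a function for this step.
Creating the tibbles. This works!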
list_header <- lapply(myfiles,as_tibble)
Error occurs here, when I try to start the data manipulation
list_header2 <- lapply(list_header,str_match(list_header,"^(.*?)\\s+Score.*")[,2,drop = FALSE])
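This line of code gives the following error:
Error in match.fun(FUN) :
  'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing
So I tried writing a function to use here: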
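A likely reading of this error: the second argument of `lapply` has to be a function, but `str_match(...)[, 2, drop = FALSE]` is evaluated straight away on the whole list and yields a matrix, which `match.fun` then rejects. The sketch below passes an anonymous function instead, so that each element is handed to `str_match` one at a time. The two short character vectors are made-up stand-ins for real files.

```r
library(stringr)

# Two tiny stand-in "files", each a character vector of lines (made-up data)
myfiles <- list(
  c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score",
    "1. Line Bisection\t\t9\t53"),
  c("Language Battery", "Repetition\t\t\tScore\tT-Score",
    "12. Words\t\t\t24\t55*")
)

# FUN must be a function; this anonymous function receives one element
# (one file's lines) at a time as `lines`
list_header2 <- lapply(myfiles, function(lines) {
  str_match(lines, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]
})

list_header2[[1]]
```

Any pipe chain can go inside the body of such a function, which is one way around the impression that lapply and pipes do not mix.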
drop_rows <- function(df) {
new_df <- str_match_all(df[[1:3]]$value,"^(.*?)\\s+Score.*")
}
list_header2 <- lapply(list_header,drop_rows)
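Now I get this error:
Error in match.fun(FUN) :
  'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing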
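One possible way to batch the whole pipeline is to wrap the single-file steps in a function that takes one file's lines, then apply it with `lapply` and stack the results. This is a sketch under assumptions, not a tested solution for the real files: `parse_scores` and the element names `subj01`/`subj02` are hypothetical, the input is made-up, and the function works on the raw character vectors rather than on tibbles.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Hypothetical helper: turn one file's lines into a tidy data frame
parse_scores <- function(lines) {
  # section title found on each "... Score T-Score" line, copied down
  header <- tibble(
    title = str_match(lines, "^(.*?)\\s+Score.*")[, 2],
    row = seq_along(lines)
  ) %>%
    fill(title)

  # numbered rows: item number, subtest name, score, t-score
  m <- str_match(lines, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
  scores <- tibble(
    row = seq_along(lines),
    subtest = m[, 3],
    score = m[, 4],
    t.score = m[, 5]
  ) %>%
    filter(!is.na(subtest))

  # attach the section title to every scored row
  left_join(scores, header, by = "row") %>%
    transmute(
      domain = str_replace(title, "Subtest/Section", "cognition"),
      subtest, score, t.score
    )
}

# Made-up stand-in for the list of files
myfiles <- list(
  subj01 = c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score",
             "1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51"),
  subj02 = c("Language Battery", "Repetition\t\t\tScore\tT-Score",
             "12. Words\t\t\t24\t55*")
)

# One data frame per file, then stacked; `file` records the list name
all_scores <- lapply(myfiles, parse_scores) %>%
  bind_rows(.id = "file")

all_scores
```

Keeping the file name in its own column means the stacked result can still be split or summarised per participant afterwards.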
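Summary:
The code provided works when a single txt file is loaded. However, I run into trouble when I try to run the code to batch-process a list of several files. If someone can offer some insight into how to resolve this error, **I think** I will be able to finish the rest. That said, if you are willing to help implement the rest of the code, I won't object.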