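Problem introduction

Hello,

I am setting up the data plan for my lab, which will begin a blinded clinical trial in January. Part of this task is to build some data processing pipelines so that we can run the code quickly once all of the data has been collected.

One of the outcome measures we are using is a behavioural test. Someone developed a JavaScript program that scores the test automatically; however, the output looks like 5 tables stacked on top of each other. With the help of some Stack Overflow users, I was able to develop a pipeline that restructures a single txt file into a data frame that can then be analysed. The problem I have now is how to process all of the files at once.

My idea is to load all of the files into a list and then manipulate each element of the list with list.map or lapply. However, I have run into two problems, which are outlined below.

First, here are the code and data for processing a single data frame.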

input <- c("Cognitive Screen","Subtest/Section\t\t\tScore\tT-Score","1. Line Bisection\t\t9\t53","2. Semantic Memory\t\t8\t51","3. Word Fluency\t\t\t1\t56*","4. Recognition Memory\t\t40\t59","5. Gesture Object Use\t\t2\t68","6. Arithmetic\t\t\t5\t49","Cognitive TOTAL\t\t\t65","","Language Battery","Part 1: Language Comprehension","Spoken Language\t\t\tScore\tT-Score","7. Spoken Words\t\t\t17\t45*","9. Spoken Sentences\t\t25\t53*","11. Spoken Paragraphs\t\t4\t60","Spoken Language TOTAL\t\t46\t49*","Written Language\t\tScore\tT-Score","8. Written Words\t\t14\t45*","10. Written Sentences\t\t21\t48*","Written Language TOTAL\t\t35\t46*","Part 2: Expressive Language","Repetition\t\t\tScore\tT-Score","12. Words\t\t\t24\t55*","13. Complex Words\t\t8\t52*","14. Nonwords\t\t\t10\t58","15. Digit Strings\t\t8\t55","16. Sentences\t\t\t12\t63","Repetition TOTAL\t\t62\t57*","17. Naming Objects\t\t30\t55*","18. Naming actions\t\t36\t63","3. Word Fluency\t\t\t12\t56*","Naming TOTAL\t\t\t56\t57*","Spoken Picture Description\tScore\tT-Score","19. Spoken Picture Description\t\t","Reading Aloud\t\t\tScore\tT-Score","20. Words\t\t\t25\t50*","21. Complex Words\t\t8\t51*","22. Function Words\t\t3\t62","23. Nonwords\t\t\t6\t51*","Reading TOTAL\t\t\t42\t50*","Writing\t\t\t\tScore\tT-Score","24. Writing: Copying\t\t26\t52","25. Writing Picture Names\t14\t53*","26. Writing to Dictation\t28\t68","Writing TOTAL\t\t\t68\t58*","Written Picture Description\tScore\tT-Score","27. Written Picture Description\t\t")

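Once the input file has been created, here is the code I used to create the data frame (I know the columns of the data frame are all character; I will fix that later):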

library(tidyverse)  # provides read_lines, str_match, as_tibble, fill, left_join

input <- read_lines('Example_data')

# do the match and keep only the second column
header <- as_tibble(str_match(input,"^(.*?)\\s+Score.*")[,2,drop = FALSE])
colnames(header) <- 'title'

# add index to the list so we can match the scores that come after
header <- header %>%
  mutate(row = row_number()) %>%
  fill(title)  # copy title down

# pull off the scores on the numbered rows
scores <- str_match(input,"^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
  mutate(row = row_number())
scores3 <- mutate(scores,row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]),-1]

# merge the header with the scores to give each section
data <- left_join(scores,header,by = 'row')

#create correct header in new dataframe
data2 <- data.frame(domain = as.vector(str_replace(data$title,"Subtest/Section","cognition")),subtest = data$V3,score = data$V4,t.score = data$V5)

head(data2)

OK, now on to handling multiple data files. My plan is to put all of the txt files in one folder and then list all of the files, like so:

# library(rlist)
# setwd("C:/Users/Brahma/Desktop/CAT TEXT FILES/Data")
# temp = list.files(pattern = "*Example")
# myfiles = lapply(temp,readLines)
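
A minimal sketch of this loading step in base R, offered as one possible approach rather than the code actually used in the lab. The function name `read_score_files`, the `data_dir` argument, and the "Example" name stem are assumptions for illustration. Note that `list.files` interprets `pattern` as a regular expression, not a shell glob.

```r
# Assumption: all score files sit in one folder and share a common name stem.
# `read_score_files`, `data_dir` and the default pattern are hypothetical names.
read_score_files <- function(data_dir, pattern = "Example") {
  # full.names = TRUE returns usable paths, so no setwd() is needed
  paths <- list.files(data_dir, pattern = pattern, full.names = TRUE)
  files <- lapply(paths, readLines)
  # name each element after its file so results can be traced back later
  names(files) <- basename(paths)
  files
}
```

Keeping the file names on the list means that any later per-file result can still be matched to the file it came from.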

Reproducible example files:

myfiles <- list(c("Cognitive Screen","27. Written Picture Description\t\t"),c("Cognitive Screen","27. Written Picture Description\t\t"))

This is where the trouble starts

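I tried using lapply, and list.map from the rlist package. First, lapply does not seem to like the pipe operator, so I tried working step by step. I also tried creating a function for this step.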

Creating the tibbles. This works!

list_header <- lapply(myfiles,as.tibble)

Error: trying to start the data manipulation

list_header2 <- lapply(list_header,str_match(list_header,"^(.*?)\\s+Score.*")[,2,drop = FALSE])

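This line of code gives the following error:

Error in match.fun(FUN) : 
  'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing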

So I tried writing a function to put here:

drop_rows <- function(df) {
  new_df <- str_match_all(df[[1:3]]$value,"^(.*?)\\s+Score.*")
}

list_header2 <- lapply(list_header,drop_rows)

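Now I get this error:

Summary:

The code provided works when loading a single txt file. However, when I try to run the code to batch process multiple lists, I run into trouble. If anyone could offer some insight into how to resolve this error, **I think** I will be able to do the rest. That said, if you are willing to help implement the rest of the code, I won't object.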

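Rather than trying to debug your code, I tried to find a solution that works with your example data. The following seems to work for both a single vector and a list of vectors: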

library(tidyverse)

text_to_tibb <- function(char_vec){
    # split each line on tabs, drop empty fields, bind the lines into one tibble
    str_split(char_vec,"\t") %>% 
        map_dfr(~ .[nchar(.) > 0] %>% matrix(.,nrow = 1) %>%
                    as_tibble
                ) %>% 
        # keep lines with a second field (section headers and scored rows); drop TOTAL rows
        filter(!is.na(V2),!str_detect(V1,"TOTAL")) %>%
        # lines that do not start with "<number>." are section headers
        mutate(title = str_detect(V1,"^\\d+\\.",negate = TRUE),group = cumsum(title)
               ) %>% 
        group_by(group) %>%
        mutate(domain = first(V1)) %>%  # copy the section header onto its rows
        filter(!title) %>% 
        ungroup() %>% 
        select(domain,V1,V2,V3) %>% 
        mutate(V1 = str_remove(V1,"^\\d+\\. "),domain = str_replace(domain,"Subtest.*","Cognition")) %>% 
        rename(subtest = V1,score = V2,t_score = V3)
}

If you run it on the input variable, you should get a clean tibble:

text_to_tibb(input)

#### OUTPUT ####
# A tibble: 26 x 4
   domain           subtest            score t_score
   <chr>            <chr>              <chr> <chr>  
 1 Cognition        Line Bisection     9     53     
 2 Cognition        Semantic Memory    8     51     
 3 Cognition        Word Fluency       1     56*    
 4 Cognition        Recognition Memory 40    59     
 5 Cognition        Gesture Object Use 2     68     
 6 Cognition        Arithmetic         5     49     
 7 Spoken Language  Spoken Words       17    45*    
 8 Spoken Language  Spoken Sentences   25    53*    
 9 Spoken Language  Spoken Paragraphs  4     60     
10 Written Language Written Words      14    45*    
# … with 16 more rows
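
The score and t_score columns above are still character, which the question mentions wanting to fix later. A minimal sketch of one way to convert them, assuming the trailing asterisk should be kept as a separate flag; the post does not say what the asterisk means, and the `t_flag` column name is made up for illustration. The two rows are taken from the sample data.

```r
library(dplyr)
library(stringr)

# Two rows from the sample data, still stored as character
scores <- tibble(
  subtest = c("Line Bisection", "Word Fluency"),
  score   = c("9", "1"),
  t_score = c("53", "56*")
)

scores_num <- scores %>%
  mutate(
    # record the asterisk before stripping it (its meaning is an open assumption)
    t_flag  = str_detect(t_score, fixed("*")),
    score   = as.numeric(score),
    t_score = as.numeric(str_remove(t_score, fixed("*")))
  )
```

Computing `t_flag` first matters, because later expressions inside the same `mutate` call see the already overwritten `t_score`.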

The function also works on the list of vectors you included above. Just use lapply or purrr::map:

map(myfiles,text_to_tibb)
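
On the match.fun(FUN) error in the question: lapply and map expect a function as their second argument, whereas a call such as str_match(...)[, 2, drop = FALSE] is evaluated before lapply ever runs, so lapply receives a matrix instead of a function. A minimal sketch of the usual fix, which wraps the step in a function; `extract_titles` and the two short vectors are made up for illustration.

```r
library(stringr)

# Hypothetical helper: pull the section title out of each line of ONE file
extract_titles <- function(lines) {
  str_match(lines, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]
}

# Two shortened stand-ins for the real files
example_files <- list(
  c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53"),
  c("Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52")
)

# Pass the function itself; lapply calls it once per list element
titles <- lapply(example_files, extract_titles)
```

The same pattern applies to pipelines: put the whole pipe inside a function body and hand that function to lapply.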

If you think some of the tables may contain inconsistencies, you can try safely:

safe_text_to_tibb <- safely(text_to_tibb)

map(myfiles,safe_text_to_tibb)
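
Each element returned by a safely-wrapped function is a list with a result and an error component, one of which is NULL. A minimal sketch of separating the failures from the successes afterwards; `parse_line` is a made-up stand-in for the real parser so that the block runs on its own.

```r
library(purrr)

# Hypothetical stand-in parser: fails on input that has no tab
parse_line <- function(x) {
  if (!grepl("\t", x)) stop("no tab-separated score found")
  strsplit(x, "\t")[[1]]
}
safe_parse <- safely(parse_line)

results <- map(c(good = "Words\t24", bad = "Words"), safe_parse)

# Inputs that raised an error, with the error objects kept for inspection
failed <- keep(results, ~ !is.null(.x$error))
# Successful inputs, reduced to just their results
parsed <- map(discard(results, ~ !is.null(.x$error)), "result")
```

With named input, names(failed) shows which files need a closer look, while parsed can go straight into the rest of the pipeline.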
