In R

Problem introduction

Hello,

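I am setting up a data plan for my lab, which will begin a blinded clinical trial in January. Part of this task is to build some data processing pipelines so that, once all the data are collected, we can run the code quickly.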

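One of the outcome measures we are using is a behavioral test. Someone developed a JavaScript program that scores the test automatically; however, the output looks like five tables stacked on top of each other. With help from some Stack Overflow users, I was able to develop a pipeline that restructures a single txt file into a data frame that can then be analyzed. The problem I have now is how to process all of the files at once.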

My idea was to load all the files into a list and then operate on each element of the list with list.map or lapply. However, I ran into two problems, outlined below.

First, here are the code and data for processing a single data frame.

input <- c("Cognitive Screen","Subtest/Section\t\t\tScore\tT-Score","1. Line Bisection\t\t9\t53","2. Semantic Memory\t\t8\t51","3. Word Fluency\t\t\t1\t56*","4. Recognition Memory\t\t40\t59","5. Gesture Object Use\t\t2\t68","6. Arithmetic\t\t\t5\t49","Cognitive TOTAL\t\t\t65","","Language Battery","Part 1: Language Comprehension","Spoken Language\t\t\tScore\tT-Score","7. Spoken Words\t\t\t17\t45*","9. Spoken Sentences\t\t25\t53*","11. Spoken Paragraphs\t\t4\t60","Spoken Language TOTAL\t\t46\t49*","Written Language\t\tScore\tT-Score","8. Written Words\t\t14\t45*","10. Written Sentences\t\t21\t48*","Written Language TOTAL\t\t35\t46*","Part 2: Expressive Language","Repetition\t\t\tScore\tT-Score","12. Words\t\t\t24\t55*","13. Complex Words\t\t8\t52*","14. Nonwords\t\t\t10\t58","15. Digit Strings\t\t8\t55","16. Sentences\t\t\t12\t63","Repetition TOTAL\t\t62\t57*","17. Naming Objects\t\t30\t55*","18. Naming actions\t\t36\t63","3. Word Fluency\t\t\t12\t56*","Naming TOTAL\t\t\t56\t57*","Spoken Picture Description\tScore\tT-Score","19. Spoken Picture Description\t\t","Reading Aloud\t\t\tScore\tT-Score","20. Words\t\t\t25\t50*","21. Complex Words\t\t8\t51*","22. Function Words\t\t3\t62","23. Nonwords\t\t\t6\t51*","Reading TOTAL\t\t\t42\t50*","Writing\t\t\t\tScore\tT-Score","24. Writing: Copying\t\t26\t52","25. Writing Picture Names\t14\t53*","26. Writing to Dictation\t28\t68","Writing TOTAL\t\t\t68\t58*","Written Picture Description\tScore\tT-Score","27. Written Picture Description\t\t")  

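Once the input file has been created, here is the code I use to build the data frame (I know the data frame columns are all character; I will fix that later).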

library(tidyverse)

# read the raw text file, one element per line
input <- read_lines('Example_data')

# do the match and keep only the second column
header <- as_tibble(str_match(input,"^(.*?)\\s+Score.*")[,2,drop = FALSE])
colnames(header) <- 'title'

# add index to the list so we can match the scores that come after
header <- header %>%
  mutate(row = row_number()) %>%
  fill(title)  # copy title down

# pull off the scores on the numbered rows
scores <- str_match(input,"^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
  mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[[1]]),-1]

# merge the header with the scores to give each section
data <- left_join(scores,header,by = 'row')

#create correct header in new dataframe
data2 <- data.frame(domain = as.vector(str_replace(data$title,"Subtest/Section","cognition")),subtest = data$V3,score = data$V4,t.score = data$V5)

head(data2) 

OK, now on to processing multiple data files. My plan is to put all the txt files in one folder and then list them, like this:

# library(rlist)
# setwd("C:/Users/Brahma/Desktop/CAT TEXT FILES/Data")
# temp = list.files(pattern = "*Example")
# myfiles = lapply(temp,readLines)

Reproducible example files:

myfiles <- list(c("Cognitive Screen","27. Written Picture Description\t\t"),c("Cognitive Screen","27. Written Picture Description\t\t")) 

This is where the trouble starts

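I tried using lapply and list.map from the rlist package. First, lapply does not seem to like the pipe operator, so I tried to work step by step. I also tried writing a function for this step.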

Create the tibbles. This works!

list_header <- lapply(myfiles,as.tibble)

Error: trying to start the data manipulation

list_header2 <- lapply(list_header,str_match(list_header,"^(.*?)\\s+Score.*")[,2,drop = FALSE])

This line of code gives the following error:

Error in match.fun(FUN) :
  'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing

So I tried writing a function to use here instead:

drop_rows <- function(df) {
  new_df <- str_match_all(df[[1:3]]$value,"^(.*?)\\s+Score.*")
}

list_header2 <- lapply(list_header,drop_rows)

Now I get this error:

Error in match.fun(FUN) :
  'str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE]' is not a function, character or symbol
In addition: Warning message:
In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) :
  argument is not an atomic vector; coercing

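Summary:

The code provided works when a single txt file is loaded. However, when I try to run the code to batch-process a list of files, I run into trouble. If anyone can offer some insight into how to resolve this error, **I think** I will be able to work out the rest. That said, if you are willing to help with the rest of the code, I will not object.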

benbenmail answered: in R

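Rather than trying to debug your code, I tried to find a solution that works for your example data. The following seems to work for both a single vector and a list of vectors: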

library(tidyverse)

text_to_tibb <- function(char_vec){
    str_split(char_vec,"\t") %>% 
        map_dfr(~ .[nchar(.) > 0] %>% matrix(.,nrow = 1) %>%
                    as_tibble
                ) %>% 
        filter(!is.na(V2),!str_detect(V1,"TOTAL")) %>%
        mutate(title = str_detect(V1,"^\\d+\\.",negate = TRUE),group = cumsum(title)
               ) %>% 
        group_by(group) %>%
        mutate(domain = first(V1)) %>% 
        filter(!title) %>% 
        ungroup() %>% 
        select(domain,V1,V2,V3,-title,-group) %>% 
        mutate(V1 = str_remove(V1,"^\\d+\\. "),domain = str_replace(domain,"Subtest.*","Cognition")) %>% 
        rename(subtest = V1,score = V2,t_score = V3)
}
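
The first step of the function is easier to follow on a single line. The sketch below uses base R only (`strsplit` instead of `str_split`) and assumes, as in the example data, that fields are separated by one or more tabs. Runs of tabs produce empty strings, and dropping them is what lets rows with different amounts of padding line up in the same columns.

```r
# One raw line from the example input
line <- "3. Word Fluency\t\t\t1\t56*"

# Splitting on every tab leaves empty strings between consecutive tabs
fields <- strsplit(line, "\t", fixed = TRUE)[[1]]

# Keep only the non-empty pieces: label, score, t-score
fields <- fields[nchar(fields) > 0]

# A one-row matrix becomes one row of the final table
row <- matrix(fields, nrow = 1)
print(row)
```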

If you run it on your input variable, you should get a clean tibble:

text_to_tibb(input)

#### OUTPUT ####
# A tibble: 26 x 4
   domain           subtest            score t_score
   <chr>            <chr>              <chr> <chr>  
 1 Cognition        Line Bisection     9     53     
 2 Cognition        Semantic Memory    8     51     
 3 Cognition        Word Fluency       1     56*    
 4 Cognition        Recognition Memory 40    59     
 5 Cognition        Gesture Object Use 2     68     
 6 Cognition        Arithmetic         5     49     
 7 Spoken Language  Spoken Words       17    45*    
 8 Spoken Language  Spoken Sentences   25    53*    
 9 Spoken Language  Spoken Paragraphs  4     60     
10 Written Language Written Words      14    45*    
# … with 16 more rows
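
The score and t_score columns come back as character, which the question mentions fixing later. Here is a minimal base R sketch of one way to do that, on a toy data frame copied from two rows of the output above. The column name `t_flag` is made up for illustration, and the sketch only records whether an asterisk was present; what the asterisk means is not stated in the question.

```r
# Two rows copied from the output above, as a plain data frame
scores <- data.frame(
  subtest = c("Line Bisection", "Word Fluency"),
  score = c("9", "1"),
  t_score = c("53", "56*"),
  stringsAsFactors = FALSE
)

# Record the trailing asterisk before stripping it (t_flag is a hypothetical name)
scores$t_flag <- grepl("\\*$", scores$t_score)

# Remove the asterisk and convert both columns to integer
scores$t_score <- as.integer(sub("\\*$", "", scores$t_score))
scores$score <- as.integer(scores$score)

str(scores)
```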

It also works on the list of vectors you included above. Just use lapply or purrr::map:

map(myfiles,text_to_tibb)
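
Note that text_to_tibb is passed by name, without parentheses or arguments. The match.fun(FUN) error in the question comes from passing the result of a call as the second argument of lapply, where a function is expected. Below is a small sketch of the difference; the two short vectors in `files` are hypothetical stand-ins trimmed from the example data.

```r
library(stringr)

# Two short stand-in "files" (hypothetical, trimmed from the example data)
files <- list(
  c("Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*"),
  c("Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*")
)
pattern <- "^(.*?)\\s+Score.*"

# Fails with the match.fun(FUN) error: the second argument is a matrix, not a function
# lapply(files, str_match(files, pattern)[, 2, drop = FALSE])

# Works: wrap the call in a function that receives one list element at a time
headers <- lapply(files, function(x) str_match(x, pattern)[, 2, drop = FALSE])
print(headers)
```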

If you think some of the tables may contain inconsistencies, you could try safely:

safe_text_to_tibb <- safely(text_to_tibb)

map(myfiles,safe_text_to_tibb)
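
A function wrapped in safely returns a list with a result and an error element for every input, so the successes and failures still have to be separated afterwards. The sketch below shows one way to do that for files read from a folder. It is only an illustration under stated assumptions: `count_score_rows` is a hypothetical stand-in for text_to_tibb, and the two files are written to a temporary folder just for the example.

```r
library(purrr)

# Hypothetical stand-in parser; in practice this would be text_to_tibb()
count_score_rows <- function(lines) {
  if (!any(grepl("Score", lines))) stop("no score header found")
  sum(grepl("^[0-9]+\\.", lines))
}

# Write two small example files to a temporary folder
dir <- tempfile("cat_files")
dir.create(dir)
writeLines(c("Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*"),
           file.path(dir, "good_Example.txt"))
writeLines("Cognitive Screen", file.path(dir, "bad_Example.txt"))

# Read every matching file into a list named by file name
paths <- list.files(dir, pattern = "Example", full.names = TRUE)
files <- map(set_names(paths, basename(paths)), readLines)

# Each element of out is a list with $result and $error
out <- map(files, safely(count_score_rows))

# Keep the successful results and the errors separately, dropping NULLs
results <- compact(map(out, "result"))
errors <- compact(map(out, "error"))

print(names(results))
print(names(errors))
```

Naming the list by file name keeps track of which file each result or error came from.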