Reading a delimited .txt file with multiple interspersed headers in R

2024-05-19 • Q&A

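I am trying to open and clean a large oceanographic dataset in R, in which station information is interspersed between the blocks of observations as headers: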

$
 2008    1  774  8 17  5 11  2   78.4952    6.0375 30  7    1.2 -999.0 -9 -9 -9 -9 4868.8 2017  0  7114
    2.0    6.0297   35.0199   34.4101    2.0 11111
    3.0    6.0279   35.0201   34.4091    3.0 11111
    4.0    6.0272   35.0203   34.4091    4.0 11111
    5.0    6.0273   35.0204   34.4097    4.9 11111
    6.0    6.0274   35.0205   34.4104    5.9 11111
$
 2008    1  777  8 17 12  7 25   78.4738    8.3510 27  6    4.1 -999.0  3  7  2  0 4903.8 1570  0  7114
    3.0    6.4129   34.5637   34.3541    3.0 11111
    4.0    6.4349   34.5748   34.3844    4.0 11111
    5.0    6.4803   34.5932   34.4426    4.9 11111
    6.0    6.4139   34.5624   34.3552    5.9 11111
    7.0    6.5079   34.6097   34.4834    6.9 11111

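Each $ is followed by one line of station data (e.g. year, ..., lat, lon, date, time), then by several lines holding the observations sampled at that station (e.g. depth, temperature, salinity, etc.).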

I would like to add the station data to the observations, so that each variable is a column and each observation is a row, like this:

2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    2   6.0297  35.0199 34.4101 2   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    3   6.0279  35.0201 34.4091 3   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    4   6.0272  35.0203 34.4091 4   11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    5   6.0273  35.0204 34.4097 4.9 11111
2008    1   774 8   17  5   11  2   78.4952 6.0375  30  7   1.2 -999    6   6.0274  35.0205 34.4104 5.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    3   6.4129  34.5637 34.3541 3   11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    4   6.4349  34.5748 34.3844 4   11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    5   6.4803  34.5932 34.4426 4.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    6   6.4139  34.5624 34.3552 5.9 11111
2008    1   777 8   17  12  7   25  78.4738 8.351   27  6   4.1 -999    7   6.5079  34.6097 34.4834 6.9 11111

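This solution is fairly involved and relies on knowledge of several Tidyverse libraries and functions. I am not sure it will cover all of your needs, but it does work for the sample you posted. In any case, I find the approach of folding the blocks, writing functions to parse the smaller chunks, and then unfolding the results very useful.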

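The first part finds the "$" markers, groups the lines that follow each one, and then "nests" each block of data. That leaves a data frame with only a few rows, one per section.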

library(tidyverse)
txt_lns <- readLines("ocean-sample.txt") 

txt <- tibble(txt = txt_lns)

# Start by finding new sections,and nesting the data
nested_txt <- txt %>%
  mutate(row_number = row_number()) %>%
  mutate(new_section = str_detect(txt,"\\$")) %>%            # Mark new sections
  mutate(starting = ifelse(new_section,row_number,NA)) %>%  # Index with row num
  tidyr::fill(starting) %>%                                   # Fill index down
                                                              # where missing
  select(-new_section) %>%                                    # Clean up
  filter(!str_detect(txt,"\\$")) %>%                         
  nest(data = c(txt,row_number))                             # "Nest" the data

# Take a quick look
nested_txt

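Next we need to be able to process those nested blocks. The routines here parse each block by identifying the header row and then separating the fields into their own data frames. The logic for the header row differs from the logic for the shorter observation rows.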

# Deal with the records within a section
parse_inner_block <- function(x,header_ind) {
  if (header_ind) {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>%
      # Separate the header row into 22 variables
      separate(txt,into = LETTERS[1:22],sep = "\\s+")
  } else {
    df <- x %>%
      mutate(txt = str_trim(txt)) %>% 
      # Separate the lesser rows into 6 variables
      separate(txt,into  = letters[1:6],sep = "\\s+")
  }
  return(df)
}

parse_outer_block <- function(x) {
  df <- x %>%
    # Determine if it's a header row with 22 variables or lesser row with 6
    mutate(leading_row = (row_number == min(row_number))) %>%
    # Fold by header row vs. not
    nest(data = c(txt,row_number)) %>%
    # Create data frames for both header and lesser rows
    mutate(processed = purrr::map2(data,leading_row,parse_inner_block)) %>%
    unnest(processed) %>%
    # Copy header row values to lesser rows
    tidyr::fill(A:V) %>%
    # Drop header row
    filter(!leading_row)
  return(df)
}

Then we can put it all together: start from the nested data, process each block, unnest the returned fields, and prepare the full output.

# Actually put all this together and generate an output dataframe
output <- nested_txt %>%
  mutate(proc_out = purrr::map(data,parse_outer_block)) %>%
  select(-data) %>%
  unnest(proc_out) %>%
  select(-starting,-leading_row,-data,-row_number)

output
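
One thing to watch for: separate() returns every field as character by default, so the columns of output are text, not numbers. A minimal sketch of converting them with base R's type.convert() follows. Here demo is a hypothetical stand-in for a few columns of output, built by hand so the snippet runs on its own.

```r
# Hypothetical stand-in for a few columns of `output`;
# the values are copied from the sample in the question.
demo <- data.frame(
  A = c("2008", "2008"),
  I = c("78.4952", "78.4952"),
  a = c("2.0", "3.0"),
  b = c("6.0297", "6.0279"),
  stringsAsFactors = FALSE
)

# Convert each column to the most specific type that fits its values;
# as.is = TRUE keeps any remaining text as character instead of factor.
demo_typed <- type.convert(demo, as.is = TRUE)
str(demo_typed)
```

With the tidyverse already loaded, readr::type_convert(output) should do the same job on the real result.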

Hope that helps. I would also suggest looking at some purrr tutorials for similar problems.

This is simpler and depends only on base R. I assume you have already read the text file with x <- readLines(....):

start <- which(x == "$") + 1             # Find header indices
rows <- diff(c(start,length(x)+2)) - 2  # Find number of lines per group
# Function to read header and rows and cbind
getdata <- function(begin,end) {
    cbind(read.table(text=x[begin]),read.table(text=x[(begin+1):(begin+end)]))
}
dta.list <- lapply(seq_along(start),function(i) getdata(start[i],rows[i]))
dta.df <- do.call(rbind,dta.list)

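This works for the two groups included in your post. You will need to fix the column names, since V1-V6 are repeated at the beginning and the end.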
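
One way to avoid the repeated names, sketched here under assumptions, is to pass col.names to read.table() so the station and observation columns get distinct names from the start. The names stn_1..stn_22 and obs_1..obs_6 are placeholders I made up, and x holds a shortened copy of the sample from the question instead of the result of readLines().

```r
# Shortened copy of the sample in the question, kept in memory so the
# snippet runs on its own.
x <- c(
  "$",
  " 2008    1  774  8 17  5 11  2   78.4952    6.0375 30  7    1.2 -999.0 -9 -9 -9 -9 4868.8 2017  0  7114",
  "    2.0    6.0297   35.0199   34.4101    2.0 11111",
  "    3.0    6.0279   35.0201   34.4091    3.0 11111",
  "$",
  " 2008    1  777  8 17 12  7 25   78.4738    8.3510 27  6    4.1 -999.0  3  7  2  0 4903.8 1570  0  7114",
  "    3.0    6.4129   34.5637   34.3541    3.0 11111"
)

# Placeholder column names (assumption): swap in the real variable names.
station_names <- paste0("stn_", 1:22)
obs_names <- paste0("obs_", 1:6)

start <- which(x == "$") + 1               # Header line of each group
rows <- diff(c(start, length(x) + 2)) - 2  # Observation lines per group

# Read one header line and its observation lines, naming columns on the way in;
# cbind() recycles the single header row across the observation rows.
getdata <- function(begin, end) {
  station <- read.table(text = x[begin], col.names = station_names)
  obs <- read.table(text = x[(begin + 1):(begin + end)], col.names = obs_names)
  cbind(station, obs)
}

dta.df <- do.call(rbind, lapply(seq_along(start), function(i) getdata(start[i], rows[i])))
names(dta.df)
```

If you would rather keep the original code unchanged, assigning names(dta.df) <- c(station_names, obs_names) after the rbind step should give the same column names.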
