Importing data with a hidden format from a text file into R

I downloaded a pile of data from an abominable local government website. There are 77,000 item entries that look exactly like the following, contained in a plain text file. I need to import this pile of crap into R as a data frame:

    Instrument: 201301240005447
    Recorded: 01/24/2013
    Consideration: $150,125.00
    Document Type: MORTGAGES
    Pages: 17
    Grantor: BYRES,CONNIE R / BYRES,SCOTT
    Grantee: MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC
    Legal Description: * St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4
    *
    ---------------------------------/---------------------------------
    Instrument: 201301240005408
    Recorded: 01/24/2013
    Consideration: $65,124.00
    Document Type: MORTGAGES
    Pages: 17
    Grantor: SANNE,BETTY LOU / SANNE,KENNETH D
    Grantee: JPMORGAN CHASE BANK NA
    Legal Description: Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54
    *
    ---------------------------------/---------------------------------

There are some recurring labels, such as "Instrument", "Grantor", and "PrpId". How on earth do I import this into R? Will this involve parsing or scraping of some kind?

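Needless to say, I tried importing this file into Excel, but that did not work. I figure R will do a better job; I just need to work out how. Thanks.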

Solution

I wrote a fairly general parsing function that can handle any pattern of divider lines and field/value separators, each specified as a regular-expression argument. It can also optionally strip trailing whitespace from field values, and it passes variadic arguments through to the single data.frame() call that builds the resulting data.frame.

    sectionedFieldLinesToFrame <- function(lines,divRE,sepRE,select,rtw=TRUE,...) {
        ## indexes of the divider lines that separate sections
        divLineIndexes <- grep(divRE,lines,perl=TRUE);
        ## remove possible leading and trailing divs, for robustness
        if (length(divLineIndexes)>0L && divLineIndexes[1L]==1L) {
            leadDivCount <- match(TRUE,c(diff(divLineIndexes)!=1L,TRUE));
            lines <- lines[-seq_len(leadDivCount)];
            divLineIndexes <- divLineIndexes[-seq_len(leadDivCount)]-leadDivCount;
        }; ## end if
        if (length(divLineIndexes)>0L && divLineIndexes[length(divLineIndexes)]==length(lines)) {
            trailDivCount <- match(TRUE,c(rev(diff(divLineIndexes)!=1L),TRUE));
            lines <- lines[-seq(to=length(lines),length.out=trailDivCount)];
            divLineIndexes <- divLineIndexes[-seq(to=length(divLineIndexes),length.out=trailDivCount)];
        }; ## end if
        ## get fields to extract: either every field name seen in any section, or the caller's selection
        if (missing(select)) {
            allFieldLineIndexes <- grep(sepRE,lines,perl=TRUE);
            fields <- unique(sub(paste0(sepRE,'.*'),'',lines[allFieldLineIndexes],perl=TRUE));
        } else {
            fields <- select;
        }; ## end if
        ## extract each field vector and build the data.frame
        do.call(data.frame,c(setNames(lapply(fields,function(field) {
            ## \Q...\E quotes the field name so it is matched literally
            fieldLineIndexes <- grep(paste0('^\\Q',field,'\\E',sepRE),lines,perl=TRUE);
            sectionIndexes <- findInterval(fieldLineIndexes,divLineIndexes); ## 0-based
            values <- sub(paste0('^.*?',sepRE),'',lines[fieldLineIndexes],perl=TRUE);
            if (rtw) values <- sub('\\s+$','',values,perl=TRUE);
            ## one element per section; NA where the field is absent from that section
            values[match(seq(0L,length(divLineIndexes)),sectionIndexes)];
        }),fields),...));
    }; ## end sectionedFieldLinesToFrame()

Here is how to use it:

    fileName <- 'data.txt';
    divRE <- '^-+/-+$';
    sepRE <- ':\\s*';
    df <- sectionedFieldLinesToFrame(readLines(fileName),divRE,sepRE,stringsAsFactors=FALSE);
    str(df);
    ## 'data.frame': 2 obs. of  8 variables:
    ##  $ Instrument       : chr  "201301240005447" "201301240005408"
    ##  $ Recorded         : chr  "01/24/2013" "01/24/2013"
    ##  $ Consideration    : chr  "$150,125.00" "$65,124.00"
    ##  $ Document.Type    : chr  "MORTGAGES" "MORTGAGES"
    ##  $ Pages            : chr  "17" "17"
    ##  $ Grantor          : chr  "BYRES,CONNIE R / BYRES,SCOTT" "SANNE,BETTY LOU / SANNE,KENNETH D"
    ##  $ Grantee          : chr  "MORTGAGE ELECTRONIC REGISTRATION SYSTEMS INC / QUICKEN LOANS INC" "JPMORGAN CHASE BANK NA"
    ##  $ Legal.Description: chr  "* St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4" "Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54"
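
Every column comes back as character, so a type-conversion pass is a likely next step. The following is only a minimal sketch of one way to do it, assuming the formats seen in the two sample records (month/day/year dates, dollar amounts with thousands separators, and a `PrpId:` token inside the legal description). The data frame here is a hand-built, hypothetical stand-in for the parsed result so that the snippet runs on its own, and the `PrpId` column is an illustrative addition, not something the parsing function produces.

```r
## hypothetical stand-in for the parsed result, using values from the sample records
df <- data.frame(
    Recorded=c('01/24/2013','01/24/2013'),
    Consideration=c('$150,125.00','$65,124.00'),
    Pages=c('17','17'),
    Legal.Description=c(
        '* St:5495 MCNAMARA LN City:FLINT PrpId:1135532002 CC:11 T:8 R:7 S:35 ext:PT OF NE4',
        'Sub:WOODCROFT NO 1 Lt:188 St:2213 RADCLIFFE AVE City:FLINT PrpId:4024106003 CC:54'
    ),
    stringsAsFactors=FALSE
);

## assumed date format: month/day/four-digit year
df$Recorded <- as.Date(df$Recorded,format='%m/%d/%Y');
## drop the dollar sign and thousands separators before converting
df$Consideration <- as.numeric(gsub('[$,]','',df$Consideration));
df$Pages <- as.integer(df$Pages);

## pull the digits after 'PrpId:' into their own column; NA where the token is absent
hasPrpId <- grepl('PrpId:[0-9]+',df$Legal.Description);
df$PrpId <- ifelse(hasPrpId,sub('^.*PrpId:([0-9]+).*$','\\1',df$Legal.Description),NA_character_);
str(df);
```

The property ID is kept as character rather than numeric on the assumption that it is an identifier, not a quantity.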

You can also specify the select argument to choose exactly which fields to extract:

    select <- c('Instrument','Pages','Grantor');
    df <- sectionedFieldLinesToFrame(readLines(fileName),divRE,sepRE,select,stringsAsFactors=FALSE);
    df;
    ##        Instrument Pages                           Grantor
    ## 1 201301240005447    17      BYRES,CONNIE R / BYRES,SCOTT
    ## 2 201301240005408    17 SANNE,BETTY LOU / SANNE,KENNETH D

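I've tried to make it as robust as possible. It carefully handles possible redundant leading and trailing divider lines, and it correctly handles fields that are inconsistent between sections.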

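That last point is worth emphasizing. All of the other solutions offered make very fragile assumptions about the input data: either that every section has exactly 8 fields, always in the same order, or that every (possibly hard-coded) field name appears in every section. If that assumption is violated, those solutions become useless. My function makes no assumptions about the number, names, or consistency of the fields. It dynamically retrieves all field names present in any section and builds the correct vector for each field, generating NA elements where a field is absent from a given section.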

Here are some examples:

    sectionedFieldLinesToFrame(character(),'^-$',':');
    ## data frame with 0 columns and 0 rows
    sectionedFieldLinesToFrame(rep('-',2L),'^-$',':');
    ## data frame with 0 columns and 0 rows
    sectionedFieldLinesToFrame(c('A:a','-'),'^-$',':');
    ##   A
    ## 1 a
    sectionedFieldLinesToFrame(c('A:a','-','B:b','-'),'^-$',':');
    ##      A    B
    ## 1    a <NA>
    ## 2 <NA>    b
    sectionedFieldLinesToFrame(c('A:a','B:b','-','B:c','-'),'^-$',':');
    ##      A B
    ## 1    a b
    ## 2 <NA> c
    sectionedFieldLinesToFrame(c('A:a','B:b','-','B:c','-','A:d'),'^-$',':');
    ##      A    B
    ## 1    a    b
    ## 2 <NA>    c
    ## 3    d <NA>
    sectionedFieldLinesToFrame(c('-','A:a','B:b','-','B:c','-','A:d','C:e','-'),'^-$',':');
    ##      A    B    C
    ## 1    a    b <NA>
    ## 2 <NA>    c <NA>
    ## 3    d <NA>    e
    sectionedFieldLinesToFrame(c('-','A:a','B:b','-','-','B:c','-','A:d','C:e','-'),'^-$',':');
    ##      A    B    C
    ## 1    a    b <NA>
    ## 2 <NA> <NA> <NA>
    ## 3 <NA>    c <NA>
    ## 4    d <NA>    e
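
The NA filling comes down to two base R calls inside the function: findInterval() turns each field line's position into a 0-based section number by counting the dividers before it, and match() then looks up every section number in that result, which yields NA for any section that has no such line. The sketch below isolates just that step for a single field. The input vector and the variable name `columnA` are hypothetical and chosen only for illustration.

```r
## hypothetical six-line input: three sections separated by '-' dividers
lines <- c('A:a','B:b','-','B:c','-','A:d');

divLineIndexes <- grep('^-$',lines,perl=TRUE);    ## dividers sit at lines 3 and 5
fieldLineIndexes <- grep('^A:',lines,perl=TRUE);  ## field A appears on lines 1 and 6

## number of dividers before each field line = its 0-based section number (0 and 2 here)
sectionIndexes <- findInterval(fieldLineIndexes,divLineIndexes);

## strip the field name and separator, leaving the values 'a' and 'd'
values <- sub('^.*?:','',lines[fieldLineIndexes],perl=TRUE);

## one slot per section (0 through the number of dividers);
## section 1 has no A line, so match() returns NA there and indexing by NA gives NA
columnA <- values[match(seq(0L,length(divLineIndexes)),sectionIndexes)];
columnA;
```

Because every field vector is built against the same sequence of section numbers, all columns end up with the same length regardless of which fields each section contains, which is what lets a single data.frame() call assemble them.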
