R: aggregate by day and count specific observations per day

     Day        Time  Numbers
6388 2017-02-01 10:43 R33
7129 2017-02-04 15:32 N39.0,N39.0,N39.0
9689 2017-02-17 08:54 S72.11,S72.11,S72.11
6703 2017-02-02 18:55 R11
9026 2017-02-13 17:34 S06.0,S06.0,S06.0
5013 2017-01-25 00:33 J18.1,J18.1,J18.1
5849 2017-01-29 17:57 I21.4,I21.4,I21.4
9245 2017-02-14 19:03 J18.0,J18.0,J18.0
1978 2017-01-09 21:23 K59.0
5021 2017-01-25 02:46 I47.1,I47.1,I47.1
9258 2017-02-14 20:19 S42.3
541  2017-01-03 11:44 I63.8,I63.8,I63.8
4207 2017-01-20 19:52 E83.58,E83.58,E83.58
8650 2017-02-11 18:39 R55,R55,R55
9442 2017-02-15 21:30 K86.1
4186 2017-01-20 18:27 S05.1
4231 2017-01-20 22:10 M17.9
6847 2017-02-03 11:45 L02.4
1739 2017-01-08 21:19 S20.2
3685 2017-01-18 09:56 G40.9
9497 2017-02-16 09:52 S83.6
2563 2017-01-12 20:47 M13.16,M25.56,M25.56
9731 2017-02-17 13:10 B99,B99,N39.0
7759 2017-02-07 14:25 R51,G43.0,G43.0
368  2017-01-02 15:05 T83.0,T83.0,N13.3,N13.6

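I want to summarise this df in a particular way: for each day, I want to count how many Numbers start with "A", and likewise for the other letters. I would like a new data frame that looks like this: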

    Day          GroupA   GroupB   GroupC .....
1  2017-01-01       2        2       0    
2  2017-01-02       ..................      

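GroupA means Numbers that start with A. If a single row contains several Numbers starting with A, it counts only once. The class of my Numbers column is character.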

> class(df[1,3])
[1] "character"
> df[1,3]
[1] "A41.8,A41.51,A41.51"

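My question is how to combine an aggregate command with the counting. My real df is much larger and covers more than two years, so I need an automated solution.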

Edit: see the data below

structure(list(Day= c("2017-01-07","2017-01-23","2017-01-08","2017-01-13","2017-02-10","2017-01-07","2017-01-24","2017-01-02","2017-01-03","2017-01-06","2017-01-11","2017-01-21","2017-01-10","2017-02-18","2017-01-31","2017-01-27","2017-01-09","2017-01-08"),Time= c("02:02","14:51","02:12","17:49","00:00","21:30","22:28","17:27","12:14","22:52","14:19","11:40","19:33","04:01","15:59","14:57","08:34","13:21","02:01","14:29","20:17","14:30","02:34","04:56","14:34"),Number= c("H10.9","K85.80,K85.20,K85.80,K85.20","R09.1","I10.90","I48.9,I48.0,I48.9,I48.0","A09.0,A09.0,R42,R42","H16.1","K92.1,K92.1,K92.1","K40.90,J12.2,J96.01,J12.2","B99,J15.8,J15.8","S01.55","M21.33","I10.01,I10.01,J44.81,J44.81","S00.95","B08.2","S05.1","M20.1","G40.2,S93.40,S93.40","M25.51","J44.19,J44.11,J44.19,J44.11","G40.9,G40.2,G40.2","E87.1,E87.1,J18.0","I10.91","R22.0","S06.5,S06.5,R06.88,S12.22"
)),.Names = c("Day","Time","Number"),row.names = c(1336L,4687L,1536L,2737L,8272L,1507L,4994L,400L,550L,1305L,2325L,4292L,2748L,2008L,9974L,2113L,6144L,5433L,4577L,2697L,8468L,1883L,4578L,1783L,1657L),class = "data.frame")
zhangyansong answered: R: aggregate by day and count specific observations per day

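This is a fun problem to dig into. The first step is to get the unique uppercase letters in each row's Number value. stringr::str_extract_all returns a list-column of character vectors holding every match of the regular expression, and after taking the unique values of each list entry you get: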

library(dplyr)
library(tidyr)

as_tibble(df1) %>%
  mutate(Day = lubridate::ymd(Day),letters = purrr::map(stringr::str_extract_all(Number,"[A-Z]"),unique)) %>%
  select(-Number) %>%
  head()
#> # A tibble: 6 x 3
#>   Day        Time  letters  
#>   <date>     <chr> <list>   
#> 1 2017-01-07 02:02 <chr [1]>
#> 2 2017-01-23 14:51 <chr [1]>
#> 3 2017-01-08 02:12 <chr [1]>
#> 4 2017-01-13 17:49 <chr [1]>
#> 5 2017-02-10 00:00 <chr [1]>
#> 6 2017-01-07 21:30 <chr [2]>
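
To see what ends up inside that list-column, here is a minimal base R sketch of the same extraction. It uses regmatches() and gregexpr() in place of stringr::str_extract_all, and the three sample strings and the names number, all_letters and row_letters are made up for illustration.

```r
# Hypothetical sample values in the same format as the Number column
number <- c("H10.9", "A09.0,A09.0,R42,R42", "B99,J15.8,J15.8")

# gregexpr() locates every uppercase letter; regmatches() returns them
# as a list with one character vector per input string
all_letters <- regmatches(number, gregexpr("[A-Z]", number))

# Keep each letter once per row, mirroring purrr::map(..., unique)
row_letters <- lapply(all_letters, unique)

row_letters
```

The second element holds both "A" and "R", which is why row 6 in the output above shows <chr [2]>.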

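Unnest it so that you have one row per date and time per letter, then count the observations per letter per day. Arrange by letter, because the columns would otherwise come out in a jumble and order matters here. Then reshape to wide format so that each group gets its own column.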

as_tibble(df1) %>%
  mutate(Day = lubridate::ymd(Day),letters = purrr::map(stringr::str_extract_all(Number,"[A-Z]"),unique)) %>%
  select(-Number) %>%
  unnest(letters) %>%
  count(Day,letters) %>%
  arrange(letters) %>%
  pivot_wider(names_from = letters,names_prefix = "group",values_from = n,values_fill = list(n = 0)) %>%
  head()
#> # A tibble: 6 x 12
#>   Day        groupA groupB groupE groupG groupH groupI groupJ groupK groupM
#>   <date>      <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
#> 1 2017-01-07      1      0      0      0      1      0      0      0      0
#> 2 2017-01-06      0      1      0      0      0      0      1      0      0
#> 3 2017-02-18      0      1      0      0      0      0      0      0      0
#> 4 2017-01-09      0      0      1      0      0      0      1      0      0
#> 5 2017-01-27      0      0      0      1      0      0      0      0      0
#> 6 2017-02-10      0      0      0      1      0      1      0      0      0
#> # … with 2 more variables: groupR <int>,groupS <int>

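In the first few rows of this data sample there are no counts of 2, but there are some further down the data frame. (I haven't worked out how pivot_wider orders its rows, but you can arrange by Day afterwards if you need to.)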
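
If the tidyverse is not available, roughly the same per-day table can be sketched in base R with table(). This is an alternative to the pipeline above, not the answer's own method. The four-row data frame is invented for illustration (it is not the asker's data), and the names letters_per_row, long, counts and wide are hypothetical.

```r
# Made-up sample: two days, one row repeating the same letter twice
df <- data.frame(
  Day = c("2017-01-07", "2017-01-07", "2017-01-08", "2017-01-08"),
  Number = c("H10.9", "A09.0,A09.0,R42,R42", "R09.1", "R22.0,R22.0"),
  stringsAsFactors = FALSE
)

# Unique leading letters per row, so repeats within a row count once
letters_per_row <- lapply(
  regmatches(df$Number, gregexpr("[A-Z]", df$Number)),
  unique
)

# Long format: one line per (row, letter) pair
long <- data.frame(
  Day = rep(df$Day, lengths(letters_per_row)),
  letter = unlist(letters_per_row),
  stringsAsFactors = FALSE
)

# Cross-tabulate day by group; table() sorts both dimensions
counts <- table(long$Day, paste0("Group", long$letter))

# Convert to a plain data frame with Day as the first column
wide <- as.data.frame.matrix(counts)
wide <- cbind(Day = rownames(wide), wide, stringsAsFactors = FALSE)
rownames(wide) <- NULL

wide
```

Because table() sorts its levels, the rows come out ordered by day and the columns by letter, and any day and group combination with no observations is filled with 0.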
