如何使用dplyr按月划分事件间隔

我有一组由个人“标记”标识的事件,它们的位置“位置”事件“ StartDateTime_UTC”的开始和事件“ EndDateTime_UTC”的结束

前50行:(我将完整的dput放在下面)

   Tag     location   StartDateTime_UTC   EndDateTime_UTC    
   <chr>   <fct>      <dttm>              <dttm>             
 1 5004.24 IN         2014-09-30 05:30:00 2014-10-17 14:12:19
 2 5004.24 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43
 3 5004.24 OUT        2014-10-17 14:15:43 2014-11-04 10:05:31
 4 5004.24 Unresolved 2014-11-04 10:05:31 2014-11-04 10:08:06
 5 5004.24 IN         2014-11-04 10:08:06 2014-11-12 10:50:28
 6 5004.24 Unresolved 2014-11-12 10:50:28 2014-11-12 10:51:33
 7 5004.24 OUT        2014-11-12 10:51:33 2014-12-24 04:20:28
 8 5004.24 Unresolved 2014-12-24 04:20:28 2014-12-24 04:31:27
 9 5004.24 IN         2014-12-24 04:31:27 2015-02-08 22:30:27
10 5004.24 Unresolved 2015-02-08 22:30:27 2015-02-08 22:31:31
11 5004.24 OUT        2015-02-08 22:31:31 2015-03-01 11:18:32
12 5004.24 Unresolved 2015-03-01 11:18:32 2015-03-01 19:34:36
13 5004.24 IN         2015-03-01 19:34:36 2015-03-09 10:11:55
14 5004.24 Unresolved 2015-03-09 10:11:55 2015-03-09 10:48:02
15 5004.24 OUT        2015-03-09 10:48:02 2015-03-23 09:10:25
16 5004.24 Unresolved 2015-03-23 09:10:25 2015-03-23 09:18:07
17 5004.24 IN         2015-03-23 09:18:07 2015-04-09 23:19:26
18 5004.24 Unresolved 2015-04-09 23:19:26 2015-04-09 23:21:40
19 5004.24 OUT        2015-04-09 23:21:40 2015-04-21 20:11:59
20 5004.24 Unresolved 2015-04-21 20:11:59 2015-04-22 16:33:54
21 5004.24 IN         2015-04-22 16:33:54 2015-06-10 09:19:12
22 5010.04 IN         2014-05-09 18:26:00 2015-04-21 18:28:16
23 5011.03 IN         2016-06-17 15:29:00 2016-07-21 14:23:34
24 5011.07 IN         2014-06-05 17:58:00 2014-12-08 15:15:07
25 5017.06 IN         2013-09-20 03:00:00 2016-08-18 18:13:04
26 5025.22 IN         2013-06-11 14:20:00 2013-06-11 14:20:01
27 5025.26 IN         2016-12-29 22:38:00 2018-01-08 16:51:42
28 5032.24 IN         2014-07-18 18:04:00 2015-06-13 12:44:02
29 5038.04 IN         2014-12-16 17:51:07 2015-04-10 23:28:12
30 5039.03 IN         2016-04-01 17:26:00 2016-07-02 04:39:15
31 5039.07 IN         2014-04-08 19:28:04 2014-04-08 19:53:06
32 5045    IN         2013-08-07 17:00:00 2015-04-08 18:28:43
33 5053.26 IN         2016-11-14 18:12:00 2017-12-24 20:21:31
34 5066.04 IN         2014-02-25 21:32:00 2014-02-25 21:32:01
35 5067.07 IN         2014-10-02 16:39:00 2014-12-08 16:23:50
36 5073.06 IN         2013-05-22 14:45:00 2013-05-22 14:45:01
37 5074.16 IN         2016-02-11 19:02:00 2016-03-09 18:21:45
38 5074.16 Unresolved 2016-03-09 18:21:45 2016-03-09 19:50:22
39 5074.16 OUT        2016-03-09 19:50:22 2016-03-12 23:26:45
40 5088.11 IN         2015-06-02 16:29:00 2015-08-08 11:07:46
41 5094.04 IN         2014-10-17 16:04:00 2014-10-31 16:39:28
42 5101.06 IN         2013-09-20 03:00:00 2013-09-20 03:00:01
43 5101.06 Unresolved 2013-09-20 03:00:01 2014-02-03 07:15:08
44 5101.06 OUT        2014-02-03 07:15:08 2014-04-17 12:15:34
45 5116.24 IN         2014-03-07 20:59:00 2015-03-07 22:21:52
46 5123.03 IN         2016-01-14 18:55:00 2016-03-09 19:21:20
47 5123.03 Unresolved 2016-03-09 19:21:20 2016-03-09 20:24:36
48 5123.03 OUT        2016-03-09 20:24:36 2016-03-09 21:51:51
49 5123.07 IN         2014-01-21 18:54:00 2014-05-20 15:23:09
50 5150.04 IN         2014-05-22 16:17:00 2016-03-07 14:10:00

我想做的是将每个事件的总时间除以每年每个月每个人在每个位置(IN,OUT或未解决)花费的时间。

例如,上面的前3行将变为:

Tag        location    Month    Year   Duration_hrs
5004.24    IN          9        2014   18.50
5004.24    IN          10       2014   328.20
5004.24    Unresolved  10       2014   .06
5004.24    OUT         10       2014   345.74
5004.24    OUT         11       2014   82.09

我现在还不熟悉这种方法。有没有办法在dplyr或其他有用的软件包中执行此操作?

dput:

structure(list(Tag = c("5004.24","5004.24","5010.04","5011.03","5011.07","5017.06","5025.22","5025.26","5032.24","5038.04","5039.03","5039.07","5045","5053.26","5066.04","5067.07","5073.06","5074.16","5088.11","5094.04","5101.06","5116.24","5123.03","5123.07","5150.04","5157.06","5158.16","5165.22","5172.24","5178.31","5179.03","5179.07","5186.16","5200.11","5206.31","5214.16","5228.24","5235","5242.16","5249","5256.24","5263.03","5263.07","5270.16","5284.11","5290.31","5298.16","5312.11","5318.04","5325.06","5326.16","5340.11","5346.31","5347.07","5361.26","5368.24","5374.04","5375.03","5375.07","5381.06","5402.31","5403.07","5431.07","5438.16","5445.26","5452.11","5465.06","5480.24","5487.03","5487.07","5493.06","5501.22","5508.11","5514.31","5536.11","5542.31","5550.16","5557.22","5564.24","5570.04","5571.03","5571.07","5577","5585.26","5592.11","5599.15","5599.15"),location = structure(c(1L,3L,2L,1L,3L),.Label = c("IN","OUT","Unresolved"
),class = "factor"),StartDateTime_UTC = structure(c(1412055000,1413555139,1413555343,1415095531,1415095686,1415789428,1415789493,1419394828,1419395487,1423434627,1423434691,1425208712,1425238476,1425895915,1425898082,1427101825,1427102287,1428621566,1428621700,1429647119,1429720434,1399659960,1466177340,1401991080,1379646000,1370960400,1483051080,1405706640,1418752267,1459531560,1396985284,1375894800,1479147120,1393363920,1412267940,1369233900,1455217320,1457547705,1457553022,1433262540,1413561840,1379646001,1391411708,1394225940,1452797700,1457551280,1457555076,1390330440,1400775420,1373994000,1477593060,1374166980,1401299940,1447954320,1456854360,1393888860,1472234940,1489406928,1489519785,1493754358,1493768930,1493920595,1495185477,1495192085,1496331260,1496332668,1502330689,1504055913,1504056565,1504471969,1504484502,1446829680,1450458660,1478808060,1391558460,1483040340,1489421085,1489524492,1500821550,1500840149,1379430000,1397144701,1397145001,1418064072,1425499236,1425499420,1426550367,1426705419,1456257069,1456260660,1471541820,1407518280,1477072440,1443805740,1425491220,1426570277,1445317719,1445333300,1452196320,1464295591,1464296126,1465051334,1465057497,1438967700,1398371160,1380208260,1472238360,1473868480,1473870631,1500912421,1500916070,1501200103,1450372080,1429632420,1417450780,1425319859,1425319875,1425476848,1425477093,1425490129,1425490424,1425497300,1425497428,1425497915,1425499161,1426058600,1426059054,1458149820,1416942495,1399653120,1411058361,1411058655,1425581564,1425582554,1463157660,1399655460,1423859700,1426434141,1426438127,1418054706,1423434226,1423434240,1424431972,1424432003,1425225584,1425226167,1413566460,1463156700,1457457262,1464341296,1464345439,1438964700,1385483040,1392229993,1392230720,1397856480,1454698803,1457554029,1457557132,1392415980,1394995835,1394996053,1394999500,1394999554,1398891904,1398892060,1400546392,1410656812,1410657169,1412681302,1383667200,1394063203,1394064827,1394534414,1394546963,1394903943,1394904559,1404290854,1404295213,1450460220,1444934640,1443814080,1423862520,1425398953,1425403026,1482517054,1382107500,1406303160,1401292020,1459530420,1397852580,1375806600,1459355400,1436547420,1434128700,1455181152),class = c("POSIXct","POSIXt"),tzone = "UTC"),EndDateTime_UTC = structure(c(1413555139,1433927952,1429640896,1469111014,1418051707,1471543984,1370960401,1515430302,1434199442,1428708492,1467434355,1396986786,1428517723,1514146891,1393363921,1418055830,1369233901,1457825205,1439032066,1414773568,1397736934,1425766912,1457560311,1400599389,1457359800,1499495745,1477593061,1374166981,1401396204,1447954321,1462853353,1402068722,1493915356,1502328647,1510842039,1449871257,1467438121,1489417066,1399047138,1402529845,1501331232,1397146277,1426550808,1457306762,1471541821,1407522783,1489406240,1447974368,1426563298,1460371267,1475823163,1438969110,1429569640,1380208261,1501192164,1522212002,1455503331,1429632421,1435973349,1518286934,1424550436,1448200301,1467374409,1411930669,1485696012,1426439051,1426387273,1414861317,1465669005,1464345554,1438964701,1402254843,1403572223,1457575616,1412892331,1414516048,1450460221,1445287124,1451876530,1433699058,1503067936,1394903394,1406514937,1411706590,1460054294,1413725736,1375806601,1485062993,1448650763,1455181152,1455181271),tzone = "UTC")),class = c("tbl_df","tbl","data.frame"),row.names = c(NA,-200L))
kwhyill 回答:如何使用dplyr按月划分事件间隔

在这种情况下,我喜欢编写一个辅助函数,该函数需要一个开始时间和一个结束时间,并以给定的单位大小将其分成多个较小的周期:

library(tidyverse)
library(lubridate)

split_interval <- function(start,end,unit) {
  breaks <- seq(floor_date(start,unit),ceiling_date(end,by = unit)
  timeline <- c(start,breaks[breaks > start & breaks < end],end)
  tibble(.start = head(timeline,-1),.end = tail(timeline,-1))
}

split_interval(now(),now() + hours(2),unit = "hours")
#> # A tibble: 3 x 2
#>   .start              .end               
#>   <dttm>              <dttm>             
#> 1 2019-11-08 23:33:18 2019-11-09 00:00:00
#> 2 2019-11-09 00:00:00 2019-11-09 01:00:00
#> 3 2019-11-09 01:00:00 2019-11-09 01:33:18

有了这个,我们可以分割每行的间隔,得到一个列表列,其中包含每行落入的周期的数据帧:

tbl_split <- tbl %>% 
  mutate(periods = map2(StartDateTime_UTC,EndDateTime_UTC,split_interval,unit = "months"))

tbl_split
#> # A tibble: 3 x 5
#>   Tag     location   StartDateTime_UTC   EndDateTime_UTC     periods       
#>   <chr>   <fct>      <dttm>              <dttm>              <list>        
#> 1 5004.24 IN         2014-09-30 05:30:00 2014-10-17 14:12:19 <tibble [2 x ~
#> 2 5004.24 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43 <tibble [1 x ~
#> 3 5004.24 OUT        2014-10-17 14:15:43 2014-11-04 10:05:31 <tibble [2 x ~

(我仅使用tbl中的前三行来保持输出较小。最后是代码。)

然后,我们可以unnest()创建一个更长的数据帧,现在每个原始行都被分成多个较短周期的行:

tbl_split_long <- tbl_split %>% unnest(periods)

tbl_split_long %>% 
  select(location,.start,.end)
#> # A tibble: 5 x 3
#>   location   .start              .end               
#>   <fct>      <dttm>              <dttm>             
#> 1 IN         2014-09-30 05:30:00 2014-10-01 00:00:00
#> 2 IN         2014-10-01 00:00:00 2014-10-17 14:12:19
#> 3 Unresolved 2014-10-17 14:12:19 2014-10-17 14:15:43
#> 4 OUT        2014-10-17 14:15:43 2014-11-01 00:00:00
#> 5 OUT        2014-11-01 00:00:00 2014-11-04 10:05:31

现在,我们可以使用每个期间的开始来确定年和月,然后只需分组汇总即可得出最终结果:

tbl_split_long %>% 
  group_by(Tag,location,year = year(.start),month = month(.start)) %>% 
  summarize(duration = sum(difftime(.end,units = "hours")))
#> # A tibble: 5 x 5
#> # Groups:   Tag,year [3]
#>   Tag     location    year month duration          
#>   <chr>   <fct>      <dbl> <dbl> <drtn>            
#> 1 5004.24 IN          2014     9  18.50000000 hours
#> 2 5004.24 IN          2014    10 398.20527778 hours
#> 3 5004.24 OUT         2014    10 345.73805556 hours
#> 4 5004.24 OUT         2014    11  82.09194444 hours
#> 5 5004.24 Unresolved  2014    10   0.05666667 hours

数据:

tbl <- structure(
  list(
    Tag = c("5004.24","5004.24","5004.24"),location = structure(
      c(1L,3L,2L),.Label = c("IN","OUT","Unresolved"),class = "factor"
    ),StartDateTime_UTC = structure(
      c(1412055000,1413555139,1413555343),class = c("POSIXct","POSIXt"),tzone = "UTC"
    ),EndDateTime_UTC = structure(
      c(1413555139,1413555343,1415095531),tzone = "UTC"
    )
  ),row.names = c(NA,3L),class = c("tbl_df","tbl","data.frame")
)
本文链接:https://www.f2er.com/3135447.html

大家都在问