基于周范围的窗口总计列值(Impala)

给出如下表:

client_id   date            connections
---------------------------------------
121438297   2018-01-03      0
121438297   2018-01-08      1
121438297   2018-01-10      3
121438297   2018-01-12      1
121438297   2018-01-19      7
363863811   2018-01-18      0
363863811   2018-01-30      5
363863811   2018-02-01      4
363863811   2018-02-10      0

我正在寻找一种有效的方法来对当前行(当前行包括在总和中)之后的6天内发生的连接数进行求和,并按client_id进行分区,这将导致:

client_id   date            connections     connections_within_6_days
---------------------------------------------------------------------
121438297   2018-01-03      0               1        
121438297   2018-01-08      1               5     
121438297   2018-01-10      3               4     
121438297   2018-01-12      1               1                       
121438297   2018-01-19      7               7
363863811   2018-01-18      0               0
363863811   2018-01-30      5               9
363863811   2018-02-01      4               4
363863811   2018-02-10      0               0

问题:

  1. 我不想添加所有缺少的日期,然后执行滑动窗口以计数接下来的7行,因为我的表已经很大。

  2. 我正在使用Impala,但不支持range between interval '7' days following and current row


编辑:考虑到我需要将窗口大小更改为更大的数字(例如30天以上),我正在寻找一个通用的答案

chenhui651 回答:基于周范围的窗口总计列值(Impala)

这回答了问题的原始版本。

Impala不完全支持range between。不幸的是,这并没有太多选择。一种是将lag()与很多显式逻辑结合使用:

select t.*,( (case when lag(date,6) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections,6) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date,5) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections,5) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date,4) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections,4) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date,3) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections,3) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date,2) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections,2) over (partition by client_id order by date)
               else 0
          end) +
         (case when lag(date,1) over (partition by client_id order by date) = date - interval 6 day
               then lag(connections,1) over (partition by client_id order by date)
               else 0
          end) +
         connections
        ) as connections_within_6_days         
from t;

不幸的是,这并不能很好地概括。如果您希望在很长一段时间内,都可能要问另一个问题。

本文链接:https://www.f2er.com/3073874.html

大家都在问