在Hive中加入SCD Type 2表的性能

我在Hive中有2个表,这些表使用SCD Type 2(https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row)进行管理。基本上,每条记录都有两列valid_start_date和valid_end_date来指定有效期限。

2张桌子:

员工(请注意,employee_id 1于2010-10-20更改了地址)

employee_id    employee_name    employee_address    valid_start_date    valid_end_date
1              Bob              123 XXX             2010-01-01          2010-10-20 
2              Alice            999 YYY             2010-01-01          9999-12-31
1              Bob              567 ZZZ             2010-10-20          9999-12-31

工作场所(请注意,employee_id 1于2010年12月31日发生变更)

employee_id    building_id    valid_start_date    valid_end_date
1              1              2010-01-01          2010-12-31
1              2              2010-12-31          9999-12-31
2              2              2010-01-01          9999-12-31

我想加入他们并获得结果

employee_id    employee_name    employee_address    building_id    valid_start_date    valid_end_date
1              Bob              123 XXX             1              2010-01-01          2010-10-20
1              Bob              567 ZZZ             1              2010-10-20          2010-12-31
1              Bob              567 ZZZ             2              2010-12-31          9999-12-31
2              Alice            999 YYY             1              2010-01-01          9999-12-31

这是我想到的查询

WITH
  valid_dates AS (
    SELECT
        employee_id,valid_start_date    valid_date
    FROM
      employee
    UNION
    SELECT
        employee_id,valid_end_date      valid_date
    FROM
      employee
    UNION
    SELECT
        employee_id,valid_start_date    valid_date
    FROM
      workplace
    UNION
    SELECT
        employee_id,valid_end_date      valid_date
    FROM
      workplace
  ),valid_date_ranges AS (
    SELECT
        employee_id,valid_start_date,valid_end_date
    FROM (
      SELECT
          employee_id,valid_date                     valid_start_date,LEAD(valid_date,1) OVER (
            PARTITION BY employee_id
            ORDER BY valid_date)         valid_end_date
      FROM
        valid_dates
    ) valid_date_ranges_with_null
    WHERE
      valid_end_date IS NOT NULL
  )
SELECT
    vdr.employee_id,e.employee_name,e.employee_address,wp.building_id,vdr.valid_start_date,vdr.valid_end_date
FROM
  employee e
INNER JOIN
  valid_date_ranges vdr
ON
  e.employee_id = vdr.employee_id
LEFT OUTER JOIN -- there may be employees without workplace
  workplace wp
ON
  wp.employee_id = vdr.employee_id
WHERE
      e.valid_start_date < vdr.valid_end_date
  AND e.valid_end_date > vdr.valid_start_date
  AND wp.valid_start_date < vdr.valid_end_date
  AND wp.valid_end_date > vdr.valid_start_date
;

我似乎产生了正确的结果,但是我希望查询运行得更快(当前这是我的管道的瓶颈)。另外,我必须做同样的事情,最多将5个表连接在一起,每个表最多30亿行,所以我真的希望有一种方法可以优化此查询。你能帮我吗?谢谢!

我使用的是Hive 2.1.0,而且还没有非等额联接。

pk110987 回答:在Hive中加入SCD Type 2表的性能

暂时没有好的解决方案,如果你有好的解决方案,请发邮件至:iooj@foxmail.com
本文链接:https://www.f2er.com/3120036.html

大家都在问