Cumulative sum by group on a DataFrame - Pyspark

My code:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = temp_df.groupBy('date', 'id').count()
windowval = (Window.partitionBy('date', 'id').orderBy('date', 'id')
             .rangeBetween(Window.unboundedPreceding, 0))
final_df = df.withColumn('cum_sum', F.sum('count').over(windowval)).orderBy('date', 'id')
final_df.show()

Please correct my code; I believe the mistake is in how I use the Window (rangeBetween).

Thanks

DF:
+-------------------+------------------+-----+
|               date|                id|count|
+-------------------+------------------+-----+
|2007-11-04 00:00:00|                 5|    4|
|2007-11-05 00:00:00|                 5|    7|
|2007-11-06 00:00:00|                 5|    3|
|2007-11-06 00:00:00|                 8|    3|
|2007-11-07 00:00:00|                 5|    7|
|2007-11-08 00:00:00|                 5|    2|
|2007-11-08 00:00:00|                 8|    4|
+-------------------+------------------+-----+

Expected output:

+-------------------+------------------+-----+-------+
|               date|                id|count|cum_sum|
+-------------------+------------------+-----+-------+
|2007-11-04 00:00:00|                 5|    4|      4|
|2007-11-05 00:00:00|                 5|    7|     11|
|2007-11-06 00:00:00|                 5|    3|     14|
|2007-11-06 00:00:00|                 8|    3|      3|
|2007-11-07 00:00:00|                 5|    7|     21|
|2007-11-08 00:00:00|                 5|    2|     23|
|2007-11-08 00:00:00|                 8|    4|      7|
+-------------------+------------------+-----+-------+

My output:

+-------------------+------------------+-----+-------+
|               date|                id|count|cum_sum|
+-------------------+------------------+-----+-------+
|2007-11-04 00:00:00|                 5|    4|      4|
|2007-11-05 00:00:00|                 5|    7|      7|
|2007-11-06 00:00:00|                 5|    3|      3|
|2007-11-06 00:00:00|                 8|    3|      3|
|2007-11-07 00:00:00|                 5|    7|      7|
|2007-11-08 00:00:00|                 5|    2|      2|
|2007-11-08 00:00:00|                 8|    4|      4|
+-------------------+------------------+-----+-------+


SinoDinosaur answered: Cumulative sum by group on a DataFrame - Pyspark

Just change your current code to:

df = temp_df.groupBy('date', 'id').count()

windowval = (Window.partitionBy('id').orderBy('date')
             .rangeBetween(Window.unboundedPreceding, 0))

final_df = df.withColumn('cum_sum', F.sum('count').over(windowval)).orderBy('date', 'id')
final_df.show()

When you partition by both id and date, every (id, date) combination is unique, so each window frame contains only its own row and the "cumulative" sum never grows. You need to partition by id and order by date.
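As a quick sanity check of that logic without a Spark cluster, the same grouped running total can be reproduced in plain Python on the sample rows above. This is only an illustrative sketch; the helper name cum_sum_by_group is my own, not part of the original code:

```python
from collections import defaultdict

def cum_sum_by_group(rows):
    """Given (date, id, count) tuples, return them sorted by (date, id)
    with a running total of `count` per id appended, mirroring
    Window.partitionBy('id').orderBy('date') with an unbounded
    preceding frame."""
    totals = defaultdict(int)  # running total per id
    out = []
    for date, id_, count in sorted(rows):
        totals[id_] += count
        out.append((date, id_, count, totals[id_]))
    return out

# The sample data from the question (dates truncated to the day).
rows = [
    ('2007-11-04', 5, 4),
    ('2007-11-05', 5, 7),
    ('2007-11-06', 5, 3),
    ('2007-11-06', 8, 3),
    ('2007-11-07', 5, 7),
    ('2007-11-08', 5, 2),
    ('2007-11-08', 8, 4),
]

for row in cum_sum_by_group(rows):
    print(row)
```

The cum_sum column this produces (4, 11, 14, 3, 21, 23, 7) matches the expected output table, confirming that partitioning by id alone and ordering by date is the right window spec.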
