我的代码:
df=temp_df.groupBy('date','id').count()
windowval = (Window.partitionBy('date','id').orderBy('date','id').rangeBetween(Window.unboundedPreceding,0))
final_df = df.withColumn('cum_sum',F.sum('count').over(windowval)).orderBy('date','id').show()
请更正我的代码,我相信使用Window(rangeBetween)时会出错。
谢谢
DF:
+-------------------+------------------+-----+
| date| id|count|
+-------------------+------------------+-----+
|2007-11-04 00:00:00| 5| 4|
|2007-11-05 00:00:00| 5| 7|
|2007-11-06 00:00:00| 5| 3|
|2007-11-06 00:00:00| 8| 3|
|2007-11-07 00:00:00| 5| 7|
|2007-11-08 00:00:00| 5| 2|
|2007-11-08 00:00:00| 8| 4|
+-------------------+------------------+-----+
Expected output:
+-------------------+------------------+-----+-------+
| date| id|count|cum_sum|
+-------------------+------------------+-----+-------+
|2007-11-04 00:00:00| 5| 4| 4|
|2007-11-05 00:00:00| 5| 7| 11|
|2007-11-06 00:00:00| 5| 3| 14|
|2007-11-06 00:00:00| 8| 3| 3|
|2007-11-07 00:00:00| 5| 7| 21|
|2007-11-08 00:00:00| 5| 2| 23|
|2007-11-08 00:00:00| 8| 4| 7|
+-------------------+------------------+-----+-------+
My Output:
+-------------------+------------------+-----+-------+
| date| id|count|cum_sum|
+-------------------+------------------+-----+-------+
|2007-11-04 00:00:00| 5| 4| 4|
|2007-11-05 00:00:00| 5| 7| 7|
|2007-11-06 00:00:00| 5| 3| 3|
|2007-11-06 00:00:00| 8| 3| 3|
|2007-11-07 00:00:00| 5| 7| 7|
|2007-11-08 00:00:00| 5| 2| 2|
|2007-11-08 00:00:00| 8| 4| 4|
+-------------------+------------------+-----+-------+