如何在窗口范围内取决于列值的窗口内部计算平均值?

我有以下数据:

columns = ['aircraft_id','liftoff','timestamp','value']

l =[
(   '0003177d',1550000476500,1550000467000,-80.15625),(   '0003177d',1550000467500,1550000468000,1550000468500,1550000469000,-79.8046875),1550000469500,1550000470000,1550000470500,1550000471000,1550000471500,1550000472000,1550000472500,-80.5078125),1550000473000,-80.859375),1550000473500,1550000474000,1550000474500,1550000475000,1550000475500,1550000476000,-80.5078125)]

df=spark.createDataFrame(l,columns)

df.show()
+-----------+-------------+-------------+-----------+
|aircraft_id|      liftoff|    timestamp|      value|
+-----------+-------------+-------------+-----------+
|   0003177d|1550000476500|1550000467000|  -80.15625|
|   0003177d|1550000476500|1550000467500|  -80.15625|
|   0003177d|1550000476500|1550000468000|  -80.15625|
|   0003177d|1550000476500|1550000468500|  -80.15625|
|   0003177d|1550000476500|1550000469000|-79.8046875|
|   0003177d|1550000476500|1550000469500|-79.8046875|
|   0003177d|1550000476500|1550000470000|-79.8046875|
|   0003177d|1550000476500|1550000470500|-79.8046875|
|   0003177d|1550000476500|1550000471000|-79.8046875|
|   0003177d|1550000476500|1550000471500|-79.8046875|
|   0003177d|1550000476500|1550000472000|  -80.15625|
|   0003177d|1550000476500|1550000472500|-80.5078125|
|   0003177d|1550000476500|1550000473000| -80.859375|
|   0003177d|1550000476500|1550000473500| -80.859375|
|   0003177d|1550000476500|1550000474000| -80.859375|
|   0003177d|1550000476500|1550000474500| -80.859375|
|   0003177d|1550000476500|1550000475000| -80.859375|
|   0003177d|1550000476500|1550000475500| -80.859375|
|   0003177d|1550000476500|1550000476000| -80.859375|
|   0003177d|1550000476500|1550000476500|-80.5078125|
+-----------+-------------+-------------+-----------+

我想计算一个窗口内部值的平均值,其中窗口之间的范围取决于时间戳的当前值到“提货”的时间戳。每架飞机的升空值都不同。

我尝试:

from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn('val',F.mean('value').over(Window.partitionBy('aircraft_id','ini_TO','liftoff').orderBy('timestamp').rangeBetween(df['timestamp'],df['liftoff']))

但是它不起作用,有解决方案吗?

预期结果:

  • 对于第一行,窗口的范围是1550000467000至1550000476500,因此平均值是20个值的总和并除以20(-80,33203)。
  • 对于第二行,窗口的范围是1550000467500至1550000476500,因此平均值是19个值的总和并除以19(-80,34128 |)。
  • 等等...
+-----------+-------------+-------------+---------+---------+
|aircraft_id|      liftoff|    timestamp|    value|      val|
+-----------+-------------+-------------+---------+---------+
|   0003177d|1550000476500|1550000467000|-80,15625|-80,33203|
|   0003177d|1550000476500|1550000467500|-80,34128|
|   0003177d|1550000476500|1550000468000|-80,35156|
|   0003177d|1550000476500|1550000468500|-80,36305|
|   0003177d|1550000476500|1550000469000|-79,80469|-80,37598|
|   0003177d|1550000476500|1550000469500|-79,41406|
|   0003177d|1550000476500|1550000470000|-79,45759|
|   0003177d|1550000476500|1550000470500|-79,50781|
|   0003177d|1550000476500|1550000471000|-79,56641|
|   0003177d|1550000476500|1550000471500|-79,63565|
|   0003177d|1550000476500|1550000472000|-80,71875|
|   0003177d|1550000476500|1550000472500|-80,50781|-80,78125|
|   0003177d|1550000476500|1550000473000|-80,85938|-80,81543|
|   0003177d|1550000476500|1550000473500|-80,80915|
|   0003177d|1550000476500|1550000474000|-80,80078|
|   0003177d|1550000476500|1550000474500|-80,78906|
|   0003177d|1550000476500|1550000475000|-80,77148|
|   0003177d|1550000476500|1550000475500|-80,74219|
|   0003177d|1550000476500|1550000476000|-80,68359|
|   0003177d|1550000476500|1550000476500|-80,50781|
+-----------+-------------+-------------+---------+---------+
zmg880831 回答:如何在窗口范围内取决于列值的窗口内部计算平均值?

我想您已经快到了,您只需要在窗口规范中设置rangeBetween即可从当前行Window.currentRow开始,直到窗口范围Window.unboundedFollowing的结尾,如下所示:

注意:示例数据集中未提供ini_TO,因此已从partitionBy中删除进行测试。

wind_spec = Window.partitionBy('aircraft_id','Liftoff').orderBy('timestamp').rangeBetween(Window.currentRow,Window.unboundedFollowing)

以上窗口将提供所需的输出:

df.withColumn('val',F.mean('value').over(wind_spec)).show()
+-----------+-------------+-------------+-----------+------------------+
|aircraft_id|      Liftoff|    timestamp|      value|               val|
+-----------+-------------+-------------+-----------+------------------+
|   0003177d|1550000476500|1550000467000|  -80.15625|      -80.33203125|
|   0003177d|1550000476500|1550000467500|  -80.15625|-80.34128289473684|
|   0003177d|1550000476500|1550000468000|  -80.15625|       -80.3515625|
|   0003177d|1550000476500|1550000468500|  -80.15625|-80.36305147058823|
|   0003177d|1550000476500|1550000469000|-79.8046875|    -80.3759765625|
|   0003177d|1550000476500|1550000469500|-79.8046875|       -80.4140625|
|   0003177d|1550000476500|1550000470000|-79.8046875|-80.45758928571429|
|   0003177d|1550000476500|1550000470500|-79.8046875|       -80.5078125|
|   0003177d|1550000476500|1550000471000|-79.8046875|      -80.56640625|
|   0003177d|1550000476500|1550000471500|-79.8046875| -80.6356534090909|
|   0003177d|1550000476500|1550000472000|  -80.15625|         -80.71875|
|   0003177d|1550000476500|1550000472500|-80.5078125|         -80.78125|
|   0003177d|1550000476500|1550000473000| -80.859375|    -80.8154296875|
|   0003177d|1550000476500|1550000473500| -80.859375|-80.80915178571429|
|   0003177d|1550000476500|1550000474000| -80.859375|      -80.80078125|
|   0003177d|1550000476500|1550000474500| -80.859375|       -80.7890625|
|   0003177d|1550000476500|1550000475000| -80.859375|     -80.771484375|
|   0003177d|1550000476500|1550000475500| -80.859375|       -80.7421875|
|   0003177d|1550000476500|1550000476000| -80.859375|      -80.68359375|
|   0003177d|1550000476500|1550000476500|-80.5078125|       -80.5078125|
+-----------+-------------+-------------+-----------+------------------+
本文链接:https://www.f2er.com/3124125.html

大家都在问