Why does a Spark DataFrame filter take so much longer when I use a udf than when I do not?

2024-05-05 • Q&A

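In my test code I want to get the row count of a filtered DataFrame. I tried two approaches, and their running times differ greatly. Is there some complicated mechanism behind udf, and what should I do when I need complex filter logic?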

# Total row count.
# The "time" column has type timestamp.
logs_df.count()

# Method 1: built-in column function
import pyspark.sql.functions as f

# Took 2.95 s
%time logs_df.filter(f.second(logs_df.time) == 59).count()

# Method 2: Python UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def sd_filter(x):
    return x.second == 59
u_filter = udf(sd_filter, returnType=BooleanType())

# Took 50.5 s
%time logs_df.filter(u_filter(logs_df.time)).count()

The code and the time each method takes

Thank you very much.

superursine answered: Why does a Spark DataFrame filter take so much longer when I use a udf than when I do not?


pyspark user-defined-functions

Link to this article: https://www.f2er.com/3139763.html
