How to filter array column values in PySpark

I have a PySpark DataFrame with many columns, including an Array-type column and a String column:

numbers  <Array>              |    name<String>
------------------------------|----------------
["160001","160021"]           |     A
------------------------------|----------------
["160001","1600","42345"]    |     B
------------------------------|----------------
["160001","9867","42345"]    |     C
------------------------------|----------------
["160001","8650","2345"]     |     A
------------------------------|----------------
["2456","78568","42345"]     |     B
-----------------------------------------------

I want to drop the 4-digit numbers from the numbers column if the name column is not "B", and keep them if the name column is "B". For example:

In rows 2 and 5, "1600" and "2456" contain 4 digits and the name column is "B", so I should keep them in the column values:

------------------------------|----------------
["160001","42345"]    |     B
------------------------------|----------------
["2456","42345"]     |     B
-----------------------------------------------

In rows 3 and 4, the numbers column contains a 4-digit number, but the name column is different from "B", so I should skip those numbers.

Example:

------------------------------|----------------
["160001","2345"]     |     A
------------------------------|----------------

Expected result:

numbers  <Array>              |    name<String>
------------------------------|----------------
["160001","42345"]           |     C
------------------------------|----------------
["160001"]                    |     A
------------------------------|----------------
["2456","42345"]     |     B
-----------------------------------------------

How can I do this? Thanks

i850528 answered: How to filter array column values in PySpark

Since Spark 2.4, you can use the higher-order function FILTER to filter an array. Combining it with an if expression solves the problem:

df.selectExpr("if(name != 'B', FILTER(numbers, x -> length(x) != 4), numbers) AS numbers", "name")
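
As a rough, self-contained sketch of the expression above, the snippet below rebuilds the sample data from the question and applies the same filter. The helper names build_filter_expr and run_example, the local SparkSession settings, and the parameters for the name and digit count are assumptions made for illustration; they are not part of the original answer.

```python
def build_filter_expr(keep_name="B", digits=4):
    """Build the Spark SQL expression passed to selectExpr.

    Hypothetical helper: rows whose name equals keep_name keep their array
    untouched; every other row loses the entries of the given length.
    """
    return (
        f"if(name != '{keep_name}', "
        f"FILTER(numbers, x -> length(x) != {digits}), "
        "numbers) AS numbers"
    )


def run_example():
    # pyspark is imported here so the helper above stays usable without Spark.
    from pyspark.sql import SparkSession

    # Assumed local session; adjust master/appName for a real cluster.
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("filter-array-column")
        .getOrCreate()
    )

    # Sample rows copied from the question.
    df = spark.createDataFrame(
        [
            (["160001", "160021"], "A"),
            (["160001", "1600", "42345"], "B"),
            (["160001", "9867", "42345"], "C"),
            (["160001", "8650", "2345"], "A"),
            (["2456", "78568", "42345"], "B"),
        ],
        ["numbers", "name"],
    )

    # FILTER is a higher-order SQL function and needs Spark 2.4 or later.
    result = df.selectExpr(build_filter_expr(), "name")
    result.show(truncate=False)
    spark.stop()
```

Calling run_example() on a machine with Spark 2.4+ installed should print the filtered DataFrame. With this expression, rows named "B" keep every entry, while rows with any other name lose their 4-character entries.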

You can also write a udf that filters the array and use it together with a when clause, so that the udf is applied only under a specific condition such as where name == B.
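
The code for this udf approach is not shown in the answer, so the following is only a minimal sketch of how it might look. The names drop_four_digit and apply_udf_filter are hypothetical, and the condition follows the question: rows named "B" are left untouched and the udf runs on all other rows.

```python
def drop_four_digit(numbers):
    """Return the array without its 4-character entries.

    A null array is passed through as None, and null entries inside the
    array are kept as they are.
    """
    if numbers is None:
        return None
    return [n for n in numbers if n is None or len(n) != 4]


def apply_udf_filter(df):
    """Apply drop_four_digit to the numbers column of rows not named "B".

    Hypothetical wrapper; expects a DataFrame with an array<string> column
    called numbers and a string column called name.
    """
    # pyspark is imported here so drop_four_digit can be tested without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    drop_four_digit_udf = F.udf(drop_four_digit, ArrayType(StringType()))

    return df.withColumn(
        "numbers",
        # Keep the array as-is for "B"; filter it with the udf otherwise.
        F.when(F.col("name") == "B", F.col("numbers")).otherwise(
            drop_four_digit_udf(F.col("numbers"))
        ),
    )
```

A Python udf serializes every row between the JVM and Python, so on Spark 2.4 or later the built-in FILTER expression is usually the cheaper choice. The udf version remains an option for older Spark versions.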