I have a PySpark DataFrame with many columns, including an Array-type column and a String column:
numbers <Array> | name<String>
------------------------------|----------------
["160001","160021"] | A
------------------------------|----------------
["160001","1600","42345"] | B
------------------------------|----------------
["160001","9867","42345"] | C
------------------------------|----------------
["160001","8650","2345"] | A
------------------------------|----------------
["2456","78568","42345"] | B
-----------------------------------------------
I want to skip the numbers that contain 4 digits from the numbers column if the name column is not "B", and keep them if the name column is "B".
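To make the rule concrete, here is a minimal plain-Python sketch of the logic I'm after. It is not PySpark code, and `filter_numbers` is a hypothetical helper name used only for illustration. It assumes "contains 4 digits" means a string made of exactly four digit characters.

```python
def filter_numbers(numbers, name):
    """Drop 4-digit entries from `numbers` unless the row's name is "B"."""
    if name == "B":
        # Rows named "B" keep every entry, including 4-digit ones.
        return list(numbers)
    # Any other name: remove entries that are exactly four digits long.
    return [n for n in numbers if not (len(n) == 4 and n.isdigit())]


# The sample rows from the table above, as (numbers, name) pairs.
rows = [
    (["160001", "160021"], "A"),
    (["160001", "1600", "42345"], "B"),
    (["160001", "9867", "42345"], "C"),
    (["160001", "8650", "2345"], "A"),
    (["2456", "78568", "42345"], "B"),
]

# Apply the rule row by row.
result = [(filter_numbers(nums, name), name) for nums, name in rows]

for nums, name in result:
    print(nums, "|", name)
```

What I'm looking for is the equivalent of this applied to the array column of the DataFrame.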
For example:
In lines 2 and 5, I have "1600" and "2456", which contain 4 digits, and the name column is "B", so I should keep them in the column values:
------------------------------|----------------
["160001","42345"] | B
------------------------------|----------------
["2456","42345"] | B
-----------------------------------------------
In lines 3 and 4, the numbers column contains a 4-digit number, but the name column is different from "B", so I should skip them.
Example:
------------------------------|----------------
["160001","2345"] | A
------------------------------|----------------
Expected result:
numbers <Array> | name<String>
------------------------------|----------------
["160001","42345"] | C
------------------------------|----------------
["160001"] | A
------------------------------|----------------
["2456","42345"] | B
-----------------------------------------------
How can I do this? Thanks.