Regex pattern not working in PySpark after applying the logic

My data looks like this:

>>> df1.show()
+-----------------+--------------------+
|     corruptNames|       standardNames|
+-----------------+--------------------+
|Sid is (Good boy)|     Sid is Good Boy|
|    New York Life| New York Life In...|
+-----------------+--------------------+

So, based on the data above, I need to apply a regex to create a new column whose values look like the second column, standardNames. I tried the following code:

spark.sql("select *,case when corruptNames rlike '[^a-zA-Z ()]+(?![^(]*))' or corruptNames rlike 'standardNames' then standardNames else 0 end as standard from temp1").show()  

It throws the following error:

pyspark.sql.utils.AnalysisException: "cannot resolve '`standardNames`' given input columns: [temp1.corruptNames,temp1. standardNames];
Answer from fajkasjifise:

Try this example without the SQL select. I'm assuming that, when the regex pattern matches, you want to create a new column called standardNames based on corruptNames, and otherwise "do something else...".

Note: your pattern won't compile because you need to escape the second-to-last ) with a \.

pattern = '[^a-zA-Z ()]+(?![^(]*))' #this won't compile
pattern = r'[^a-zA-Z ()]+(?![^(]*\))' #this will
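As a quick sanity check outside Spark (just to illustrate the syntax problem; rlike uses Java's regex engine, which likewise rejects an unmatched closing parenthesis), Python's re module shows the difference between the two patterns:

import re

bad = '[^a-zA-Z ()]+(?![^(]*))'     # unbalanced ')' -> re.error
good = r'[^a-zA-Z ()]+(?![^(]*\))'  # escaped ')' -> compiles

try:
    re.compile(bad)
except re.error as e:
    print('bad pattern does not compile:', e)

re.compile(good)
print('escaped pattern compiles fine')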

Code

import pyspark.sql.functions as F

df_text = spark.createDataFrame([('Sid is (Good boy)',),('New York Life',)],('corruptNames',))

pattern = r'[^a-zA-Z ()]+(?![^(]*\))'

df = df_text.withColumn(
    'standardNames',
    F.when(F.col('corruptNames').rlike(pattern), F.col('corruptNames'))  # keep the value when the pattern matches
     .otherwise('Do something else')                                     # placeholder otherwise
)

df.show()

#+-----------------+---------------------+
#|     corruptNames|        standardNames|
#+-----------------+---------------------+
#|Sid is (Good boy)|    Do something else|
#|    New York Life|    Do something else|
#+-----------------+---------------------+
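As for the original AnalysisException: the error lists the input columns as [temp1.corruptNames, temp1. standardNames], which suggests the standardNames column was created with a leading space in its name, so the plain identifier standardNames cannot be resolved. A minimal sketch of one way to clean this up before querying, assuming the stray space really is in the column name:

# Hypothetical cleanup: strip whitespace from all column names, then re-register the temp view
df1 = df1.toDF(*[c.strip() for c in df1.columns])
df1.createOrReplaceTempView('temp1')

Once the column name resolves, the original spark.sql query should also run, provided the ) is escaped; with Spark's default parser settings a backslash inside a SQL string literal has to be doubled, so the escaped parenthesis is written as \\) in the rlike pattern.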

