I have the following dataframe df4:
|Itemno |fits_assembly_id |fits_assembly_name |assembly_name |
|---|---|---|---|
|0450056 |13039 135502 141114 4147 138865 2021 9164 |OIL PUMP ASSEMBLY A01EA09CA 4999202399920239A06 A02EA09CA A02EA09CB A02EA09CC |OIL PUMP ASSEMBLY 999202399920239A06 |
and I am using the following code to process/clean this dataframe:
from pyspark.ml.feature import StopWordsRemover, RegexTokenizer
from pyspark.sql.functions import expr

# Task-1: Regex Tokenizer
tk = RegexTokenizer(pattern=r'(?:\p{Punct}|\s)+', inputCol='fits_assembly_name', outputCol='temp1')
df5 = tk.transform(df4)

# Task-2: StopWordsRemover
sw = StopWordsRemover(inputCol='temp1', outputCol='temp2')
df6 = sw.transform(df5)

# Task-3: Remove duplicate tokens
df7 = df6.withColumn('fits_assembly_name', expr('concat_ws(" ", array_distinct(temp2))')) \
         .drop('temp1', 'temp2')
I would like to run both columns, fits_assembly_name and assembly_name, through the RegexTokenizer & StopWordsRemover steps in a single pass. Could you share how to achieve this?