I am working with a DataFrame that looks like this:
val df = Seq(
  0.0, 0.0, 0.317, -0.78, -0.37, 0.0
).toDF("importance")
I then have more code that adds the labels and features columns as arrays, as shown below:
val labels = Array(0,1,2)
import org.apache.spark.sql.functions.typedLit
val df1 = df.withColumn("labels",typedLit(labels))
val featureNames = Array("a","b","c","d")
val df2 = df1.withColumn("features",typedLit(featureNames))
scala> df2.show(false)
+----------+---------+------------+
|importance|labels   |features    |
+----------+---------+------------+
|0.0       |[0, 1, 2]|[a, b, c, d]|
|0.0       |[0, 1, 2]|[a, b, c, d]|
|0.317     |[0, 1, 2]|[a, b, c, d]|
|-0.78     |[0, 1, 2]|[a, b, c, d]|
|-0.37     |[0, 1, 2]|[a, b, c, d]|
+----------+---------+------------+
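Now, using this DataFrame, I want to align each value of the importance column with each element of the labels and features arrays, so that every (label, feature) pair gets exactly one importance value. The output should look like this: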
label  feature name  importance
0      a             0
0      b             0
0      c             0
0      d             0.3176
1      a             0
1      b             0
1      c             -0.78
1      d             -0.37
2      a             0
2      b             0
2      c             0
2      d             0
So the first record has label=0 and feature=a, with importance = 0.
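One possible way to get this alignment is sketched below for spark-shell. It is only a sketch under some assumptions: that there is exactly one importance value per (label, feature) pair, and that the values are listed label by label in the order of the expected output above. The twelve values in importances are copied from that expected output, since the df snippet above lists only six. The names importances, indexed, pairs, result and the idx column are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("align-importance").getOrCreate()
import spark.implicits._

val labels = Array(0, 1, 2)
val featureNames = Array("a", "b", "c", "d")

// Assumption: one value per (label, feature) pair, in label-major order,
// taken from the expected output table in the question.
val importances = Seq(0.0, 0.0, 0.0, 0.3176, 0.0, 0.0, -0.78, -0.37, 0.0, 0.0, 0.0, 0.0)

// A DataFrame has no guaranteed row order, so attach an explicit position first.
val indexed = importances.zipWithIndex.toDF("importance", "idx")

// Enumerate every (label, feature) pair in the same label-major order
// and give each pair the same kind of positional index.
val pairs = (for (l <- labels; f <- featureNames) yield (l, f))
  .zipWithIndex
  .map { case ((l, f), i) => (i, l, f) }
  .toSeq
  .toDF("idx", "label", "feature")

// Join on the position so each importance value meets its pair.
val result = indexed
  .join(pairs, "idx")
  .orderBy("idx")
  .select("label", "feature", "importance")

result.show(false)
```

The explicit idx column is the key design choice here: exploding the constant labels and features arrays on every row would produce all twelve pairs for each importance value, not one pair per value. If the importance values already live in a DataFrame, the position would have to come from the data itself or from something like df.rdd.zipWithIndex, and that only helps if the row order is actually meaningful.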