How to convert a Spark DenseMatrix into a Spark DataFrame

I am trying to implement some code in Scala Spark where I have a multiclass logistic regression model, and the model produces a coefficient matrix.

Here is the code -

val training = spark.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")


training.show(false)
+-----+--------------------------------------------------+
|label|features                                          |
+-----+--------------------------------------------------+
|1.0  |(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333]) |
|1.0  |(4,[0,1,2,3],[-0.555556,0.25,-0.864407,-0.916667])|
|0.0  |(4,[0,1,2,3],[0.166667,-0.416667,0.457627,0.5])   |
|1.0  |(4,[0,1,2,3],[-0.5,0.75,-0.830508,-1.0])          |
|...  |...                                               |
+-----+--------------------------------------------------+

There are 3 labels for which I am fitting the model.

scala> training.select("label").distinct.show
+-----+
|label|
+-----+
|  0.0|
|  1.0|
|  2.0|
+-----+

Fitting the logistic regression model

import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

Now, when I try to view the coefficient matrix, it gives me a matrix with 3 rows (3 labels) and 4 columns (4 input features).

scala> lrModel.coefficientMatrix.toDense
res13: org.apache.spark.ml.linalg.DenseMatrix =
0.0  0.0  0.0                  0.3176483191238039
0.0  0.0  -0.7803943459681859  -0.3769611423403096
0.0  0.0  0.0                  0.0
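
The rows of this matrix correspond to labels and the columns to features, so individual coefficients can be read by index. A small illustration against the model trained above (the accessors are the standard ml.linalg ones; the printed value is taken from the output shown):

lrModel.coefficientMatrix.numRows  // 3  -> one row per label
lrModel.coefficientMatrix.numCols  // 4  -> one column per feature
lrModel.coefficientMatrix(1, 2)    // coefficient of feature 2 for label 1: -0.7803943459681859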

Here are the intercepts for each label -

scala> lrModel.interceptVector
res15: org.apache.spark.ml.linalg.Vector = [0.05165231659832854,-0.12391224990853622,0.07225993331020768]

I want to use the coefficient matrix and the intercept vector to create a feature importance Spark DataFrame, so that the final DataFrame looks like this -

label feature name  coefficient intercept
0         0             0         0.051
0         1             0         0.051
0         2             0         0.051
0         3             0.3176    0.051
1         0             0         -0.123
1         1             0         -0.123
1         2             -0.78     -0.123
1         3             -0.37     -0.123
2         0             0         0.072
2         1             0         0.072
2         2             0         0.072
2         3             0         0.072

There is one coefficient for every label and feature combination, so the total number of records in the output is labels * features, i.e. 3 * 4 = 12.

I would like this process to be dynamic and wrapped in a function, so that it can be reused for any number of features and labels.

I am reading the data from here - https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt

ttc123's answer: How to convert a Spark DenseMatrix into a Spark DataFrame

I am assuming here that lr is your logistic regression (PySpark in my example). In the code below I try to do a multinomial logistic regression.

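With the model fitted as in the question, a minimal sketch of the conversion in Scala, matching the question's code rather than PySpark (the helper name coefficientsToDF and its exact shape are just illustrative):

import org.apache.spark.ml.classification.LogisticRegressionModel

// Expand the coefficient matrix and the intercept vector into
// one row per (label, feature) pair and turn it into a DataFrame.
def coefficientsToDF(model: LogisticRegressionModel) = {
  val m = model.coefficientMatrix
  val intercepts = model.interceptVector
  val rows = for {
    label   <- 0 until m.numRows
    feature <- 0 until m.numCols
  } yield (label, feature, m(label, feature), intercepts(label))
  spark.createDataFrame(rows)
    .toDF("label", "feature name", "coefficient", "intercept")
}

coefficientsToDF(lrModel).show(false)

Because the dimensions are read from numRows and numCols, a 3 x 4 coefficient matrix yields the 12 rows requested above, and the same function can be reused for any number of labels and features.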

The above code worked for me; don't worry about the order in which the labels were assigned when you trained.

