PySpark PipelineModel.transform error: 'Field "cut_catVec" does not exist.\nAvailable fields'

2024-05-02 • Q&A

I am trying to run MLlib in PySpark to predict prices, using a DataFrame with the following schema:

[('cut','string'),('color',('clarity',('carat','double'),('table','int'),('x',('y',('z',('price','int')]

After identifying the categorical and numeric columns:

`categ_col = ['cut','color','clarity']
num_col = ['carat','table','x','y','z']`

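In the script below, I:

First use StringIndexer to convert the string/text values to numeric indices, then use OneHotEncoderEstimator
from Spark MLlib to convert each string-indexed value into a one-hot encoded vector.
VectorAssembler is then used to combine the features from multiple columns into a single vector.

Each step of the process is also appended to a stages array.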

`from pyspark.ml.feature import StringIndexer,OneHotEncoderEstimator,VectorAssembler
stages = []
for catcol in categ_col:
    # Index the string column, then one-hot encode the resulting index
    stringIndexer = StringIndexer(inputCol=catcol, outputCol=catcol + 'Index')
    OHencoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[catcol + "_catVec"])
stages += [stringIndexer, OHencoder]
# Assemble the encoded categorical columns and the numeric columns into one vector
assemblerInputs = [c + "_catVec" for c in categ_col] + num_col
Vectassembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [Vectassembler]`

When I move on to the next step:

    `import pandas as pd
    from pyspark.ml import Pipeline
    cols = mllipdf.columns
    # Fit all stages on the data, then apply them
    pipeline = Pipeline(stages=stages)
    pipelineModel = pipeline.fit(mllipdf)
    mllipdf = pipelineModel.transform(mllipdf)
    # Keep the assembled feature vector alongside the original columns
    selectedCols = ['features'] + cols
    mllipdf = mllipdf.select(selectedCols)
    pd.DataFrame(mllipdf.take(5), columns=mllipdf.columns)`

I get an error at the line "mllipdf = pipelineModel.transform(mllipdf)" saying "IllegalArgumentException: 'Field "cut_catVec" does not exist.\nAvailable fields: cut,color,clarity,carat,table,x,y,z,price,clarityIndex,clarity_catVec'"

I am not sure what is going wrong here.
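
One likely reading of the error: the available fields include only clarityIndex and clarity_catVec, which suggests that only the last iteration's StringIndexer and encoder ended up in stages. That would happen if the `stages += [stringIndexer,OHencoder]` line runs after the `for` loop instead of inside it. The snippet below is a minimal plain-Python sketch of that effect, not the Spark code itself: strings stand in for the real stage objects, and `build_stages` and `append_inside_loop` are hypothetical names introduced only for illustration.

```python
# Plain-Python stand-in for the stage-building loop: strings replace the
# real StringIndexer / OneHotEncoderEstimator objects (an assumption made
# so the sketch runs without Spark).
categ_col = ['cut', 'color', 'clarity']
num_col = ['carat', 'table', 'x', 'y', 'z']


def build_stages(append_inside_loop):
    """Return the output column names the appended stages would produce."""
    produced = []
    for catcol in categ_col:
        indexer_out = catcol + 'Index'
        encoder_out = catcol + '_catVec'
        if append_inside_loop:
            # Corrected placement: runs once per categorical column.
            produced += [indexer_out, encoder_out]
    if not append_inside_loop:
        # Placement assumed from the error message: runs once, after the
        # loop, so only the last iteration's values ('clarity') are added.
        produced += [indexer_out, encoder_out]
    return produced


# Same expression as in the question: the assembler expects every *_catVec column.
assembler_inputs = [c + '_catVec' for c in categ_col] + num_col

as_posted = build_stages(append_inside_loop=False)
corrected = build_stages(append_inside_loop=True)

# Encoded columns the assembler expects but no stage produces.
missing_as_posted = [c for c in assembler_inputs
                     if c.endswith('_catVec') and c not in as_posted]
missing_corrected = [c for c in assembler_inputs
                     if c.endswith('_catVec') and c not in corrected]

print(as_posted)
print(missing_as_posted)
print(missing_corrected)
```

If this is indeed the cause, the corresponding change in the original script would be to indent `stages += [stringIndexer,OHencoder]` so that it sits inside the `for catcol in categ_col:` loop, giving the pipeline an indexer and an encoder for cut and color as well as clarity.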


