KeyError when training a model with pyspark.ml on AWS EMR using data from an s3 bucket

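I am using pyspark.ml to train a machine learning model on .json data from an s3 bucket, working in a JupyterLab notebook on AWS EMR. The bucket is not mine, but I believe access works, because data preprocessing, feature engineering and so on all run fine. However, when I call cv.fit(training_data), training runs until it is almost finished (according to the progress bar) and then throws this error: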

Exception in thread cell_monitor-64:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py",line 917,in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/threading.py",line 865,in run
    self._target(*self._args,**self._kwargs)
  File "/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg/awseditorssparkmonitoringwidget/cellmonitor.py",line 178,in cell_monitor
    job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 6571

I have not found any information about this error. What is going on?

Here is my pipeline:

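from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# clean_df is the preprocessed DataFrame built earlier in the notebook
train, test = clean_df.randomSplit([0.8, 0.2], seed=42)

va1 = VectorAssembler(inputCols="vars", outputCol="vars")

scaler = StandardScaler(inputCol="to_scale", outputCol="scaled_features")

va2 = VectorAssembler(inputCols=["more_vars", "scaled_features"], outputCol="features")

gbt = GBTClassifier()

pipeline = Pipeline(stages=[va1, scaler, va2, gbt])

paramGrid = ParamGridBuilder()\
    .addGrid(gbt.maxDepth, [2, 5])\
    .addGrid(gbt.maxIter, [10, 100])\
    .build()

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=MulticlassClassificationEvaluator(metricName='f1'),
    numFolds=3,
)

cvModel = crossval.fit(train)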
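
For context on how much work the fit above launches: CrossValidator fits the estimator once per parameter combination per fold, then refits the best combination on the full training set. The following is a minimal sketch in plain Python (no Spark needed) that counts those fits for the grid and fold count in the pipeline above. The variable names are illustrative and not part of the original post.

```python
from itertools import product

# Grid values and fold count copied from the pipeline above
max_depth = [2, 5]
max_iter = [10, 100]
num_folds = 3

# ParamGridBuilder produces one param map per combination of grid values
param_maps = [{"maxDepth": d, "maxIter": i} for d, i in product(max_depth, max_iter)]

# One fit per param map per fold, plus one final refit of the best param map
cv_fits = len(param_maps) * num_folds
total_fits = cv_fits + 1

print(len(param_maps), cv_fits, total_fits)
```

Each of these pipeline fits runs many Spark jobs and stages of its own, so a single call to crossval.fit can plausibly reach stage ids in the thousands.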

Second, I have a hunch that this might be fixed in Python 3.8; can I install Python 3.8 on EMR?

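zhangyaoning answered: KeyError when training a model with pyspark.ml on AWS EMR using data from an s3 bucket

We ran into the same problem. We are using Hyperopt, and we simply added a try block to work around it. The error keeps appearing, but the job keeps running. The error seems to affect the bar that shows the internal progress of Spark jobs in the EMR notebook, but the pipeline still completes.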

from hyperopt import STATUS_OK, STATUS_FAIL

# Defining the hyperopt objective
def objetive(params):
    try:
        # Pipeline here with Vector Assembler and GBT
        # (metrics_val and output_dict come from that elided pipeline code)
        return {'loss': -metrics_val.areaUnderPR, "status": STATUS_OK, "output_dict": output_dict}
    except Exception as e:
        print("## Exception", e)
        return {'loss': 0, "status": STATUS_FAIL, "except": e, "output_dict": {"params": params}}

The exception we get is the following:

Exception in thread cell_monitor-18:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py",line 926,in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/threading.py",line 870,in run
    self._target(*self._args,**self._kwargs)
  File "/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg/awseditorssparkmonitoringwidget/cellmonitor.py",line 178,in cell_monitor
    job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 2395

But we still got all of the hyperopt output:

100%|##########| 5/5 [34:36<00:00,415.32s/trial,best loss: -0.3907675279893325]
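
Both tracebacks start with "Exception in thread cell_monitor-...", so the KeyError is raised in a separate monitoring thread and not in the thread that runs the training. In Python, an uncaught exception in one thread prints a traceback and ends that thread only, which would be consistent with the job finishing despite the error. Here is a minimal sketch of that behavior, assuming nothing about the real widget: monitor and train are hypothetical stand-ins for the progress widget and the training call.

```python
import threading

def monitor(all_stages, stage_id):
    # Hypothetical stand-in for the progress widget: looking up a stage id
    # that is missing from the dict raises KeyError inside this thread only.
    return all_stages[stage_id]

def train():
    # Hypothetical stand-in for the training call; runs in the main thread.
    return sum(range(10))

def run_with_monitor():
    t = threading.Thread(target=monitor, args=({}, 6571), name="cell_monitor-demo")
    t.start()
    t.join()  # the thread has ended with KeyError; its traceback goes to stderr
    return train()  # the main thread is unaffected and keeps going

result = run_with_monitor()
print(result)
```

Note that a try/except placed around the training code in the main thread would not catch an exception raised in another thread, so under this reading the try block is not what keeps the job alive.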