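I am using pyspark.ml to train a machine learning model on JSON data stored in an S3 bucket, working from a JupyterLab notebook on AWS EMR. The bucket is not mine, but I believe access is working, since data preprocessing, feature engineering, and so on all run fine. However, when I call cv.fit(training_data), the training runs until it is almost complete (as indicated by the status bar) and then throws this error: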
Exception in thread cell_monitor-64:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.7/site-packages/awseditorssparkmonitoringwidget-1.0-py3.7.egg/awseditorssparkmonitoringwidget/cellmonitor.py", line 178, in cell_monitor
    job_binned_stages[job_id][stage_id] = all_stages[stage_id]
KeyError: 6571
I have not been able to find any information about this error. What exactly is going on?
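As far as I can tell from the traceback, the failing line is a plain dictionary lookup running in a separate thread (cell_monitor-64), not in the code that calls fit. The sketch below is not the widget's code: the data and the function name `monitor` are made up, and it only reproduces the same kind of KeyError in a background thread to show that the main thread can keep going afterwards.

```python
import threading

# Hypothetical stand-ins for the widget's data; not its real structures.
all_stages = {1: "stage-1", 2: "stage-2"}  # stage 3 is deliberately missing
job_binned_stages = {0: {}}
thread_errors = []


def monitor(job_id, stage_ids):
    """Mimic the failing line: copy stage info into a per-job dict."""
    try:
        for stage_id in stage_ids:
            job_binned_stages[job_id][stage_id] = all_stages[stage_id]
    except KeyError as exc:
        # Record the error instead of letting the thread print a traceback.
        thread_errors.append(exc)


t = threading.Thread(target=monitor, args=(0, [1, 2, 3]), name="cell_monitor-demo")
t.start()
t.join()

# The main thread is still running here even though the lookup failed.
print("errors seen in monitor thread:", thread_errors)
print("stages binned before the failure:", job_binned_stages[0])
```

If the same thing is happening in my notebook, the exception would come from the monitoring widget losing track of a stage ID rather than from the training itself, but that is only my guess.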
Here is my pipeline:
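from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Column names are placeholders. inputCols must be a list, and outputCol must be a new column (assumed here to be the scaler's input).
train, test = clean_df.randomSplit([0.8, 0.2], seed=42)
va1 = VectorAssembler(inputCols=["vars"], outputCol="to_scale")
scaler = StandardScaler(inputCol="to_scale", outputCol="scaled_features")
va2 = VectorAssembler(inputCols=["more_vars", "scaled_features"], outputCol="features")
gbt = GBTClassifier()
pipeline = Pipeline(stages=[va1, scaler, va2, gbt])
paramGrid = ParamGridBuilder()\
    .addGrid(gbt.maxDepth, [2, 5])\
    .addGrid(gbt.maxIter, [10, 100])\
    .build()
crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=MulticlassClassificationEvaluator(metricName="f1"),
    numFolds=3,
)
cvModel = crossval.fit(train)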
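For a sense of how much work that fit call does, here is a rough count of the pipeline fits implied by the grid and fold settings above. It is plain Python rather than Spark, the variable names are my own, and it assumes CrossValidator fits every parameter combination once per fold and then refits the best combination on the full training set.

```python
from itertools import product

# Values copied from the parameter grid and numFolds above.
max_depth_values = [2, 5]
max_iter_values = [10, 100]
num_folds = 3

# Every (maxDepth, maxIter) pair is one candidate configuration.
param_combinations = list(product(max_depth_values, max_iter_values))

# Each candidate is fitted once per fold during cross-validation.
cv_fits = len(param_combinations) * num_folds

# Plus one final refit of the best candidate on the whole training set.
total_fits = cv_fits + 1

print(len(param_combinations), cv_fits, total_fits)  # prints: 4 12 13
```

So the error shows up near the end of roughly 13 pipeline fits, each of which launches many Spark stages.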
Second, I have a hunch that this might be resolved in Python 3.8; can I install Python 3.8 on EMR?