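I am trying to use the Pipeline class from imblearn together with GridSearchCV to find the best parameters for classifying an imbalanced dataset. According to the answers mentioned in {{3}}, I want to leave the validation set out of resampling and resample only the training set, which imblearn's {{1}} appears to do. However, I get an error when implementing the accepted solution. Please let me know what I am doing wrong. Below is my implementation: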
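from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def imb_pipeline(clf, X, y, params):
    # imblearn's Pipeline runs the sampler only while fitting,
    # so the validation folds are scored without resampling.
    model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', clf)
    ])
    # Several metrics are tracked; refit='F1' picks the best model by F1.
    score = {'AUC': 'roc_auc', 'RECALL': 'recall', 'PRECISION': 'precision', 'F1': 'f1'}
    gcv = GridSearchCV(estimator=model, param_grid=params, cv=5, scoring=score,
                       n_jobs=12, refit='F1', return_train_score=True)
    gcv.fit(X, y)
    return gcv

# X_scaled (features) and y (labels) are defined elsewhere.
for param, classifier in zip(params, classifiers):
    print("Working on {}...".format(classifier[0]))
    clf = imb_pipeline(classifier[1], X_scaled, y, param)
    print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
    print("Best `F1` for {} is {}".format(classifier[0], clf.best_score_))
    print('-' * 50)
    print('\n')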
Parameters:
[{'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1.0, 10)},
 {'n_neighbors': (10, 15, 25)},
 {'n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)}]
Classifiers:
[('Logistic Regression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='warn', tol=0.0001, verbose=0,
                     warm_start=False)),
 ('KNearestNeighbors',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_neighbors=5, p=2,
                       weights='uniform')),
 ('Gradient Boosting Classifier',
  GradientBoostingClassifier(criterion='friedman_mse', init=None,
                             learning_rate=0.1, loss='deviance', max_depth=3,
                             max_features=None, max_leaf_nodes=None,
                             min_impurity_decrease=0.0, min_impurity_split=None,
                             min_samples_leaf=1, min_samples_split=2,
                             min_weight_fraction_leaf=0.0, n_estimators=100,
                             n_iter_no_change=None, presort='auto',
                             subsample=1.0, validation_fraction=0.1,
                             warm_start=False))]
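The error text itself is not included here, so what follows is only a guess at one likely cause. When the estimator passed to GridSearchCV is a pipeline, each grid key has to name the step it belongs to, in the form `<step>__<parameter>`, while the grids above use bare names such as `C` and `n_neighbors`. A minimal sketch, assuming the step name 'classification' from the pipeline above; `prefix_grid` is a hypothetical helper written for illustration, not part of scikit-learn or imblearn:

```python
def prefix_grid(step_name, grid):
    """Return a copy of `grid` whose keys are addressed to one pipeline step.

    Pipelines expose nested parameters as '<step name>__<parameter name>',
    so 'C' on the step called 'classification' becomes 'classification__C'.
    """
    return {"{}__{}".format(step_name, key): values for key, values in grid.items()}


# Grids copied from the question; 'classification' is the classifier step's name.
params = [
    {'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1.0, 10)},
    {'n_neighbors': (10, 15, 25)},
    {'n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)},
]
prefixed_params = [prefix_grid('classification', grid) for grid in params]

# Show the key names each grid would hand to GridSearchCV.
for grid in prefixed_params:
    print(sorted(grid))
```

If this is indeed the problem, passing the entries of `prefixed_params` to `imb_pipeline` in place of the bare grids should let GridSearchCV route each value to the classifier step. Calling `get_params().keys()` on the pipeline lists the parameter names it accepts, which is a quick way to check.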