CatBoost and GridSearch

eval_dataset = Pool(val_data, val_labels)

model = CatBoostClassifier(depth=8,  # set depth to 8 or 10 by hand, one value per run
                           iterations=10, task_type="GPU", devices='0-2', eval_metric='Accuracy',
                           boosting_type="Ordered", bagging_temperature=0, use_best_model=True)
model.fit(train_data, y=label_data, eval_set=eval_dataset)

When I run the code above (as two separate runs, with depth set to 8 and then to 10), I get the following results:

Depth 10: 0.6864865, depth 8: 0.6756757

I would like to set up and run a grid search so that it runs exactly the same combinations and produces exactly the same results as when I run the code manually.

The GridSearchCV code:

model = CatBoostClassifier(iterations=10, depth=10, use_best_model=True)  # depth here is overridden by the grid

grid = {'depth': [8, 10]}
grid_search_result = GridSearchCV(model, grid, cv=2)
results = grid_search_result.fit(train_data, label_data, eval_set=eval_dataset)

My questions:

  1. I want GridSearchCV to use my eval_set to compare/validate all the different runs (as it does when I run manually), but it uses something else that I don't understand and does not seem to look at eval_set at all?

  2. It does not produce just 2 results; the number of runs depends on cv (the cross-validation split strategy), so it does 3, 5, 7, 9 or 11 runs. I don't want that.

  3. I tried stepping through the whole results object in the debugger, but I simply cannot find the validation accuracy score of the best run or of any of the other runs. I can find plenty of other values, but none of them match what I expect, and the numbers do not match anything produced with the eval_set dataset.
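For reference, GridSearchCV keeps its scores on the fitted search object rather than on the CatBoost model, and they are mean scores over the cv folds of train_data, which is why they do not match the eval_set numbers. A minimal sketch of where to look, using the grid_search_result object from the snippet above (these are the standard scikit-learn attributes):

# After grid_search_result.fit(...) has finished:
print(grid_search_result.best_params_)                    # e.g. {'depth': 8} or {'depth': 10}
print(grid_search_result.best_score_)                     # mean cross-validated score of the best candidate
print(grid_search_result.cv_results_['mean_test_score'])  # one mean score per parameter combination
print(grid_search_result.cv_results_['params'])           # the matching parameter combinations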

I solved my problem by implementing my own simple grid search (shared here in case it helps or inspires someone else :-)). If you have any comments on the code, please let me know :-)

import pandas as pd
from catboost import CatBoostClassifier,Pool
from sklearn.model_selection import GridSearchCV
import csv
from datetime import datetime

# Initialize data

train_data = pd.read_csv('./train_x.csv')
label_data = pd.read_csv('./labels_train_x.csv')
val_data = pd.read_csv('./val_x.csv')
val_labels = pd.read_csv('./labels_val_x.csv')

eval_dataset = Pool(val_data,val_labels)

ite = [1000,2000]
depth = [6,7,8,9,10]
max_bin = [None,32,46,100,254]
l2_leaf_reg = [None,2,10,20,30]
bagging_temperature = [None,0.5,1]
random_strength = [None,1,5,10]
total_runs = len(ite) * len(depth) * len(max_bin) * len(l2_leaf_reg) * len(bagging_temperature) * len(random_strength)

print('Total runs: ' + str(total_runs))

counter = 0

file_name = './Results/Catboost_' + str(datetime.now().strftime("%d_%m_%Y_%H_%M_%S")) + '.csv'

row = ['Validation accuracy','Logloss','Iterations','Depth','Max_bin','L2_leaf_reg','Bagging_temperature','Random_strength']
with open(file_name, 'a') as csvFile:
    writer = csv.writer(csvFile)
    writer.writerow(row)

for a in ite:
    for b in depth:
        for c in max_bin:
            for d in l2_leaf_reg:
                for e in bagging_temperature:
                    for f in random_strength:
                        model = CatBoostClassifier(task_type="GPU", use_best_model=True, eval_metric='Accuracy',
                                                   iterations=a, depth=b, max_bin=c, l2_leaf_reg=d,
                                                   bagging_temperature=e, random_strength=f)
                        counter += 1
                        print('Run # ' + str(counter) + '/' + str(total_runs))
                        result = model.fit(train_data, y=label_data, eval_set=eval_dataset, verbose=1)

                        # best_score_ holds the best value of each metric reached on the eval_set ('validation')
                        accuracy = float(result.best_score_['validation']['Accuracy'])
                        logLoss = result.best_score_['validation']['Logloss']

                        row = [accuracy, logLoss,
                               ('Auto' if a is None else a), ('Auto' if b is None else b),
                               ('Auto' if c is None else c), ('Auto' if d is None else d),
                               ('Auto' if e is None else e), ('Auto' if f is None else f)]

                        with open(file_name, 'a') as csvFile:
                            writer = csv.writer(csvFile)
                            writer.writerow(row)
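One comment on the script itself: the six nested loops can be collapsed with itertools.product from the standard library, which keeps the loop body flat and makes adding or removing a hyperparameter a one-line change. A sketch of the equivalent loop, reusing the parameter lists defined above:

import itertools

combinations = list(itertools.product(ite, depth, max_bin, l2_leaf_reg, bagging_temperature, random_strength))
print('Total runs: ' + str(len(combinations)))

for counter, (a, b, c, d, e, f) in enumerate(combinations, start=1):
    model = CatBoostClassifier(task_type="GPU", use_best_model=True, eval_metric='Accuracy',
                               iterations=a, depth=b, max_bin=c, l2_leaf_reg=d,
                               bagging_temperature=e, random_strength=f)
    print('Run # ' + str(counter) + '/' + str(len(combinations)))
    result = model.fit(train_data, y=label_data, eval_set=eval_dataset, verbose=1)
    # ...then extract best_score_ and write the CSV row exactly as in the loops above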
Answer by aqmes:

In CatBoost, the evaluation set acts as a hold-out set.

In GridSearchCV, cross-validation is performed on your train_data.

One solution is to concatenate your train_data and eval_dataset and pass the train/eval indices to GridSearchCV: the cv parameter also accepts an iterable of (train_indices, test_indices) pairs, so you can supply exactly one split in which your eval rows form the test fold. Then there is only a single split and a single accuracy per parameter combination, and it will give you the same results as your manual runs.
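A minimal sketch of that suggestion, assuming the variable names from the question (train_data, label_data, val_data, val_labels); scikit-learn's PredefinedSplit is one convenient way to hand GridSearchCV a single fixed train/validation split:

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Stack the training rows and the former eval_set rows into one dataset
X = pd.concat([train_data, val_data], ignore_index=True)
y = pd.concat([label_data, val_labels], ignore_index=True).values.ravel()

# -1 = row always stays in the training fold, 0 = row belongs to the single test fold
test_fold = np.r_[np.full(len(train_data), -1), np.full(len(val_data), 0)]

model = CatBoostClassifier(iterations=10, eval_metric='Accuracy')
grid = {'depth': [8, 10]}

search = GridSearchCV(model, grid, cv=PredefinedSplit(test_fold), scoring='accuracy', refit=False)
search.fit(X, y)

print(search.cv_results_['mean_test_score'])  # one accuracy per depth, measured on the old eval rows
print(search.cv_results_['params'])

With a single predefined split and two depth values this performs exactly two fits, and both accuracies are computed on the same rows as the manual eval_set runs. Note that use_best_model is left out here, because no eval_set is passed to fit inside the search.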

