Splitting a dataset for K-fold cross-validation in scikit-learn

2024-05-03 • Q&A

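I have been assigned a task that requires creating a decision tree classifier and determining its accuracy using a training set and 10-fold cross-validation. I went through the documentation for cross_val_predict, since I believe this is the function I need.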

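What I am having trouble with is splitting the dataset. As far as I know, the train_test_split() function is normally used to split the dataset into two parts, train and test. From what I understand, for K-fold validation you then need to split the training set further into K parts (folds).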

My question is: do I need to split the dataset into train and test at the start?

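It depends. My personal view is that yes, you do have to split the dataset into a training set and a test set, and then run K-fold cross-validation on the training set. Why? Because it is useful to evaluate the model on unseen examples after you have trained and tuned it.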

But some people just do cross-validation. Here is the workflow I often use:

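from sklearn.model_selection import train_test_split, cross_val_score

# Data partition: hold out 20% as a test set that is never used for tuning.
# X, Y, model, metric, my_param and plot_or_print_metric are placeholders
# for your own data, estimator, scoring name, parameter dicts and reporting code.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=21)

# Cross-validate several models on the training set to see which gives the best results
print('Start cross val')
cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
# Then inspect the scores you just obtained using mean, std or a plot
print('Mean CV-score : ' + str(cv_score.mean()))

# Then tune the hyperparameters of the best (or top-n best) model using another cross-validation
best_score, best_param = float('-inf'), None
for param in my_param:  # each param is a dict of estimator parameters
    model.set_params(**param)
    cv_score = cross_val_score(model, X_train, Y_train, scoring=metric, cv=5)
    print('Mean CV-score with param: ' + str(cv_score.mean()))
    if cv_score.mean() > best_score:
        best_score, best_param = cv_score.mean(), param

# Now that the best parameters are known, train the final model on the whole training set
model.set_params(**best_param)
model.fit(X_train, Y_train)

# And finally test the tuned model on the held-out test set
Y_pred = model.predict(X_test)
plot_or_print_metric(Y_pred, Y_test)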
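For a version of this workflow that runs end to end, here is a minimal sketch. It makes several assumptions that are not in the answer above: the iris dataset stands in for your own X and Y, the model is a DecisionTreeClassifier (as in the question), accuracy is the metric, and max_depth is the only hyperparameter being tuned.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own feature matrix and labels.
X, Y = load_iris(return_X_y=True)

# Hold out 20% as the final test set; stratify keeps class proportions equal.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=21, stratify=Y
)

# Assumption: max_depth is the only hyperparameter tuned in this sketch.
candidate_depths = [1, 2, 3, 5, None]
mean_scores = {}
for depth in candidate_depths:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 10-fold CV on the training set only; the test set stays untouched.
    scores = cross_val_score(model, X_train, Y_train, scoring="accuracy", cv=10)
    mean_scores[depth] = scores.mean()

# Pick the depth with the highest mean cross-validated accuracy.
best_depth = max(mean_scores, key=mean_scores.get)

# Refit on the whole training set, then score once on the held-out test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, Y_train)
test_accuracy = final_model.score(X_test, Y_test)

print("Best max_depth:", best_depth)
print("Test accuracy:", round(test_accuracy, 3))
```

The test set is used exactly once, at the very end, so the reported test accuracy is not influenced by the tuning loop.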

Short answer: No

Long answer: if you use K-fold validation, you usually do not split the data into train/test at the start.

There are many ways to evaluate a model. The simplest is a train/test split: fit the model on the train set and evaluate it on the test set.
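As a rough illustration of the simple train/test approach, the sketch below fits a decision tree on 80% of the data and scores it on the remaining 20%. The iris dataset, the 80/20 ratio and the random seeds are assumptions chosen only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own dataset.
X, y = load_iris(return_X_y=True)

# Single hold-out split: 80% for fitting, 20% for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21
)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Accuracy on samples the model never saw during fitting.
holdout_accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Hold-out accuracy:", round(holdout_accuracy, 3))
```

The result depends on which samples happen to land in the test set, which is the main motivation for averaging over several folds instead.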

With cross-validation, you fit and evaluate directly within each fold/iteration.
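A minimal sketch of that approach for the task in the question (a decision tree with 10-fold cross-validation) could look like the following. It passes the whole dataset to cross_val_score, which handles the splitting, fitting and scoring for each fold. The iris dataset is an assumption standing in for your own data. cross_val_predict, mentioned in the question, is shown as well: it returns one out-of-fold prediction per sample rather than one score per fold.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own dataset.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold; no manual train/test split is needed.
fold_scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())

# Each prediction comes from a model that did not see that sample in training.
y_pred = cross_val_predict(clf, X, y, cv=10)
pooled_accuracy = accuracy_score(y, y_pred)
print("Accuracy from pooled out-of-fold predictions:", pooled_accuracy)
```

For a classifier and an integer cv, scikit-learn uses stratified folds, so each fold keeps roughly the same class proportions as the full dataset.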

The choice is yours, but I would go with K-fold or LOOCV (leave-one-out cross-validation).
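For the LOOCV option, a minimal sketch might pass a LeaveOneOut splitter as the cv argument. Each iteration trains on all samples but one and tests on the single sample left out, so every per-iteration score is either 0 or 1. The iris dataset is again only an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris stands in for your own dataset.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# One fit per sample: train on n - 1 samples, test on the one left out.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

print("Number of fits:", len(loo_scores))
print("LOOCV accuracy:", loo_scores.mean())
```

LOOCV needs as many fits as there are samples, so it is usually practical only for small datasets; K-fold with K = 5 or 10 is a cheaper compromise.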

The K-fold procedure is summarized in the figure (for K = 5):

[Figure: K-fold procedure for K = 5]
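To make the K = 5 procedure concrete, the sketch below prints which indices end up in the training and test parts of each fold. The ten toy samples are an assumption chosen so the output is easy to read. Every sample appears in the test part exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Assumption: 10 toy samples, so each of the 5 folds tests on 2 of them.
X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    test_folds.append(test_idx.tolist())
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

In each iteration the model would be fit on the train indices and scored on the test indices, and the K scores are then averaged.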

likexiaoshuang answered: Splitting a dataset for K-fold cross-validation in scikit-learn
