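I am trying to hand-code an ensemble machine learning model in R (for supervised binary classification). I know there are R packages that build ensembles for you (e.g. caretEnsemble), but the computer I am working on has access to only a very limited set of R packages, so I need to write this process by hand.
I want to build a stacked ensemble in which the first layer uses the random forest and "ada" algorithms. The first layer's predicted probabilities are then passed to a second layer, where the "xgboost" algorithm makes the final classification.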
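To make the intended data flow concrete, here is a minimal sketch of how I picture the two layers fitting together at prediction time. The helper name `predict_stack` and its arguments are hypothetical (they are not part of caret). It assumes caret is already attached, and that each first-layer model's `predict()` method accepts `type = "prob"` and returns a data frame with one column per class, which is how I understand caret's `train` objects to behave.

```r
# Hypothetical helper (my own name, not a caret function): run new data
# through both layers of a stacked ensemble in a single call.
predict_stack <- function(newdata, base_models, meta_model, positive = "M") {
  # First layer: one column of positive-class probabilities per base model.
  # The list names become the column names of the second-layer input.
  layer1 <- lapply(base_models, function(m) {
    predict(m, newdata, type = "prob")[[positive]]
  })
  layer1 <- as.data.frame(layer1)
  # Second layer: final class prediction from the first-layer probabilities.
  predict(meta_model, layer1)
}
```

With the models from my script this would be called as `predict_stack(stackset[, 1:60], list(pred_rf = model_rf, pred_ada = model_ada), model_xgboost)`. The list names have to match the column names the second-layer model was trained on.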
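I have attached my (reproducible) code below, using the Sonar dataset.
My questions:
- Is the code below correct? That is, does it accurately reflect the steps involved in building a stacked ensemble?
- Should the first layer be trained again on the whole training dataset (after the second layer has been built)?
- Can the two layers (first and second) be combined, so that data to be predicted only has to be fed in once?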
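Related to question 2: since `fitControl` asks caret to keep the out-of-fold predictions, one variant I am wondering about is to fit the first layer on all of `train_set` and train the second layer on those out-of-fold probabilities instead of on a separate split. Below is a minimal sketch, assuming plain (non-repeated) k-fold CV so that every training row is held out exactly once. `oof_prob` is a hypothetical helper of mine, and the commented lines reuse object names from the code below.

```r
# Hypothetical helper: extract out-of-fold probabilities for the positive class
# from a caret `train` object that was fitted with method = "cv",
# savePredictions = "final" and classProbs = TRUE.
oof_prob <- function(model, positive = "M") {
  p <- model$pred
  # Assumption: each training row was held out exactly once (non-repeated CV).
  stopifnot(!anyDuplicated(p$rowIndex))
  # rowIndex maps each held-out prediction back to its row in the training data.
  p <- p[order(p$rowIndex), ]
  p[[positive]]
}

# Intended use (commented out because it needs the fitted caret models):
# model_rf_all  <- train(train_set[, 1:60], train_set$Class, method = "rf",  trControl = fitControl)
# model_ada_all <- train(train_set[, 1:60], train_set$Class, method = "ada", trControl = fitControl)
# layer2_train  <- data.frame(pred_rf  = oof_prob(model_rf_all),
#                             pred_ada = oof_prob(model_ada_all))
# model_xgboost <- train(layer2_train, train_set$Class, method = "xgbTree")
```

I am not sure whether this is preferable to the separate `trainset`/`testset` split, but it would avoid setting aside a third of the training data for the second layer.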
`library(mlbench)  # provides the Sonar dataset
library(xgboost)
library(randomForest)
library(ada)
library(caret)
data(Sonar)
# Hold out 25% of the data for testing the final ensemble
index1 <- createDataPartition(y = Sonar$Class, p = 0.75, list = FALSE)
train_set <- Sonar[index1, ]
stackset <- Sonar[-index1, ]  ## for testing the final ensemble model
# Split the remaining data between the two layers
index2 <- createDataPartition(y = train_set$Class, p = 0.67, list = FALSE)
trainset <- train_set[index2, ]  ## for training the first-layer models (random forest and ada)
testset <- train_set[-index2, ]  ## for training the second-layer model (xgboost)
# Define the training control
fitControl <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = "final",  # save out-of-fold predictions for the best parameter combination
  classProbs = TRUE           # save the class probabilities of the out-of-fold predictions
)
# Train the random forest model (column 61 is the Class label)
model_rf <- train(trainset[, -61], trainset$Class, method = "rf", trControl = fitControl, tuneLength = 10)
# Train the ada model
model_ada <- train(trainset[, -61], trainset$Class, method = "ada", trControl = fitControl, tuneLength = 5)
# Predict first-layer probabilities for the testset data (columns 1:60 are the original predictors)
testset$pred_rf <- predict(model_rf, testset[, 1:60], type = "prob")$M
testset$pred_ada <- predict(model_ada, testset[, 1:60], type = "prob")$M
############## Fit the 2nd layer: xgboost model on the predicted probabilities from the earlier models
predictors <- c("pred_rf", "pred_ada")
model_xgboost <- train(testset[, predictors], testset$Class, method = "xgbTree", tuneLength = 3)
##### Usual approach:
### Predict first-layer probabilities for the stackset data:
stackset$pred_rf <- predict(model_rf, stackset[, 1:60], type = "prob")$M
stackset$pred_ada <- predict(model_ada, stackset[, 1:60], type = "prob")$M
### Finally, predict the held-out data with the xgboost model trained earlier:
stackset$pred_xgboost <- predict(model_xgboost, stackset[, predictors])
# confusionMatrix() expects the predictions first and the true classes second
confusionMatrix(data = stackset$pred_xgboost, reference = stackset$Class)
############## OR is this a better approach? Fit the 1st-layer models again on the whole training dataset
model_rf_final <- train(train_set[, -61], train_set$Class, method = "rf", trControl = fitControl, tuneLength = 10)
model_ada_final <- train(train_set[, -61], train_set$Class, method = "ada", trControl = fitControl, tuneLength = 5)
##### Now predict probabilities for the stackset data:
stackset$pred_rf <- predict(model_rf_final, stackset[, 1:60], type = "prob")$M
stackset$pred_ada <- predict(model_ada_final, stackset[, 1:60], type = "prob")$M
### Finally, predict with the xgboost model trained earlier:
stackset$pred_xgboost <- predict(model_xgboost, stackset[, predictors])
` Thanks