I am trying to write my own gradient boosting algorithm. I know there are existing packages like `gbm` and `xgboost`, but I wanted to understand how the algorithm works by writing my own. I am using the `iris` data set, and my outcome is `Sepal.Length` (continuous). My loss function is `mean(1/2*(y-yhat)^2)` (basically the mean squared error with a 1/2 in front), so my corresponding gradient is just the residual `y - yhat`. I am initializing the predictions at 0.
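As a quick sanity check (my own example, not part of the question's code; the values are arbitrary), the claim that the negative gradient of this loss with respect to `yhat` is the residual can be verified numerically:

```r
# Loss for a single observation: 1/2 * (y - yhat)^2
loss <- function(y, yhat) 0.5 * (y - yhat)^2

# Hypothetical values, chosen only for illustration
y    <- 5.1
yhat <- 4.0
eps  <- 1e-6

# Central-difference estimate of d(loss)/d(yhat)
num.grad <- (loss(y, yhat + eps) - loss(y, yhat - eps)) / (2 * eps)

# The negative gradient should equal the residual y - yhat
isTRUE(all.equal(-num.grad, y - yhat, tolerance = 1e-4))  # TRUE
```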
```r
library(rpart)
data(iris)

# Define gradient of the loss mean(1/2*(y - yhat)^2)
grad.fun <- function(y, yhat) {
  return(y - yhat)
}

grad_boost <- function(data, learning.rate, M, grad.fun) {
  # Initialize model list
  mod <- list()
  # Initialize fit to be 0
  fit <- rep(0, nrow(data))
  grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
  # Store the initial estimates
  mod[[1]] <- fit
  # Loop over a total of M iterations
  for (i in 1:M) {
    # Fit base learner (tree) to the gradient
    tmp <- data$Sepal.Length
    data$Sepal.Length <- grad
    base_learner <- rpart(Sepal.Length ~ ., data = data,
                          control = rpart.control(maxdepth = 2))
    data$Sepal.Length <- tmp
    # Fitted values from the current model
    fit <- fit + learning.rate * as.vector(predict(base_learner, newdata = data))
    # Update gradient
    grad <- grad.fun(y = data$Sepal.Length, yhat = fit)
    # Store current model (index is i + 1 because mod[[1]] contains the initialized estimates)
    mod[[i + 1]] <- base_learner
  }
  return(mod)
}
```
With this, I split the `iris` data set into training and testing data sets and fit the model to the training set:

```r
train.dat <- iris[1:100, ]
test.dat  <- iris[101:150, ]
learning.rate <- 0.001
M <- 1000
my.model <- grad_boost(data = train.dat, learning.rate = learning.rate,
                       M = M, grad.fun = grad.fun)
```
Now I calculate the predicted values from `my.model`. For `my.model`, the fitted values are:

0 (vector of initial estimates) + learning.rate * predictions from tree 1 + learning.rate * predictions from tree 2 + ... + learning.rate * predictions from tree M
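That accumulation can also be written as a small loop; `boost_predict` below is my own hypothetical helper, not from any package, assuming a list shaped like the one `grad_boost` returns (initial estimates in element 1, `rpart` trees after that):

```r
# Sketch of the formula above: start at the initial estimates (all 0)
# and add learning.rate times each tree's predictions in turn.
# `boost_predict` is a hypothetical name for illustration only.
boost_predict <- function(model, newdata, learning.rate) {
  fit <- rep(0, nrow(newdata))    # element 1 of the model: the zero initial estimates
  for (i in 2:length(model)) {    # elements 2..M+1: the fitted trees
    fit <- fit + learning.rate * as.vector(predict(model[[i]], newdata = newdata))
  }
  fit
}
```

Summing the per-tree contributions in any order gives the same result, since each term is just added to the running fit.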
I have a few questions:

- Does my gradient boosting algorithm look correct?
- Did I calculate the predicted values `yhats.mymod` correctly, as below?

```r
yhats.mymod <- apply(sapply(2:length(my.model),
                            function(x) learning.rate * predict(my.model[[x]], newdata = test.dat)),
                     1, sum)

# Calculate RMSE
sqrt(mean((test.dat$Sepal.Length - yhats.mymod)^2))
# [1] 2.612972
```