2 years ago
#67597
Baraliuh
Avoid xgb.cv from using testing data when used with the argument early_stopping_rounds (or alternative work-arounds)
I am currently working with the xgboost package in R. One thing that started bothering me is that when I set the early_stopping_rounds parameter in the xgboost::xgb.cv function, the one fold left out for validation/testing appears to be used to determine when to stop. This seems counterintuitive, since the point of CV is to evaluate model performance on out-of-sample data.
(Disclaimer: I am mainly interested in using this for hyperparameter tuning.)
Here is a small example that runs xgboost::xgb.cv on simulated data and then compares the out-of-fold predictions against the real responses. The RMSE computed on the test folds matches the test-rmse that early_stopping_rounds used to stop the training.
# Load packages
## install.packages('vip')
library(xgboost)
library(purrr)

# For reproducibility
set.seed(1)

# Generate some data and build the DMatrix (features vs. label 'y')
sim_data <- vip::gen_friedman(600)
xgb_input <- sim_data %>%
  as.matrix() %>%
  {xgb.DMatrix(.[, !'y' == colnames(.)], label = .[, 'y' == colnames(.)])}

# Fit the model with 6-fold CV and early stopping
model <- xgb.cv(
  params = list(eta = .4),
  data = xgb_input,
  nrounds = 100,
  early_stopping_rounds = 5,
  nfold = 6,
  print_every_n = 28,
  prediction = TRUE
)
#> [1] train-rmse:9.205993+0.066525 test-rmse:9.291344+0.511701
#> Multiple eval metrics are present. Will use test_rmse for early stopping.
#> Will train until test_rmse hasn't improved in 5 rounds.
#>
#> [29] train-rmse:0.115219+0.010687 test-rmse:1.675245+0.097033
#> Stopping. Best iteration:
#> [29] train-rmse:0.115219+0.010687 test-rmse:1.675245+0.097033
# Evaluate the out-of-fold predictions against the real responses
map_dbl(model$folds,
        ~vip::metric_rmse(sim_data$y[.x], model$pred[.x])
) %>%
  {cat(paste0(round(mean(.), 5), '+', round(sd(.), 6)))}
#> 1.67524+0.106294
Created on 2022-01-18 by the reprex package (v2.0.0)
Question: is there a way to use the machinery of xgb.cv but set aside 1 fold for testing, 1 (different) fold for the early_stopping_rounds evaluation, and the remaining n-2 folds for training? Or do I simply have to program such a procedure myself? It is a bit unclear to me from the manual.
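For concreteness, here is a minimal sketch of the kind of procedure I have in mind, written by hand (the fold pairing, object names, and split logic are just my own choices; it reuses sim_data from the reprex above):

# Hand-rolled CV where each repetition uses one fold for testing, a different
# fold for early stopping, and the remaining folds for training.
set.seed(1)
n     <- nrow(sim_data)
k     <- 6
folds <- sample(rep(seq_len(k), length.out = n))
x     <- as.matrix(sim_data[, setdiff(colnames(sim_data), 'y')])
y     <- sim_data$y

test_rmse <- vapply(seq_len(k), function(i) {
  j        <- if (i == k) 1L else i + 1L   # a different fold for early stopping
  test_id  <- which(folds == i)
  valid_id <- which(folds == j)
  train_id <- which(!folds %in% c(i, j))

  fit <- xgb.train(
    params  = list(eta = .4),
    data    = xgb.DMatrix(x[train_id, ], label = y[train_id]),
    nrounds = 100,
    early_stopping_rounds = 5,
    # early stopping only ever sees the validation fold, never the test fold
    watchlist = list(valid = xgb.DMatrix(x[valid_id, ], label = y[valid_id])),
    verbose = 0
  )
  # the test fold is touched only after training has stopped
  sqrt(mean((y[test_id] - predict(fit, x[test_id, ]))^2))
}, numeric(1))

mean(test_rmse); sd(test_rmse)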
Alternatively, is it better to first extract some data to use for early stopping (using the watchlist argument) and use the remainder for xgb.cv? How would this play out in hyperparameter tuning? Would it be better to use the same fixed watchlist set for all parameter settings?
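Something along these lines is what I mean (the hold-out size and object names are arbitrary assumptions on my part; sim_data comes from the reprex above):

# Keep a fixed hold-out set purely for early stopping ...
set.seed(1)
hold_id <- sample(nrow(sim_data), size = 100)
x       <- as.matrix(sim_data[, setdiff(colnames(sim_data), 'y')])
y       <- sim_data$y

fit <- xgb.train(
  params  = list(eta = .4),
  data    = xgb.DMatrix(x[-hold_id, ], label = y[-hold_id]),
  nrounds = 100,
  early_stopping_rounds = 5,
  watchlist = list(hold = xgb.DMatrix(x[hold_id, ], label = y[hold_id])),
  verbose = 0
)

# ... then cross-validate on the remainder with that fixed number of rounds
# (assuming fit$best_iteration holds the round picked by early stopping),
# so no test fold ever influences when training stops.
cv <- xgb.cv(
  params     = list(eta = .4),
  data       = xgb.DMatrix(x[-hold_id, ], label = y[-hold_id]),
  nrounds    = fit$best_iteration,
  nfold      = 6,
  prediction = TRUE,
  verbose    = 0
)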
Edit: xgb.cv does not seem to accept a watchlist argument. But one could leave some data out for actual testing and predict with each of the n fold models on that data.
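A rough sketch of that idea (I am assuming the cb.cv.predict(save_models = TRUE) callback keeps the per-fold models, as the 1.x R documentation seems to suggest; the split and names are mine):

# Hold out a true test set, cross-validate on the rest, keep the fold models,
# and evaluate each of them on the untouched test set.
set.seed(1)
test_id <- sample(nrow(sim_data), size = 100)
x       <- as.matrix(sim_data[, setdiff(colnames(sim_data), 'y')])
y       <- sim_data$y

cv <- xgb.cv(
  params    = list(eta = .4),
  data      = xgb.DMatrix(x[-test_id, ], label = y[-test_id]),
  nrounds   = 100,
  early_stopping_rounds = 5,
  nfold     = 6,
  verbose   = 0,
  callbacks = list(cb.cv.predict(save_models = TRUE))  # assumption: stores cv$models
)

# Predict the held-out test set with each fold's model and summarise the error.
pred_mat  <- sapply(cv$models, function(m) predict(m, x[test_id, ]))
rmse_each <- apply(pred_mat, 2, function(p) sqrt(mean((y[test_id] - p)^2)))
mean(rmse_each); sd(rmse_each)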
Tags: r, xgboost, cross-validation