Prevent xgb.cv from using the test fold for early stopping when early_stopping_rounds is set (or alternative work-arounds)

I am currently working with the xgboost package in R. Something started bothering me when I set the early_stopping_rounds parameter in xgboost::xgb.cv: the one fold left out for validation/testing appears to be the data used to decide when to stop. This seems counterintuitive, since the whole point of CV is to evaluate model performance on out-of-sample data.

(Disclaimer: I am mainly interested in using this for hyperparameter tuning.)

Here is a small example that runs xgboost::xgb.cv on simulated data and then compares the reported test-fold error against the error computed from the out-of-fold predictions. The two match exactly at the round where early_stopping_rounds stops training, which shows that the held-out folds are driving the stopping decision.

# Load packages
## install.packages('vip')  # only needed for gen_friedman() and metric_rmse()
library(xgboost)
library(purrr)

# For reproducibility
set.seed(1)
# Generate some data
sim_data <- vip::gen_friedman(600)
xgb_input <- sim_data %>% 
  as.matrix() %>% 
  {xgb.DMatrix(.[, !'y' == colnames(.)], label = .[, 'y' == colnames(.)])}

# Fit the model
model <- xgb.cv(
  params = list(eta = .4),
  data = xgb_input, 
  nrounds = 100, 
  early_stopping_rounds = 5,
  nfold = 6,
  print_every_n = 28,
  prediction = TRUE  # keep the out-of-fold predictions
)
#> [1]  train-rmse:9.205993+0.066525    test-rmse:9.291344+0.511701 
#> Multiple eval metrics are present. Will use test_rmse for early stopping.
#> Will train until test_rmse hasn't improved in 5 rounds.
#> 
#> [29] train-rmse:0.115219+0.010687    test-rmse:1.675245+0.097033 
#> Stopping. Best iteration:
#> [29] train-rmse:0.115219+0.010687    test-rmse:1.675245+0.097033
# Evaluate folds: per-fold RMSE of the out-of-fold predictions vs. the real y
map_dbl(model$folds,
    ~vip::metric_rmse(sim_data$y[.x], model$pred[.x])
) %>% 
  {cat(paste0(round(mean(.), 5), '+', round(sd(.), 6)))}
#> 1.67524+0.106294

Created on 2022-01-18 by the reprex package (v2.0.0)

Question: is there a way to keep the convenience of xgb.cv but set aside 1 fold for testing, 1 (different) fold for estimating early_stopping_rounds, and use the remaining n-2 folds for training?

Or do I simply have to program such a procedure myself, along the lines of the sketch below? It is a bit unclear to me from the manual.
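
If hand-rolling is the answer, something like this is what I have in mind: a minimal sketch built on xgb.train, where each outer test fold gets the next fold as its early-stopping watchlist. The fold pairing and variable names are my own choices, not anything xgb.cv provides:

library(xgboost)

set.seed(1)
sim_data <- vip::gen_friedman(600)
X <- as.matrix(sim_data[, colnames(sim_data) != 'y'])
y <- sim_data$y

n_folds <- 6
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(X)))

fold_rmse <- vapply(seq_len(n_folds), function(k) {
  valid_fold <- if (k == n_folds) 1L else k + 1L  # a different fold for early stopping
  test_idx  <- which(fold_id == k)
  valid_idx <- which(fold_id == valid_fold)
  train_idx <- setdiff(seq_len(nrow(X)), c(test_idx, valid_idx))

  fit <- xgb.train(
    params = list(eta = .4),
    data = xgb.DMatrix(X[train_idx, ], label = y[train_idx]),
    nrounds = 100,
    # early stopping only ever sees the validation fold, never the test fold
    watchlist = list(valid = xgb.DMatrix(X[valid_idx, ], label = y[valid_idx])),
    early_stopping_rounds = 5,
    verbose = 0
  )
  pred <- predict(fit, xgb.DMatrix(X[test_idx, ]), ntreelimit = fit$best_ntreelimit)
  sqrt(mean((y[test_idx] - pred)^2))
}, numeric(1))

# mean +/- sd of the truly out-of-sample RMSE across folds
cat(paste0(round(mean(fold_rmse), 5), '+', round(sd(fold_rmse), 6)))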

Or is it better to first extract some data to use for early stopping (via the watchlist argument) and use the remainder for xgb.cv? How would this play out in hyperparameter tuning? Would it be better to use a fixed watchlist set for all parameter settings?

Edit: xgb.cv does not seem to take a watchlist argument. But one could leave some data out for actual testing and predict with each of the n fold models on that data.
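
A rough sketch of that edit, if I read the docs right: the cb.cv.predict(save_models = TRUE) callback is what keeps the n fold models around, and the hold-out size of 100 is an arbitrary choice of mine:

library(xgboost)
library(purrr)

set.seed(1)
sim_data <- vip::gen_friedman(600)
X <- as.matrix(sim_data[, colnames(sim_data) != 'y'])
test_rows <- sample(nrow(sim_data), 100)  # arbitrary hold-out size

dcv   <- xgb.DMatrix(X[-test_rows, ], label = sim_data$y[-test_rows])
dtest <- xgb.DMatrix(X[test_rows, ])

cv <- xgb.cv(
  params = list(eta = .4),
  data = dcv,
  nrounds = 100,
  early_stopping_rounds = 5,
  nfold = 6,
  verbose = 0,
  callbacks = list(cb.cv.predict(save_models = TRUE))  # retain the n fold models
)

# score every fold model on data that neither training nor early stopping saw
holdout_rmse <- map_dbl(cv$models,
  ~sqrt(mean((sim_data$y[test_rows] - predict(.x, dtest))^2)))
cat(paste0(round(mean(holdout_rmse), 5), '+', round(sd(holdout_rmse), 6)))

The open question for tuning would then be whether this single hold-out stays fixed across all parameter settings.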

Tags: r, xgboost, cross-validation
