How to correctly set seeds in a tidymodels workflow?

1 year ago

#74400

Leonhard Geisler

Setting seeds to obtain reproducible results is crucial in machine learning. To do so, I call set.seed() at several points in my tidymodels workflow.
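
As a minimal base R sketch of the behaviour I am relying on (toy draws only, not part of my actual pipeline): resetting the same seed right before a random call should reproduce the draw exactly.

```r
# Same seed immediately before the call gives the same draw.
set.seed(1972)
a <- sample(1:100, 5)

set.seed(1972)
b <- sample(1:100, 5)

identical(a, b)  # TRUE

# A further call without resetting the seed continues the random stream,
# so its result depends on everything that consumed random numbers before it.
next_draw <- runif(5)
```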

However, I am still getting slightly different results for the following workflow (a classification task predicting bank customer churn):

# split data
set.seed(seed = 1972) 
train_test_split <-
  rsample::initial_split(
    data = data,     
    prop = 0.80   
  ) 
train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing()

After this I create and bake a recipe. Then I create the cross-validation resamples:

# cv
set.seed(123)
folds <-
  recipes::bake(
    recipe,
    new_data = training(train_test_split)
  ) %>%
  rsample::vfold_cv(v = 10)

Now I specify the model, a Latin hypercube grid for tuning, and a workflow object. Then I start the tuning process:

# xgboost model
xgb_spec <- boost_tree(
  trees = tune(), 
  tree_depth = tune(), min_n = tune(), 
  loss_reduction = tune(),                     ## first four: model complexity
  sample_size = tune(), mtry = tune(),         ## randomness
  learn_rate = tune()#,                        ## step size
  #stop_iter = tune()
) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

# hypercube
xgb_grid <- grid_latin_hypercube(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train_baked),
  learn_rate(),
  #stop_iter(range = c(10L,100L)),
  size = 30
)

# wfl
xgboost_wf <- 
  workflows::workflow() %>%
  add_model(xgb_spec) %>% 
  add_formula(Exited ~ .)

# tune
cores <- parallel::detectCores(logical = FALSE)
cl <- parallel::makePSOCKcluster(cores)
doParallel::registerDoParallel(cl)
set.seed(234)
xgboost_tuned <- tune::tune_grid(
  object = xgboost_wf,
  resamples = folds,
  grid = xgb_grid,
  control = tune::control_grid(verbose = TRUE)
)
parallel::stopCluster(cl)
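
One thing I am unsure about is whether a set.seed() call in the main session reaches the PSOCK workers at all. In plain base R, each worker is a separate R process with its own random number state, and parallel::clusterSetRNGStream() is the documented way to make worker draws reproducible. The sketch below illustrates only that base R behaviour; I have not verified how tune_grid() handles worker seeds internally, and one_draw() and draw_on_workers() are hypothetical helper names.

```r
library(parallel)

# Hypothetical task: each call draws one uniform random number on a worker.
one_draw <- function(i) runif(1)

# Start two workers, seed their RNG streams, and collect one draw per worker.
draw_on_workers <- function(seed) {
  cl <- makePSOCKcluster(2)
  on.exit(stopCluster(cl))
  # Assign each worker its own reproducible L'Ecuyer-CMRG stream.
  clusterSetRNGStream(cl, iseed = seed)
  unlist(parLapply(cl, 1:2, one_draw))
}

first  <- draw_on_workers(234)
second <- draw_on_workers(234)
identical(first, second)
```

If the two vectors match here but the tuning results still differ, the worker seeds are probably not the only source of variation.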

When the model is tuned, I choose the best parameters, finalize the model, evaluate the model's performance and feature importance, predict, and adjust the classification threshold using Youden's index.

Does this have to do with the Latin hypercube, or have I misunderstood how the sampling functions work?
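
To narrow down where two runs diverge, one option would be to keep the output of each stage (split, folds, grid, tuning metrics) from both runs and compare them stage by stage. A minimal base R sketch of that comparison follows; differing_stages() is a hypothetical helper and the two lists are toy stand-ins, not my real objects.

```r
# Return the names of the stages whose outputs differ between two runs.
# Each run is a named list with one element per pipeline stage.
differing_stages <- function(run_a, run_b) {
  stages <- intersect(names(run_a), names(run_b))
  same <- vapply(
    stages,
    function(s) isTRUE(all.equal(run_a[[s]], run_b[[s]])),
    logical(1)
  )
  stages[!same]
}

# Toy stand-ins for two runs of the same script.
run_a <- list(split = 1:5, folds = c(1, 2, 1, 2, 1), grid = c(0.10, 0.20))
run_b <- list(split = 1:5, folds = c(1, 2, 1, 2, 1), grid = c(0.10, 0.25))

differing_stages(run_a, run_b)  # "grid"
```

The first stage reported as different would be the place to look for an unseeded or order-dependent random step.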

machine-learning

cross-validation

sampling

tidymodels

0 Answers
