1 year ago

#74400


Leonhard Geisler

How to correctly set seeds in a tidymodels workflow?

Setting seeds to retrieve reproducible results is crucial in machine learning. To do so, I am using the set.seed() function inside my tidymodels workflow.
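As a baseline, the behaviour that set.seed() is expected to provide can be checked in base R: resetting the seed to the same value immediately before the same random call should reproduce the same draws. A minimal sketch (the variable names are illustrative only and not part of the workflow):

```r
# Reset the RNG to the same state before each call
set.seed(1972)
draw_a <- sample(1:100, size = 5)

set.seed(1972)
draw_b <- sample(1:100, size = 5)

# Without resetting, the RNG simply continues from its current state
draw_c <- sample(1:100, size = 5)

identical(draw_a, draw_b)  # TRUE: same seed, same call, same draws
```

Any random call that runs between set.seed() and the step of interest advances the RNG state, so the seed only pins down a step if the code executed in between is also identical from run to run.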

However, I am still getting slightly different results from run to run with the following workflow (a classification task predicting banking customer churn):

# split data
set.seed(seed = 1972) 
train_test_split <-
  rsample::initial_split(
    data = data,     
    prop = 0.80   
  ) 
train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing() 

After this I create and bake a recipe. Then I create the cross-validation resamples:

# cv
set.seed(123)
folds <-
  recipes::bake(
    recipe,
    new_data = training(train_test_split)
  ) %>%
  rsample::vfold_cv(v = 10)

Now I specify the model, a Latin hypercube grid for tuning, and a workflow object. Then I start the tuning process:

# xgboost model
xgb_spec <- boost_tree(
  trees = tune(), 
  tree_depth = tune(), min_n = tune(), 
  loss_reduction = tune(),                     ## first three: model complexity
  sample_size = tune(), mtry = tune(),         ## randomness
  learn_rate = tune()#,                        ## step size
  #stop_iter = tune()
) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

# hypercube
xgb_grid <- grid_latin_hypercube(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train_baked),
  learn_rate(),
  #stop_iter(range = c(10L,100L)),
  size = 30
)

# wfl
xgboost_wf <- 
  workflows::workflow() %>%
  add_model(xgb_spec) %>% 
  add_formula(Exited ~ .)

# tune
cores <- parallel::detectCores(logical = FALSE)
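cl <- parallel::makePSOCKcluster(cores)   # one worker process per physical core
doParallel::registerDoParallel(cl)        # register the cluster as the foreach backend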
set.seed(234)
xgboost_tuned <- tune::tune_grid(
  object = xgboost_wf,
  resamples = folds,
  grid = xgb_grid,
  control = tune::control_grid(verbose = TRUE)
)
parallel::stopCluster(cl)
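One general point about the parallel setup: set.seed() in the main session sets the RNG state of that session only, while the workers of a PSOCK cluster are separate R processes with their own RNG state. Base R's parallel package provides clusterSetRNGStream() to seed them. The sketch below illustrates only this general behaviour on a small two-worker cluster; whether tune_grid() depends on the workers' RNG state or handles seeding itself is not established here and should be checked against the tune documentation. The helper draw_on_workers() is hypothetical.

```r
library(parallel)

# Small illustrative cluster (two workers), independent of the workflow
cl <- makePSOCKcluster(2)

# Hypothetical helper: draw one random number on each worker
draw_on_workers <- function(cl) {
  unlist(parLapply(cl, 1:2, function(i) runif(1)))
}

# Seed the workers' RNG streams, draw, then reseed and draw again
clusterSetRNGStream(cl, iseed = 234)
first <- draw_on_workers(cl)
clusterSetRNGStream(cl, iseed = 234)
second <- draw_on_workers(cl)

stopCluster(cl)

identical(first, second)  # TRUE: the worker streams were reset to the same state
```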

Once the model is tuned, I choose the best parameters, finalize the model, evaluate the model's performance and feature importance, predict, and adjust the classification threshold via Youden's index.

Is this related to the Latin hypercube, or have I misunderstood how the sampling functions work?
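One way to narrow down which step introduces the variation is to run each seeded step twice and compare the results. The following is a minimal base R sketch under stated assumptions: is_reproducible() and toy_split() are hypothetical names introduced for illustration, and toy_split() merely stands in for a random step such as the data split.

```r
# Hypothetical helper: run a seeded step twice and compare the results
is_reproducible <- function(step, seed) {
  set.seed(seed)
  first <- step()
  set.seed(seed)
  second <- step()
  identical(first, second)
}

# Stand-in for a random step such as a train/test split
toy_split <- function() sample(seq_len(100), size = 80)

is_reproducible(toy_split, seed = 1972)  # TRUE
```

In the workflow above, step could wrap the split, the fold creation, the grid creation, or the tuning call, one at a time. For complex objects it may be more robust to compare a derived quantity (for example row indices or collected metrics) rather than the whole object, since stored timings or other metadata could differ even when the random draws do not.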

r

machine-learning

cross-validation

sampling

tidymodels
