1 year ago
#74400

Leonhard Geisler
How to correctly set seeds in a tidymodels workflow?
Setting seeds to get reproducible results is crucial in machine learning, so I call set.seed() at several points in my tidymodels workflow.
However, I still get slightly different results between runs of the following workflow (a churn classification task for banking customers):
# split data
library(tidymodels)  # attaches rsample, recipes, parsnip, dials, workflows, tune
set.seed(seed = 1972)
train_test_split <-
  rsample::initial_split(
    data = data,
    prop = 0.80
  )
train_tbl <- train_test_split %>% training()
test_tbl <- train_test_split %>% testing()
After this, I create and bake a recipe. Then I create the cross-validation resamples:
# cv
set.seed(123)
folds <-
  recipes::bake(
    recipe,
    new_data = training(train_test_split)
  ) %>%
  rsample::vfold_cv(v = 10)
Now I specify the model, a Latin hypercube grid for tuning, and a workflow object. Then I start the tuning process:
# xgboost model
xgb_spec <- boost_tree(
  trees = tune(),
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(),               ## first three: model complexity
  sample_size = tune(), mtry = tune(),   ## randomness
  learn_rate = tune()#,                  ## step size
  #stop_iter = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
# hypercube
xgb_grid <- grid_latin_hypercube(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train_baked),
  learn_rate(),
  #stop_iter(range = c(10L,100L)),
  size = 30
)
# wfl
xgboost_wf <-
  workflows::workflow() %>%
  add_model(xgb_spec) %>%
  add_formula(Exited ~ .)
# tune
library(doParallel)  # provides registerDoParallel(); also attaches parallel
cores <- parallel::detectCores(logical = FALSE)
cl <- makePSOCKcluster(cores)
registerDoParallel(cl)
set.seed(234)
xgboost_tuned <- tune::tune_grid(
  object = xgboost_wf,
  resamples = folds,
  grid = xgb_grid,
  control = tune::control_grid(verbose = TRUE)
)
stopCluster(cl)
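One point that may be relevant to the parallel step, offered as a sketch rather than a diagnosis: PSOCK workers are separate R processes, so set.seed() in the main session does not set their random number state. The base R example below involves no tidymodels code; the two-worker cluster and the seed value are illustrative. It uses parallel::clusterSetRNGStream() to give the workers reproducible streams. Whether tune_grid() needs this, or already manages worker seeds itself, is not established here.

```r
library(parallel)

# Two local worker processes (illustrative; any count works)
cl <- makePSOCKcluster(2)

# Seed the workers' RNG streams, then draw one number on each worker
clusterSetRNGStream(cl, iseed = 234)
draws_a <- parLapply(cl, 1:2, function(i) runif(1))

# Re-seed the workers with the same value and repeat the draw
clusterSetRNGStream(cl, iseed = 234)
draws_b <- parLapply(cl, 1:2, function(i) runif(1))

stopCluster(cl)

# Compare the two sets of draws
same_draws <- identical(draws_a, draws_b)
print(same_draws)
```

Dropping the two clusterSetRNGStream() calls from this sketch is a quick way to see how the worker draws behave without explicit seeding.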
Once the model is tuned, I choose the best parameters, finalize the model, evaluate the model's performance and feature importance, predict, and adjust the classification threshold via Youden's index.
Is this related to the Latin hypercube grid, or have I misunderstood how the sampling functions work?
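A minimal sketch for checking the Latin hypercube part in isolation, assuming the dials package is installed. The grid in the workflow above is created without a set.seed() call directly before it, so its contents depend on whatever the RNG state happens to be at that point. The seed value, the two parameters, and the grid size here are illustrative and not taken from the original workflow.

```r
library(dials)

# Draw the same small Latin hypercube grid twice, re-seeding before each draw
set.seed(345)  # illustrative seed
grid_a <- grid_latin_hypercube(trees(), learn_rate(), size = 5)

set.seed(345)
grid_b <- grid_latin_hypercube(trees(), learn_rate(), size = 5)

# Draw a third grid without re-seeding
grid_c <- grid_latin_hypercube(trees(), learn_rate(), size = 5)

# Compare the seeded pair, then a seeded grid against the unseeded one
same_seeded <- isTRUE(all.equal(grid_a, grid_b))
same_unseeded <- isTRUE(all.equal(grid_b, grid_c))
print(c(seeded = same_seeded, unseeded = same_unseeded))
```

If the seeded pair matches and the unseeded grid does not, a set.seed() call immediately before grid_latin_hypercube() in the workflow would be one thing to try. Running the script in pieces rather than top to bottom can otherwise leave the RNG in a different state each time the grid is built.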
r
machine-learning
cross-validation
sampling
tidymodels