tidymodels is a modern, unified framework for modeling and machine learning in R. It brings together a collection of packages designed to streamline the entire ML workflow — from data preprocessing and model fitting to evaluation and tuning — all following the tidyverse philosophy.
Instead of juggling different syntax styles across modeling packages, tidymodels standardizes everything into a consistent and modular pipeline.
1.3.1 The Modeling Workflow
A typical tidymodels pipeline includes:
Preprocessing with recipes
Define how the data should be transformed (e.g., scaling, encoding, tokenizing).
Model specification with parsnip
Declare the model (e.g., linear regression, random forest, xgboost), without fitting yet.
Model fitting with workflows
Combine preprocessing and model into a single workflow and fit to training data.
Model evaluation with yardstick
Assess model accuracy, RMSE, ROC AUC, etc., on test data or resamples.
Hyperparameter tuning with tune and dials
Automatically optimize model performance by searching over parameter grids.
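To make the division of labour between these packages concrete, here is a minimal, hedged sketch of the full pipeline on R's built-in mtcars data (not the property data used later); object names are illustrative.
Code
library(tidymodels)

# Illustrative end-to-end sketch on mtcars (not the Gumtree property data)
set.seed(1)
car_split <- initial_split(mtcars, prop = 0.8)                  # rsample: train/test split
car_rec   <- recipe(mpg ~ ., data = training(car_split)) %>%    # recipes: preprocessing
  step_normalize(all_numeric_predictors())
car_spec  <- linear_reg() %>% set_engine("lm")                  # parsnip: model specification
car_wf    <- workflow() %>%                                     # workflows: bundle recipe + model
  add_recipe(car_rec) %>%
  add_model(car_spec)
car_fit   <- fit(car_wf, data = training(car_split))            # fit on training data

predict(car_fit, testing(car_split)) %>%                        # yardstick: evaluate on test data
  bind_cols(testing(car_split)) %>%
  metrics(truth = mpg, estimate = .pred)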
We begin by preparing a subset of the Gumtree property listings dataset for a supervised regression task: predicting the sale price of houses based on their features.
Filter: We restrict the dataset to only house listings that are marked for sale.
Select: We keep only variables relevant for pricing: number of bedrooms, bathrooms, square meter size, parking availability, and price.
Bucket categorical features:
bedrooms and bathrooms are converted to character to be treated as categorical variables, and values ≥ 5 are collapsed into a "5+" bucket.
Impute: Missing values in parking are replaced with "none".
Drop missing data: Any rows with remaining NAs (e.g. in size_sqm or price) are removed.
This results in a clean, structured dataset ready for modeling.
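A hedged sketch of that cleaning pipeline is shown below. The raw object name (gumtree_raw) and the filter columns (property_type, ad_type) are assumptions about the source data, not confirmed column names.
Code
library(tidyverse)

# Hedged sketch of the cleaning steps described above (column names assumed)
data <- gumtree_raw %>%
  filter(property_type == "house", ad_type == "for sale") %>%      # houses marked for sale
  select(bedrooms, bathrooms, size_sqm, parking, price) %>%
  mutate(
    bedrooms  = if_else(bedrooms  >= 5, "5+", as.character(bedrooms)),   # bucket 5+ bedrooms
    bathrooms = if_else(bathrooms >= 5, "5+", as.character(bathrooms)),  # bucket 5+ bathrooms
    parking   = replace_na(as.character(parking), "none")               # impute missing parking
  ) %>%
  drop_na()                                                        # drop rows with remaining NAs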
2 Tidy Modeling
2.1 Setup
Most modeling problems (especially when comparing multiple models) follow nine steps:
🧾 Step 1: Setup
🧹 Step 2: Clean and Prepare the Data
🪓 Step 3: Split the Data
🥣 Step 4: Recipe (Transformation)
🔧 Step 5: Model Specs
⚙ Step 6: Workflow Set
🔁 Step 7: Resampling (CV)
🧪 Step 8: Tuning
🏆 Step 9: Compare Models
2.1.1 🪓 Step 3: Split the Data
In supervised machine learning, we must evaluate our model’s performance on data it has never seen during training. To simulate this, we split the dataset into two parts:
A training set (usually 70–80%) used to fit the model
A testing set (20–30%) used only for final performance evaluation
This mimics how the model will behave when deployed on real-world data.
Why splitting matters
If we train and evaluate the model on the same data, we risk overfitting - the model might learn patterns specific to the training set that don’t generalize. A separate testing set gives us an unbiased estimate of performance.
There are several strategies available in tidymodels:
initial_split() for a simple one-time train/test split
vfold_cv() for k-fold cross-validation (used for tuning)
bootstraps(), mc_cv() for advanced resampling strategies
We’ll use an 80/20 train/test split and remove the most extreme property prices (top 10%) to avoid model distortion from outliers.
Code
set.seed(123)

df <- data %>%
  filter(price < quantile(price, 0.90))   # remove top 10% most expensive listings

(split <- initial_split(df, prop = 0.8))  # 80/20 train-test split
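The density comparison below uses train_data and test_data; these are presumably extracted from the split in the usual way (a hedged sketch):
Code
train_data <- training(split)
test_data  <- testing(split)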
To confirm that the training and testing data have similar price distributions, we plot their density curves below:
Code
ggplot() +
  geom_density(data = train_data, aes(price), alpha = 0.5, color = "red") +
  geom_density(data = test_data,  aes(price), alpha = 0.5, color = "blue") +
  scale_x_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(
    title = "Price Distribution: Train (Red) vs Test (Blue)",
    x = "Price", y = "Density"
  )
This step lays the foundation for model training by ensuring:
The model can generalize beyond the training data
We can compare future model versions using a consistent held-out test set
2.1.2 🥣 Step 4: Feature Engineering with recipes
Before we fit a model, we need to prepare and transform the data so that it’s in a form the model can understand. The recipe() function from the recipes package lets us build a data preprocessing pipeline, which is later combined with our model.
Instead of manually modifying the training and test sets, recipes lets you declare a series of preprocessing steps once and apply them consistently during model fitting and prediction.
It handles:
Categorical encoding
Normalization or standardization
Imputation of missing values
Text tokenization
Interaction terms and basis expansions
And more…
We’ll perform two key transformations:
1. One-hot encoding for categorical variables
2. Normalization of numeric predictors
Code
housing_rec <- recipe(price ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

housing_rec
| Step | Needed for | Improves Model? | Why |
|---|---|---|---|
| step_dummy() | All models | ✅ Yes | Converts categorical variables (like bedrooms, parking) into numeric dummy variables that models can interpret |
| step_normalize() | Ridge/Lasso | ✅ Critically | Ensures all numeric predictors are on the same scale — especially important when applying penalties |
| step_normalize() | OLS | 🚫 Not required | Normalization is not necessary for ordinary least squares, but helps interpret coefficients |
| step_normalize() | Tree Models | 🚫 Not required | Tree-based models (e.g., random forests) are scale-invariant |
🧭 Explore more: The recipes package supports many more transformation steps like splines, PCA, log transforms, and imputation. View All ‘Steps’ Here
This recipe is a modular, reusable specification of how your features should be prepared for modeling. You’ll plug it into a workflow() in the next steps.
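If you want to sanity-check what the recipe will produce, one hedged option is to prep() it on the training data and preview the processed predictors:
Code
housing_rec %>%
  prep(training = train_data) %>%
  bake(new_data = NULL) %>%   # returns the processed training set
  glimpse()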
2.2 Regularization
2.2.1 🔧 Step 5: Model Specs (Single Model - Regularization)
Now that our data is ready and our recipe is defined, the next step is to specify the type of model we want to use. We’ll start with a regularized linear regression model called Elastic Net.
Elastic Net is a type of linear regression that includes regularization - it penalizes large coefficients to reduce overfitting.
It combines two popular penalties:
Ridge (L2 penalty): Shrinks coefficients smoothly - good when all predictors are useful.
Lasso (L1 penalty): Can shrink some coefficients all the way to zero - good for automatic variable selection.
Elastic Net blends these via two hyperparameters:
penalty: Controls how strong the overall regularization is.
mixture: Controls the mix between Ridge (0) and Lasso (1):
mixture = 0: Pure Ridge
mixture = 1: Pure Lasso
mixture = 0.5: Balanced blend (Elastic Net)
We use the parsnip package (loaded via tidymodels) to declare a model specification. Think of this like defining a blueprint - we’re not fitting anything yet.
Code
model_spec <- linear_reg(
  penalty = tune(),   # "tune()" tells tidymodels to find the best value later
  mixture = tune()    # Also tuning the L1/L2 mix
) %>%
  set_engine("glmnet")  # Use glmnet package under the hood
This creates a model spec that supports tuning. You can check how tidymodels translates this into engine-specific code:
Code
model_spec %>% translate()
Linear Regression Model Specification (regression)
Main Arguments:
penalty = tune()
mixture = tune()
Computational engine: glmnet
Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
alpha = tune(), family = "gaussian")
🔍 translate() shows the actual call that will be sent to the glmnet engine (e.g. glmnet::glmnet(...)).
At this point, we’ve:
Selected a regularized model suitable for numeric regression
Enabled tuning of its two core hyperparameters
Specified the engine that will handle the model fitting
Next, we’ll plug this into a workflow that combines the model and recipe together.
2.2.2 ⚙ Step 6: Building the Workflow
In tidymodels, a workflow bundles together everything needed to fit and evaluate a model:
The preprocessing recipe
The model specification
Optional: formula or variable roles
This ensures that all steps (e.g., dummy variables, normalization) are applied consistently during training, validation, and prediction.
Using a workflow() improves reproducibility and reduces errors: all transformations are tracked and applied in the correct order.
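The assembly itself is short; a minimal sketch follows (the name model_workflow matches the object used in the tuning code later on):
Code
model_workflow <- workflow() %>%
  add_recipe(housing_rec) %>%   # preprocessing from Step 4
  add_model(model_spec)         # Elastic Net spec from Step 5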
2.2.3 🔁 Step 7: Cross-Validation
To evaluate the model and tune hyperparameters, we use cross-validation (CV).
Why CV?
It avoids relying on a single train/test split.
It gives a more robust estimate of model performance.
It’s critical when we’re tuning parameters like penalty and mixture.
We’ll use 5-fold cross-validation: the training set is split into 5 parts (folds), and each fold takes a turn as the validation set while the others are used for training.
Code
(cv_folds <- vfold_cv(train_data, v = 5, repeats = 1))
vfold_cv() returns 5 sets of training/validation splits, each used once during tuning.
With the workflow and folds ready, we can now tune the hyperparameters using grid search.
2.2.4 🔍 Step 8: Hyperparameter Tuning
Now that we’ve defined our model and workflow and set up cross-validation, we move on to tuning the hyperparameters. This step helps us find the combination of penalty (regularization strength) and mixture (Ridge/Lasso balance) that minimizes prediction error.
Elastic Net has no universally good default hyperparameters - model performance depends heavily on:
penalty: controls how strongly we shrink coefficients.
mixture: determines whether we’re closer to Ridge (0) or Lasso (1).
We’ll search across combinations of both to see what performs best using 5-fold CV.
We use a regular grid to search across 25 combinations (5 values for each parameter).
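A hedged sketch of such a grid, using the default dials ranges (the object name grid_vals is illustrative):
Code
# 5 x 5 regular grid over the two tuned parameters
grid_vals <- grid_regular(
  penalty(),   # log10 scale by default
  mixture(),   # 0 = Ridge, 1 = Lasso
  levels = 5
)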
This creates a structured grid of values:
penalty values spaced logarithmically from small to large
mixture values from 0 (pure Ridge) to 1 (pure Lasso)
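The tuning run itself might look like the sketch below, assuming the model_workflow from Step 6 and cv_folds from Step 7; model_res is the object summarised in the plot that follows.
Code
model_res <- tune_grid(
  model_workflow,
  resamples = cv_folds,              # 5-fold CV
  grid      = grid_vals,             # 25 penalty/mixture combinations
  metrics   = metric_set(rmse, rsq)  # collected per combination and fold
)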
This fits the model across all grid combinations and CV folds, returning performance metrics for each.
We now inspect how penalty and mixture affect RMSE:
Code
model_res %>%
  collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean, color = as.factor(mixture))) +
  geom_point() +
  geom_line() +
  scale_x_log10(labels = scales::label_number()) +
  facet_wrap(~.metric, scales = "free_y") +
  labs(
    title = "Elastic Net Performance",
    x = "Penalty (log scale)",
    y = "Metric value",
    color = "Mixture"
  ) +
  theme_minimal()
This plot helps us visually select the best-performing configuration - the one with the lowest RMSE.
In the next step, we’ll use the best combination to train a final model on the full training set and evaluate it on the test set.
2.2.5 🧾 Step 9: Final Fit and Evaluation
After tuning, we now select the best combination of hyperparameters, retrain the model on the full training data, and evaluate its performance on the held-out test set.
We first extract the hyperparameter values that gave the lowest RMSE during cross-validation:
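A hedged sketch of this finalization (best_params, final_wf and final_fit are illustrative names):
Code
best_params <- model_res %>% select_best(metric = "rmse")

final_wf  <- model_workflow %>% finalize_workflow(best_params)
final_fit <- final_wf %>% fit(data = train_data)

final_fit %>%
  extract_fit_parsnip() %>%
  tidy()                      # intercept and coefficients (some may be shrunk to zero)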
This gives you the model’s intercept and variable coefficients (some may be zero due to Lasso shrinkage).
Sample output:
| Term | Estimate | Interpretation |
|---|---|---|
| (Intercept) | 2,810,433 | Baseline house price when all predictors are zero |
| size_sqm | 37,151 | For each additional square meter, price increases by ~R37,151 (holding all else constant) |
| bedrooms_X3 | 137,766 | A 3-bedroom house is worth ~R138k more than a 1-bedroom baseline |
| bathrooms_X2 | 533,496 | Having 2 bathrooms increases the expected price by ~R533k |
| parking_garage | 281,631 | Having a garage adds ~R282k to the value compared to the baseline parking type |
ℹ️ Note: Categorical variables are represented using dummy variables (one-hot encoding). For example, bedrooms_X3 means “3 bedrooms”, with the omitted base level likely being “1 bedroom”.
What about the penalty?
The penalty column shows the regularization value applied to each coefficient - it’s the same across all terms here (1e-10) because Elastic Net applies it globally. In your case, the model applied minimal shrinkage, meaning most coefficients were retained.
If your penalty were larger, some coefficients might have been reduced toward zero (especially for weak predictors), improving generalization.
This output allows you to interpret the economic meaning of each predictor and verify whether the relationships make sense. If you’d like, you can now plot:
Predicted vs. actual prices
Coefficient magnitudes (feature importance)
Finally, we assess how well the model generalizes using the unseen test set:
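One hedged way to produce the metrics shown below, assuming the final_fit and test_data objects from earlier:
Code
reg_metrics <- metric_set(rmse, rsq, mae)

final_fit %>%
  predict(new_data = test_data) %>%   # predictions on the held-out test set
  bind_cols(test_data) %>%
  reg_metrics(truth = price, estimate = .pred)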
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1156311.
2 rsq standard 0.418
3 mae standard 880710.
Output:
| Metric | Value | Interpretation |
|---|---|---|
| RMSE | R1,156,311 | On average, model predictions are off by ±R1.15 million |
| R² | 0.418 | About 42% of the variation in price is explained by the model |
| MAE | R880,710 | The typical absolute error (non-directional) is around R881k |
RMSE (Root Mean Squared Error): Sensitive to large errors. Indicates the model’s average error magnitude.
R² (Coefficient of Determination): Measures the proportion of variance in the target variable that is explained by the model.
A value of 0.418 means the model captures moderate structure - there’s room to improve.
MAE (Mean Absolute Error): More robust to outliers. It tells us the average error in Rand terms.
2.3 Trees
2.3.1 🌲 Step 5: Model Specification - Random Forest
Let’s now try a tree-based model: the Random Forest. This is a powerful and flexible machine learning algorithm that often performs well out of the box, especially with tabular data.
A Random Forest is an ensemble method built on top of decision trees. It works by:
Fitting many trees on bootstrap samples of the training data
Using a random subset of predictors at each split (mtry)
Averaging the predictions from all trees to reduce variance
This makes it:
Robust to overfitting
Good at handling nonlinearities and interactions
Insensitive to scaling (no need to normalize variables)
We’ll use parsnip::rand_forest() to define a regression forest model with:
trees = 100: number of trees in the forest
min_n = 10: minimum number of observations in a node before it’s split
mtry = tune(): number of predictors to randomly select at each split (this will be tuned)
Code
(
  model_spec <- rand_forest(
    mtry  = tune(),  # number of variables to try at each split (tuned)
    min_n = 10,      # control tree complexity
    trees = 100      # total trees in the forest
  ) %>%
    set_engine("ranger", num.threads = 10, importance = "permutation") %>%  # high-performance engine
    set_mode("regression")
)
Random Forest Model Specification (regression)
Main Arguments:
mtry = tune()
trees = 100
min_n = 10
Engine-Specific Arguments:
num.threads = 10
importance = permutation
Computational engine: ranger
You can inspect the low-level call with:
Code
model_spec %>% translate()
Random Forest Model Specification (regression)
Main Arguments:
mtry = tune()
trees = 100
min_n = 10
Engine-Specific Arguments:
num.threads = 10
importance = permutation
Computational engine: ranger
Model fit template:
ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
mtry = min_cols(~tune(), x), num.trees = 100, min.node.size = min_rows(~10,
x), num.threads = 10, importance = "permutation", verbose = FALSE,
seed = sample.int(10^5, 1))
Unlike linear models, Random Forests don’t require normalization or dummy encoding - they handle raw data well, especially with categorical features already converted to factors.
In the next step, we’ll plug this model into the same workflow as before, reuse our recipe, and set up cross-validation and tuning.
2.3.4 🔍 Step 8: Hyperparameter Tuning – Random Forest
Random forests can be very effective with minimal tuning, but one important parameter to optimize is mtry: the number of predictors randomly considered at each split.
If mtry is too small → each tree sees limited information → high bias.
If mtry is too large → all trees become similar → less diversity, higher variance.
The sweet spot depends on your dataset’s structure.
We define a regular grid over possible values of mtry from 2 to 10.
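A hedged sketch of the workflow, grid, and tuning run (object names are illustrative; housing_rec, cv_folds, and split are reused from the earlier steps):
Code
rf_workflow <- workflow() %>%
  add_recipe(housing_rec) %>%
  add_model(model_spec)          # the rand_forest() spec from Step 5

rf_grid <- grid_regular(mtry(range = c(2, 10)), levels = 9)   # mtry = 2, 3, ..., 10

rf_res <- tune_grid(
  rf_workflow,
  resamples = cv_folds,
  grid      = rf_grid,
  metrics   = metric_set(rmse, rsq, mae)
)

rf_final <- rf_workflow %>%
  finalize_workflow(select_best(rf_res, metric = "rmse")) %>%
  last_fit(split, metrics = metric_set(rmse, rsq, mae))   # fit on training, score on test

rf_final %>% collect_metrics()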
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1096200.
2 rsq standard 0.477
3 mae standard 808219.
This returns:
RMSE (Root Mean Squared Error)
MAE (Mean Absolute Error)
R² (Proportion of variance explained)
| Metric | Value | Interpretation |
|---|---|---|
| RMSE | R1,096,200 | On average, predictions deviate from actual prices by ±R1.1 million |
| R² | 0.477 | ~48% of the variation in price is explained by the model |
| MAE | R808,219 | The typical absolute error in prediction is around R808k |
Compared to the Elastic Net model:
✅ Lower RMSE (better)
✅ Lower MAE (better)
✅ Higher R² (better explanatory power)
This suggests that the Random Forest model outperforms the linear model in this case - likely because it better captures nonlinear interactions between size, room counts and parking type.
🧭 From here, you can continue by:
Plotting predicted vs actual values to check residual patterns
Extracting feature importance (vip::vip())
Comparing models formally side-by-side
Feature Importance with vip
After fitting the final Random Forest model, we can inspect which features had the most influence on predictions using the vip package.
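A hedged sketch, assuming the rf_final object from the tuning sketch above (the ranger engine was set up with importance = "permutation", which vip can read):
Code
library(vip)

rf_final %>%
  extract_workflow() %>%       # pull the fitted workflow out of last_fit()
  extract_fit_parsnip() %>%
  vip(num_features = 20)       # top 20 features by permutation importance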
This chart shows the top 20 most influential features, helping you interpret which variables drive predictions in your model.
2.4 Bayesian Optimisation
2.4.1 🔧 Step 5: Model Specification – Boosted Trees (XGBoost)
In this step, we define a more complex machine learning model using XGBoost - one of the most powerful and widely-used gradient boosting algorithms. It performs well on a wide range of structured (tabular) prediction tasks.
XGBoost (Extreme Gradient Boosting) builds an ensemble of decision trees sequentially, where each tree attempts to correct the mistakes of the previous ones. It supports:
Fine control over model complexity
Regularization to prevent overfitting
Fast computation, even with large datasets
Here we define a tunable boosted tree model using parsnip::boost_tree() and the xgboost engine.
Code
model_spec <- boost_tree(
  trees          = tune(),  # Total number of trees (iterations)
  tree_depth     = tune(),  # Maximum depth of each tree
  learn_rate     = tune(),  # Learning rate (shrinkage)
  loss_reduction = tune(),  # Minimum loss reduction required to make a further split
  sample_size    = tune(),  # Subsample ratio of rows
  mtry           = tune()   # Number of predictors sampled at each split (colsample_bytree)
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
What we’re tuning:
| Parameter | Meaning | Regularization? | Notes |
|---|---|---|---|
| tree_depth | Max depth of a tree | ✅ Yes | Controls complexity — deeper trees fit more |
| learn_rate | Shrinkage step size | ✅ Yes | Lower = more robust but slower |
| loss_reduction | Minimum gain to split | ✅ Yes | Corresponds to gamma in xgboost — prunes weak splits |
| sample_size | Row subsampling rate (0–1) | ✅ Yes | Helps prevent overfitting |
| mtry | Number of predictors randomly selected per tree | ✅ Yes | Reduces correlation between trees |
| trees | Total number of boosting rounds | No | Tune later if needed for final model |
This setup gives XGBoost the flexibility to learn nonlinear boundaries while controlling for overfitting using multiple regularization mechanisms.
📌 You’ll tune this model using tune_bayes() in upcoming steps - be mindful of runtime since this grid has high dimensionality.
Given the number of hyperparameters and resampling iterations in our Bayesian tuning, it’s efficient to run the process in parallel across multiple CPU cores. This can significantly reduce training time.
Why use parallelism?
Each fold x parameter combination is an independent model fit, so it’s ideal for parallel execution. The tune package supports this via the doFuture and future ecosystems.
Code
library(doFuture)   # Backend for parallel model training
registerDoFuture()  # Registers the parallel backend
plan(multisession)  # Launches a separate R session per core
💡 multisession is safe and works across platforms (Windows, macOS, Linux).
On Linux/macOS, plan(multicore) is slightly faster but not supported on Windows.
Once this setup is active, all tune_bayes(), tune_grid(), and fit_resamples() calls will automatically distribute work across available CPU cores.
You can check how many are being used with:
Code
future::availableCores()
system
16
Or manually set the number of workers:
Code
plan(multisession, workers = 4)  # Use 4 cores explicitly
Now you’re ready to run large-scale tuning efficiently!
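The tuning call below refers to model_workflow and param_final, which are not shown above; a hedged sketch of how they might be constructed, reusing the recipe and the XGBoost spec from earlier:
Code
model_workflow <- workflow() %>%
  add_recipe(housing_rec) %>%
  add_model(model_spec)

param_final <- model_workflow %>%
  extract_parameter_set_dials() %>%
  finalize(train_data %>% select(-price))   # mtry's upper bound depends on the predictors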
Code
model_bayes <- tune_bayes(
  model_workflow,
  resamples  = cv_folds,           # 5-fold CV folds
  param_info = param_final,        # Parameter space (use finalize())
  metrics    = metric_set(rmse),   # Regression metric
  initial    = 10,                 # Start with 10 random configurations
  iter       = 20,                 # Then refine with 20 adaptive steps
  control    = control_bayes(
    verbose      = TRUE,
    time_limit   = 5,              # 5 minutes
    verbose_iter = TRUE,
    no_improve   = 5               # Early stopping after 5 bad iterations
  )
)
| Argument | Purpose |
|---|---|
| initial = 10 | Try 10 random combinations first to “seed” the surrogate model |
| iter = 20 | Then perform 20 smart iterations based on model feedback |
| no_improve | Stop early if no improvement in the last 5 steps |
| param_info | The parameter ranges to explore — use parameters() or finalize() to define these |
| metrics | Evaluation metric — here we’re minimizing RMSE |
📦 tune_bayes() is part of the tune package and requires that your parameter set (param_final) be created with extract_parameter_set_dials() or refined with finalize().
Code
model_bayes %>% select_best(metric = "rmse")
The table below shows the optimal combination of hyperparameters selected by Bayesian optimization:
| mtry | trees | tree_depth | learn_rate | loss_reduction | sample_size | .config |
|---|---|---|---|---|---|---|
| 2 | 1700 | 14 | 0.0130 | 0.00125 | 0.355 | Iter1 |
Interpretation of parameters
| Parameter | Value | Meaning |
|---|---|---|
| mtry | 2 | Randomly sample 2 predictors at each tree split |
| trees | 1700 | Model was trained over 1700 boosting rounds |
| tree_depth | 14 | Each tree can grow up to 14 levels deep |
| learn_rate | 0.013 | Low learning rate (slow but steady improvements) |
| loss_reduction | 0.00125 | Minimum gain required to split - small value allows deeper exploration |
| sample_size | 0.355 | Only ~35% of training data rows were used in each boosting iteration |
| .config | Iter1 | Configuration ID in the tuning process |
This model favors many deep trees with a low learning rate, which is a typical signature of a strong gradient boosting model.
Next, you’ll finalize the model using these values and fit it to your full training data.
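A hedged sketch of that final step, reusing the original train/test split (object names are illustrative):
Code
best_xgb <- select_best(model_bayes, metric = "rmse")

xgb_final <- model_workflow %>%
  finalize_workflow(best_xgb) %>%
  last_fit(split, metrics = metric_set(rmse, rsq, mae))   # refit on training, score on test

xgb_final %>% collect_metrics()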
2.5 Multiple Models
So far we’ve trained models one by one. Now let’s take it a step further and test multiple models in parallel using the workflowsets package. This lets us compare performance across different modeling approaches in a clean and organized way.
We’ll compare:
Ordinary Least Squares (OLS) - no regularization
Elastic Net - regularized linear model (with tuning)
Random Forest - nonlinear tree ensemble
2.5.1 🔧 Step 5: Model Spec
We first declare our model blueprints using parsnip. These are not fitted yet.
We’re using a fixed Random Forest (no tuning), a fully tunable Elastic Net, and a baseline OLS model.
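A hedged sketch of the four specifications referenced in the workflow_set() call below (including the XGBoost spec that also appears there); the exact arguments, such as the fixed mtry, tree counts, and learning rate, are illustrative assumptions.
Code
ols_spec <- linear_reg() %>%
  set_engine("lm")

elastic_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

rf_spec <- rand_forest(mtry = 5, trees = 100, min_n = 10) %>%
  set_engine("ranger") %>%
  set_mode("regression")

xgboost_spec <- boost_tree(trees = 500, learn_rate = 0.05) %>%
  set_engine("xgboost") %>%
  set_mode("regression")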
2.5.2 ⚙ Step 6: Building the workflow_set()
We now create a single object containing all model–recipe combinations:
Code
(
  wf_set <- workflow_set(
    preproc = list(housing = housing_rec),   # one shared recipe
    models  = list(
      ols          = ols_spec,
      RF           = rf_spec,
      xgboost      = xgboost_spec,
      elastic_spec = elastic_spec            # list name becomes the workflow id
    )
  )
)
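The plot below summarises an object called rsq_summary, which is not shown above; a hedged sketch of how the workflow set might be fitted and summarised to produce it (wf_res and rsq_summary are illustrative names):
Code
wf_res <- wf_set %>%
  workflow_map(
    "tune_grid",                 # tunes where needed, resamples otherwise
    resamples = cv_folds,
    grid      = 10,
    metrics   = metric_set(rmse, rsq, mae),
    verbose   = TRUE
  )

rsq_summary <- wf_res %>%
  collect_metrics() %>%
  filter(.metric == "rsq") %>%
  group_by(wflow_id) %>%
  slice_max(mean, n = 1, with_ties = FALSE) %>%   # best configuration per model
  ungroup() %>%
  transmute(wflow_id, mean_rsq = mean, std_err)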
Code
ggplot(rsq_summary, aes(x = reorder(wflow_id, mean_rsq), y = mean_rsq)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(
    aes(ymin = mean_rsq - std_err, ymax = mean_rsq + std_err),
    width = 0.2, color = "darkgray"
  ) +
  labs(
    title = "Model Comparison by R²",
    x = "Model",
    y = "Mean R²"
  ) +
  theme_minimal()
A model with higher R² and narrow error bars indicates both strong and stable performance across folds.
Use this visualization alongside RMSE/MAE comparisons to determine:
Which model fits the data best overall
Whether gains in predictive power justify model complexity