Practical - Supervised ML

1 Libraries

See https://keras3.posit.co/articles/getting_started.html

And https://blogs.rstudio.com/ai/posts/2017-12-20-time-series-forecasting-with-recurrent-neural-networks/

1.1 Basic setup

Code
library(tidyverse)
library(knitr)
library(glue)
library(bertheme)
library(tidymodels)
library(modeltime)

knitr::opts_chunk$set(
  echo = TRUE,
  eval = TRUE,
  message = FALSE,
  warning = FALSE,
  fig.width = 6,
  fig.height = 5,
  fig.align='center',
  cache = FALSE
  )

1.2 Functions

1.3 Tidymodels Package

tidymodels is a modern, unified framework for modeling and machine learning in R. It brings together a collection of packages designed to streamline the entire ML workflow — from data preprocessing and model fitting to evaluation and tuning - all using the tidyverse philosophy.

Instead of juggling different syntax styles across modeling packages, tidymodels standardizes everything into a consistent and modular pipeline.

1.3.1 The Modeling Workflow

A typical tidymodels pipeline includes the following steps (a compact end-to-end sketch follows the list):

  1. Preprocessing with recipes
    Define how the data should be transformed (e.g., scaling, encoding, tokenizing).

  2. Model specification with parsnip
    Declare the model (e.g., linear regression, random forest, xgboost), without fitting yet.

  3. Model fitting with workflows
    Combine preprocessing and model into a single workflow and fit to training data.

  4. Model evaluation with yardstick
    Assess model accuracy, RMSE, ROC AUC, etc., on test data or resamples.

  5. Hyperparameter tuning with tune and dials
    Automatically optimize model performance by searching over parameter grids.
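
To see how these five steps hang together before we work through them on real data, here is a compact sketch using the built-in mtcars data. Object names carry an _mt suffix so they do not clash with anything created later; tuning with tune/dials is left out here and covered below.

Code
# Minimal sketch of the tidymodels pipeline (assumes library(tidymodels) is loaded)
set.seed(1)
split_mt <- initial_split(mtcars, prop = 0.8)                # rsample: train/test split
rec_mt   <- recipe(mpg ~ ., data = training(split_mt)) %>%   # recipes: preprocessing
  step_normalize(all_numeric_predictors())
spec_mt  <- linear_reg() %>% set_engine("lm")                # parsnip: model specification
fit_mt   <- workflow() %>%                                   # workflows: bundle recipe + model
  add_recipe(rec_mt) %>%
  add_model(spec_mt) %>%
  fit(data = training(split_mt))                             # fit on the training set
predict(fit_mt, testing(split_mt)) %>%                       # yardstick: evaluate predictions
  bind_cols(testing(split_mt)) %>%
  metrics(truth = mpg, estimate = .pred)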

1.3.2 📦 Load the Core Packages

Code
# install.packages("tidymodels")
# install.packages("bonsai")
library(tidymodels)
library(bonsai)

💡 This loads the key packages in the ecosystem, including:

  • recipes: preprocessing
  • parsnip: model specification
  • workflows: pipelining
  • rsample: data splitting
  • yardstick: performance metrics
  • tune: tuning and resampling

The list of ALL models can be found here: https://www.tidymodels.org/find/parsnip/

1.4 Data

We begin by preparing a subset of the Gumtree property listings dataset for a supervised regression task: predicting the sale price of houses based on their features.

Code
df <- read_csv("data/gumtree_clean.zip")
data <- df %>% filter(type == 'sales', 
              dwelling_type == "house") %>% 
  select(bedrooms, bathrooms, size_sqm, 
         parking, price) %>% 
  mutate(across(contains('rooms'), 
                ~case_when(
                  .x < 5 ~ as.character(.x), 
                  TRUE ~ '5+'))) %>% 
  mutate(parking = ifelse(is.na(parking), 
                          'none', parking)) %>% 
  drop_na()

What this does:

  • Filter: We restrict the dataset to only house listings that are marked for sale.
  • Select: We keep only variables relevant for pricing: number of bedrooms, bathrooms, square meter size, parking availability, and price.
  • Bucket categorical features:
    • bedrooms and bathrooms are converted to character to be treated as categorical variables, and values ≥ 5 are collapsed into a "5+" bucket.
  • Impute: Missing values in parking are replaced with "none".
  • Drop missing data: Any rows with remaining NAs (e.g. in size_sqm or price) are removed.

This results in a clean, structured dataset ready for modeling.

2 Tidy Modeling

2.1 Setup

There are 9 steps to most modeling problems (especially when comparing multiple models):

🧾 Step 1: Setup

🧹 Step 2: Clean and Prepare the Data

🪓 Step 3: Split the Data

🥣 Step 4: Recipe (Transformation)

🔧 Step 5: Model Specs

⚙ Step 6: Workflow Set

🔁 Step 7: Resampling (CV)

🧪 Step 8: Tuning

🏆 Step 9: Compare Models

2.1.1 🪓 Step 3: Split the Data

In supervised machine learning, we must evaluate our model’s performance on data it has never seen during training. To simulate this, we split the dataset into two parts:

  • A training set (usually 70–80%) used to fit the model
  • A testing set (20–30%) used only for final performance evaluation

This mimics how the model will behave when deployed on real-world data.

Why splitting matters

If we train and evaluate the model on the same data, we risk overfitting - the model might learn patterns specific to the training set that don’t generalize. A separate testing set gives us an unbiased estimate of performance.

There are several strategies available in tidymodels (each is sketched briefly after this list):

  • initial_split() for a simple one-time train/test split
  • vfold_cv() for k-fold cross-validation (used for tuning)
  • bootstraps(), mc_cv() for advanced resampling strategies
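
For orientation, here is a brief sketch of how each of these is called on the df object created above. The prop, v, and times values are illustrative defaults, not the settings used later in this practical.

Code
# Illustrative calls only; none of these objects are used below
set.seed(123)
initial_split(df, prop = 0.8)        # single 80/20 train/test split
vfold_cv(df, v = 5)                  # 5-fold cross-validation
bootstraps(df, times = 25)           # 25 bootstrap resamples
mc_cv(df, prop = 0.8, times = 10)    # Monte Carlo CV: 10 random 80/20 splits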

We’ll use an 80/20 train/test split and remove the most extreme property prices (top 10%) to avoid model distortion from outliers.

Code
set.seed(123)

df <- data %>%
  filter(price < quantile(price, 0.90))  # remove top 10% most expensive listings

(split <- initial_split(df, prop = 0.8))  # 80/20 train-test split
<Training/Testing/Total>
<6025/1507/7532>
Code
(train_data <- training(split))
# A tibble: 6,025 × 5
   bedrooms bathrooms size_sqm parking   price
   <chr>    <chr>        <dbl> <chr>     <dbl>
 1 3        2              525 garage  3600000
 2 3        3              243 garage  4350000
 3 4        2              510 garage  3200000
 4 2        1              310 none     299000
 5 4        3              595 garage  2550000
 6 4        3              300 garage  4690000
 7 4        4              585 garage  5599000
 8 3        2              210 garage  2975000
 9 1        1               72 none    1021400
10 5+       3              952 garage  2895000
# ℹ 6,015 more rows
Code
(test_data  <- testing(split))
# A tibble: 1,507 × 5
   bedrooms bathrooms size_sqm parking   price
   <chr>    <chr>        <dbl> <chr>     <dbl>
 1 5+       4              306 garage  7249000
 2 3        3              194 garage  6600000
 3 4        3             1200 garage  4300000
 4 4        3              933 none    5999995
 5 4        4              756 garage  6500000
 6 3        2              366 garage  5750000
 7 4        3              496 garage  4595000
 8 4        4              197 none    6995000
 9 3        3              200 garage  3995000
10 4        1              108 none    4165000
# ℹ 1,497 more rows

Visualizing the Train/Test Split

To confirm that the training and testing data have similar price distributions, we plot their density curves below:

Code
ggplot() + 
  geom_density(data = train_data, aes(price), 
               alpha = 0.5, color = "red") + 
  geom_density(data = test_data, aes(price), 
               alpha = 0.5, color = "blue") + 
  scale_x_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Price Distribution: Train (Red) vs Test (Blue)",
       x = "Price", y = "Density")

This step lays the foundation for model training by ensuring:

  • The model can generalize beyond the training data
  • We can compare future model versions using a consistent held-out test set

2.1.2 🥣 Step 4: Feature Engineering with recipes

Before we fit a model, we need to prepare and transform the data so that it’s in a form the model can understand. The recipe() function from the recipes package lets us build a data preprocessing pipeline, which is later combined with our model.

Instead of manually modifying the training and test sets, recipes lets you declare a series of preprocessing steps once and apply them consistently during model fitting and prediction.

It handles:

  • Categorical encoding
  • Normalization or standardization
  • Imputation of missing values
  • Text tokenization
  • Interaction terms and basis expansions
  • And more…

We’ll perform two key transformations:

  1. One-hot encoding for categorical variables
  2. Normalization of numeric predictors

Code
housing_rec <- recipe(price ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

housing_rec
| Step | Needed for | Improves model? | Why |
|------|------------|-----------------|-----|
| step_dummy() | All models | ✅ Yes | Converts categorical variables (like bedrooms, parking) into numeric dummy variables that models can interpret |
| step_normalize() | Ridge/Lasso | ✅ Critical | Ensures all numeric predictors are on the same scale, which is especially important when applying penalties |
| step_normalize() | OLS | 🚫 Not required | Normalization is not necessary for ordinary least squares, but helps interpret coefficients |
| step_normalize() | Tree models | 🚫 Not required | Tree-based models (e.g., random forests) are scale-invariant |

🧭 Explore more: The recipes package supports many more transformation steps like splines, PCA, log transforms, and imputation; see the step function reference on the tidymodels website for the full list.

This recipe is a modular, reusable specification of how your features should be prepared for modeling. You’ll plug it into a workflow() in the next steps.
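
Although the workflow will prep and apply this recipe automatically, it can be helpful to preview what the preprocessed data will look like. A quick inspection sketch, reusing train_data from the split above:

Code
# Estimate the recipe on the training data and return the transformed training set
housing_rec %>%
  prep(training = train_data) %>%   # learns dummy levels, means and SDs
  bake(new_data = NULL) %>%         # applies those transformations to the training data
  glimpse()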


2.2 Regularization

2.2.1 🔧 Step 5: Model Specs (Single Model - Regularization)

Now that our data is ready and our recipe is defined, the next step is to specify the type of model we want to use. We’ll start with a regularized linear regression model called Elastic Net.

Elastic Net is a type of linear regression that includes regularization - it penalizes large coefficients to reduce overfitting.

It combines two popular penalties:

  • Ridge (L2 penalty): Shrinks coefficients smoothly - good when all predictors are useful.
  • Lasso (L1 penalty): Can shrink some coefficients all the way to zero - good for automatic variable selection.

Elastic Net blends these via two hyperparameters (the objective function is written out after this list):

  • penalty: Controls how strong the overall regularization is.
  • mixture: Controls the mix between Ridge (0) and Lasso (1):
    • mixture = 0: Pure Ridge
    • mixture = 1: Pure Lasso
    • mixture = 0.5: Balanced blend (Elastic Net)
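
For reference, the objective glmnet minimizes can be written roughly as below, with λ the penalty and α the mixture; this is the standard glmnet formulation, shown here only to make the two knobs concrete.

$$
\min_{\beta_0,\beta} \; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top\beta\right)^2
+ \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
$$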

We use the parsnip package (loaded via tidymodels) to declare a model specification. Think of this like defining a blueprint - we’re not fitting anything yet.

Code
model_spec <- linear_reg(
  penalty = tune(),     # "tune()" tells tidymodels to find the best value later
  mixture = tune()      # Also tuning the L1/L2 mix
) %>%
  set_engine("glmnet")   # Use glmnet package under the hood

This creates a model spec that supports tuning. You can check how tidymodels translates this into engine-specific code:

Code
model_spec %>% translate()
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = tune()

Computational engine: glmnet 

Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    alpha = tune(), family = "gaussian")

🔍 translate() shows the actual call that will be sent to the glmnet engine (e.g. glmnet::glmnet(...)).

At this point, we’ve:

  • Selected a regularized model suitable for numeric regression
  • Enabled tuning of its two core hyperparameters
  • Specified the engine that will handle the model fitting

Next, we’ll plug this into a workflow that combines the model and recipe together.

2.2.2 ⚙ Step 6: Building the Workflow

In tidymodels, a workflow bundles together everything needed to fit and evaluate a model:

  • The preprocessing recipe
  • The model specification
  • Optional: formula or variable roles

This ensures that all steps (e.g., dummy variables, normalization) are applied consistently during training, validation, and prediction.

Code
model_workflow <- workflow() %>% 
  add_recipe(housing_rec) %>%     
  add_model(model_spec)           
  • Using a workflow() improves reproducibility and reduces errors: all transformations are tracked and applied in the correct order.

2.2.3 🔁 Step 7: Cross-Validation

To evaluate the model and tune hyperparameters, we use cross-validation (CV).

Why CV?

  • It avoids relying on a single train/test split.
  • It gives a more robust estimate of model performance.
  • It’s critical when we’re tuning parameters like penalty and mixture.

We’ll use 5-fold cross-validation: the training set is split into 5 parts (folds), and each fold takes a turn as the validation set while the others are used for training.

Code
(cv_folds <- vfold_cv(train_data, v = 5, repeats = 1))
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits              id   
  <list>              <chr>
1 <split [4820/1205]> Fold1
2 <split [4820/1205]> Fold2
3 <split [4820/1205]> Fold3
4 <split [4820/1205]> Fold4
5 <split [4820/1205]> Fold5
  • vfold_cv() returns 5 sets of training/validation splits, each used once during tuning.

With the workflow and folds ready, we can now tune the hyperparameters using grid search.

2.2.4 🔍 Step 8: Hyperparameter Tuning

Now that we’ve defined our model and workflow and set up cross-validation, we move on to tuning the hyperparameters. This step helps us find the combination of penalty (regularization strength) and mixture (Ridge/Lasso balance) that minimizes prediction error.

Elastic Net performance depends heavily on its two hyperparameters, which have no universally good defaults:

  • penalty: controls how strongly we shrink coefficients.
  • mixture: determines whether we’re closer to Ridge (0) or Lasso (1).

We’ll search across combinations of both to see what performs best using 5-fold CV.

  • We use a regular grid to search across 25 combinations (5 values for each parameter).
Code
(grid_vals <- grid_regular(
  penalty(), 
  mixture(), 
  levels = c(penalty = 5, mixture = 5)
))
# A tibble: 25 × 2
        penalty mixture
          <dbl>   <dbl>
 1 0.0000000001    0   
 2 0.0000000316    0   
 3 0.00001         0   
 4 0.00316         0   
 5 1               0   
 6 0.0000000001    0.25
 7 0.0000000316    0.25
 8 0.00001         0.25
 9 0.00316         0.25
10 1               0.25
# ℹ 15 more rows

This creates a structured grid of values:

  • penalty values spaced logarithmically from small to large
  • mixture values from 0 (pure Ridge) to 1 (pure Lasso)

Code
(
  model_res <- model_workflow %>% 
    tune_grid(
      resamples = cv_folds,
      grid = grid_vals,
      control = control_grid(save_pred = TRUE),
      metrics = metric_set(rmse)
    )
)
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 5
  splits              id    .metrics          .notes           .predictions
  <list>              <chr> <list>            <list>           <list>      
1 <split [4820/1205]> Fold1 <tibble [25 × 6]> <tibble [0 × 3]> <tibble>    
2 <split [4820/1205]> Fold2 <tibble [25 × 6]> <tibble [0 × 3]> <tibble>    
3 <split [4820/1205]> Fold3 <tibble [25 × 6]> <tibble [0 × 3]> <tibble>    
4 <split [4820/1205]> Fold4 <tibble [25 × 6]> <tibble [0 × 3]> <tibble>    
5 <split [4820/1205]> Fold5 <tibble [25 × 6]> <tibble [0 × 3]> <tibble>    

This fits the model across all grid combinations and CV folds, returning performance metrics for each.

We now inspect how penalty and mixture affect RMSE:

Code
model_res %>% 
  collect_metrics() %>% 
  ggplot(aes(x = penalty, y = mean, 
             color = as.factor(mixture))) +
  geom_point() + 
  geom_line() +
  scale_x_log10(labels = scales::label_number()) +
  facet_wrap(~.metric, scales = "free_y") +
  labs(
    title = "Elastic Net Performance",
    x = "Penalty (log scale)",
    y = "Metric value",
    color = "Mixture"
  ) +
  theme_minimal()

  • This plot helps us visually select the best-performing configuration - the one with the lowest RMSE.

In the next step, we’ll use the best combination to train a final model on the full training set and evaluate it on the test set.

2.2.5 🧾 Step 9: Final Fit and Evaluation

After tuning, we now select the best combination of hyperparameters, retrain the model on the full training data, and evaluate its performance on the held-out test set.

We first extract the hyperparameter values that gave the lowest RMSE during cross-validation:

Code
(best_params <- model_res %>% 
  show_best(metric = "rmse") %>% 
  slice(1))
# A tibble: 1 × 8
       penalty mixture .metric .estimator     mean     n std_err .config        
         <dbl>   <dbl> <chr>   <chr>         <dbl> <int>   <dbl> <chr>          
1 0.0000000001       1 rmse    standard   1122662.     5   9769. Preprocessor1_…

show_best() returns the top-performing combinations, and we take the best one using slice(1).

  • Finalize the Workflow

We now update our workflow() with the best tuning parameters so it’s ready for a final fit:

Code
(final_model <- finalize_workflow(model_workflow, best_params))
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 1e-10
  mixture = 1

Computational engine: glmnet 

This “locks in” the optimal penalty and mixture values and prepares the model for training.

We fit the finalized model on the entire training set:

Code
(final_fit <- final_model %>%
  fit(data = train_data))
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────

Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "gaussian",      alpha = ~1) 

   Df  %Dev Lambda
1   0  0.00 616800
2   1  2.77 562000
3   1  5.07 512100
4   3  8.13 466600
5   4 13.38 425200
6   4 17.78 387400
7   4 21.44 353000
8   4 24.48 321600
9   4 27.00 293000
10  4 29.09 267000
11  4 30.83 243300
12  4 32.27 221700
13  5 33.53 202000
14  5 34.59 184000
15  5 35.47 167700
16  6 36.64 152800
17  6 38.22 139200
18  6 39.54 126900
19  6 40.63 115600
20  6 41.54 105300
21  6 42.30  95960
22  6 42.92  87440
23  6 43.44  79670
24  6 43.88  72590
25  7 44.24  66140
26  7 44.55  60270
27  7 44.82  54910
28  7 45.04  50030
29  7 45.22  45590
30  7 45.37  41540
31  8 45.51  37850
32  9 45.62  34490
33  9 45.72  31420
34  9 45.80  28630
35  9 45.86  26090
36  9 45.92  23770
37  9 45.96  21660
38  9 46.00  19730
39  9 46.03  17980
40  9 46.06  16380
41  9 46.08  14930
42  9 46.10  13600
43 10 46.11  12390
44 10 46.13  11290
45 10 46.14  10290
46 10 46.15   9375

...
and 27 more lines.

To inspect the fitted coefficients:

Code
final_fit %>% 
  extract_fit_parsnip() %>% 
  tidy()
# A tibble: 14 × 3
   term                   estimate      penalty
   <chr>                     <dbl>        <dbl>
 1 (Intercept)            2810433. 0.0000000001
 2 size_sqm                 37077. 0.0000000001
 3 bedrooms_X2              -9035. 0.0000000001
 4 bedrooms_X3             136450. 0.0000000001
 5 bedrooms_X4             205907. 0.0000000001
 6 bedrooms_X5.            109093. 0.0000000001
 7 bathrooms_X2            533829. 0.0000000001
 8 bathrooms_X3            776334. 0.0000000001
 9 bathrooms_X4            887490. 0.0000000001
10 parking_garage          281385. 0.0000000001
11 parking_none             42287. 0.0000000001
12 parking_off_street       -1695. 0.0000000001
13 parking_secure_parking  -46223. 0.0000000001
14 parking_street           -1018. 0.0000000001

This gives you the model’s intercept and variable coefficients (some may be zero due to Lasso shrinkage).

Sample output:

| Term | Estimate | Interpretation |
|------|----------|----------------|
| (Intercept) | 2,810,433 | Expected price when all (standardized) predictors are at their mean values |
| size_sqm | 37,077 | Larger homes sell for more: roughly R37k per one standard deviation increase in size |
| bedrooms_X3 | 136,450 | 3-bedroom houses are associated with higher prices than the 1-bedroom baseline |
| bathrooms_X2 | 533,829 | Having 2 bathrooms is associated with a substantially higher expected price |
| parking_garage | 281,385 | Having a garage is associated with a higher price than the baseline parking type |

ℹ️ Note: Categorical variables are represented using dummy variables (one-hot encoding); for example, bedrooms_X3 means “3 bedrooms”, with the omitted base level likely being “1 bedroom”. Also, because step_normalize() runs after step_dummy(), every predictor (including the dummy columns) is standardized, so each coefficient measures the change in price per one standard deviation of that predictor rather than per raw unit.

What about the penalty?

The penalty column shows the regularization value applied to each coefficient - it’s the same across all terms here (1e-10) because Elastic Net applies it globally. In your case, the model applied minimal shrinkage, meaning most coefficients were retained.

  • If your penalty were larger, some coefficients might have been reduced toward zero (especially for weak predictors), improving generalization.

This output allows you to interpret the economic meaning of each predictor and verify whether the relationships make sense. From here you can, for example, plot:

  • Predicted vs. actual prices (a sketch follows below)
  • Coefficient magnitudes (feature importance)
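
A minimal sketch of the first of these plots, reusing final_fit and test_data from above:

Code
predict(final_fit, test_data) %>%
  bind_cols(test_data) %>%
  ggplot(aes(x = price, y = .pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(linetype = "dashed", color = "red") +   # perfect-prediction reference line
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Elastic Net: Predicted vs Actual Prices",
       x = "Actual price", y = "Predicted price") +
  theme_minimal()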

Finally, we assess how well the model generalizes using the unseen test set:

Code
predict(final_fit, test_data) %>%
  bind_cols(test_data) %>%
  metrics(truth = price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator   .estimate
  <chr>   <chr>            <dbl>
1 rmse    standard   1156311.   
2 rsq     standard         0.418
3 mae     standard    880710.   

Output:

| Metric | Value | Interpretation |
|--------|-------|----------------|
| RMSE | R1,156,311 | On average, model predictions are off by roughly ±R1.16 million |
| R² | 0.418 | About 42% of the variation in price is explained by the model |
| MAE | R880,710 | The typical absolute error (non-directional) is around R881k |

  • RMSE (Root Mean Squared Error): Sensitive to large errors. Indicates the model’s average error magnitude.
  • R² (Coefficient of Determination): Measures the proportion of variance in the target variable that is explained by the model.
    • A value of 0.418 means the model captures moderate structure - there’s room to improve.
  • MAE (Mean Absolute Error): More robust to outliers. It tells us the average error in Rand terms.
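
For completeness, the usual textbook definitions of these three metrics are given below; note that yardstick’s rsq() computes the squared correlation between truth and estimate, which is closely related to the traditional R² shown here.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_i\left(y_i-\hat{y}_i\right)^2}{\sum_i\left(y_i-\bar{y}\right)^2}
$$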

2.3 Trees

2.3.1 🌲 Step 5: Model Specification - Random Forest

Let’s now try a tree-based model: the Random Forest. This is a powerful and flexible machine learning algorithm that often performs well out of the box, especially with tabular data.

A Random Forest is an ensemble method built on top of decision trees. It works by:

  • Fitting many trees on bootstrap samples of the training data
  • Using a random subset of predictors at each split (mtry)
  • Averaging the predictions from all trees to reduce variance

This makes it:

  • Robust to overfitting
  • Good at handling nonlinearities and interactions
  • Insensitive to scaling (no need to normalize variables)

We’ll use parsnip::rand_forest() to define a regression forest model with:

  • trees = 100: number of trees in the forest
  • min_n = 10: minimum number of observations in a node before it’s split
  • mtry = tune(): number of predictors to randomly select at each split (this will be tuned)
Code
(
  model_spec <- rand_forest(
  mtry = tune(),        # number of variables to try at each split (tuned)
  min_n = 10,           # control tree complexity
  trees = 100           # total trees in the forest
  ) %>%
  set_engine("ranger", num.threads = 10, 
             importance = "permutation") %>%  # high-performance engine
  set_mode("regression")
)
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 100
  min_n = 10

Engine-Specific Arguments:
  num.threads = 10
  importance = permutation

Computational engine: ranger 

You can inspect the low-level call with:

Code
model_spec %>% translate()
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 100
  min_n = 10

Engine-Specific Arguments:
  num.threads = 10
  importance = permutation

Computational engine: ranger 

Model fit template:
ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    mtry = min_cols(~tune(), x), num.trees = 100, min.node.size = min_rows(~10, 
        x), num.threads = 10, importance = "permutation", verbose = FALSE, 
    seed = sample.int(10^5, 1))

Unlike linear models, Random Forests don’t require normalization or dummy encoding - they handle raw data well, especially with categorical features already converted to factors.

In the next step, we’ll plug this model into the same workflow as before, reuse our recipe, and set up cross-validation and tuning.

2.3.2 ⚙ Step 6: Building the Workflow

Code
(model_workflow <- workflow() %>% 
   add_recipe(housing_rec) %>% 
   add_model(model_spec))
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = 100
  min_n = 10

Engine-Specific Arguments:
  num.threads = 10
  importance = permutation

Computational engine: ranger 

2.3.3 🔁 Step 7: Cross-Validation

Code
(cv_folds <- vfold_cv(train_data, v = 5, repeats = 1))
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits              id   
  <list>              <chr>
1 <split [4820/1205]> Fold1
2 <split [4820/1205]> Fold2
3 <split [4820/1205]> Fold3
4 <split [4820/1205]> Fold4
5 <split [4820/1205]> Fold5

2.3.4 🔍 Step 8: Hyperparameter Tuning – Random Forest

Random forests can be very effective with minimal tuning, but one important parameter to optimize is mtry: the number of predictors randomly considered at each split.

  • If mtry is too small → each tree sees limited information → high bias.
  • If mtry is too large → all trees become similar → less diversity, higher variance.
  • The sweet spot depends on your dataset’s structure.

We define a regular grid over possible values of mtry from 2 to 10.

Code
grid_vals <- grid_regular(mtry(range = c(2, 10)), levels = 5)

This will try 5 evenly spaced values of mtry in the range from 2 to 10.

We use the same workflow() and CV folds from earlier to tune the forest:

Code
model_res <- model_workflow %>% 
  tune_grid(
    resamples = cv_folds,
    grid = grid_vals,
    control = control_grid(save_pred = TRUE),
    metrics = metric_set(rmse)
  )

Each combination is fit and evaluated across the 5 folds, and RMSE is averaged.

We now examine how different mtry values affect RMSE:

Code
model_res %>% 
  collect_metrics() %>% 
  ggplot(aes(x = mtry, y = mean)) +
  geom_point() + 
  geom_line() +
  facet_wrap(~.metric, scales = "free_y") +
  labs(
    title = "Random Forest Performance by mtry",
    x = "mtry (number of variables tried at each split)",
    y = "Mean metric value (lower = better)",
    color = "mtry"
  ) +
  theme_minimal()

2.3.5 🧾 Step 9: Final Fit and Evaluation

Now that we’ve found the best mtry value via cross-validation, we fit the final model on the full training set and evaluate it on the test set.

Code
best_params <- model_res %>% 
  show_best(metric = "rmse") %>% 
  slice(1)

This selects the top-performing mtry value based on lowest RMSE.

We plug the optimal hyperparameters into our workflow and fit it on the entire training data:

Code
(final_model <- finalize_workflow(model_workflow, best_params))
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 8
  trees = 100
  min_n = 10

Engine-Specific Arguments:
  num.threads = 10
  importance = permutation

Computational engine: ranger 
Code
(
final_fit <- final_model %>%
  fit(data = train_data)
)
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~8L,      x), num.trees = ~100, min.node.size = min_rows(~10, x), num.threads = ~10,      importance = ~"permutation", verbose = FALSE, seed = sample.int(10^5,          1)) 

Type:                             Regression 
Number of trees:                  100 
Sample size:                      6025 
Number of independent variables:  13 
Mtry:                             8 
Target node size:                 10 
Variable importance mode:         permutation 
Splitrule:                        variance 
OOB prediction error (MSE):       1.120483e+12 
R squared (OOB):                  0.5195796 

You can inspect the trained model with:

Code
final_fit %>% 
  extract_fit_parsnip()
parsnip model object

Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~8L,      x), num.trees = ~100, min.node.size = min_rows(~10, x), num.threads = ~10,      importance = ~"permutation", verbose = FALSE, seed = sample.int(10^5,          1)) 

Type:                             Regression 
Number of trees:                  100 
Sample size:                      6025 
Number of independent variables:  13 
Mtry:                             8 
Target node size:                 10 
Variable importance mode:         permutation 
Splitrule:                        variance 
OOB prediction error (MSE):       1.120483e+12 
R squared (OOB):                  0.5195796 

Now let’s assess how the model performs on unseen data:

Code
predict(final_fit, test_data) %>%
  bind_cols(test_data) %>%
  metrics(truth = price, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator   .estimate
  <chr>   <chr>            <dbl>
1 rmse    standard   1096200.   
2 rsq     standard         0.477
3 mae     standard    808219.   

This returns:

  • RMSE (Root Mean Squared Error)
  • MAE (Mean Absolute Error)
  • R² (proportion of variance explained)

| Metric | Value | Interpretation |
|--------|-------|----------------|
| RMSE | R1,096,200 | On average, predictions deviate from actual prices by roughly ±R1.1 million |
| R² | 0.477 | ~48% of the variation in price is explained by the model |
| MAE | R808,219 | The typical absolute error in prediction is around R808k |

Compared to the Elastic Net model:

  • ✅ Lower RMSE (better)
  • ✅ Lower MAE (better)
  • ✅ Higher R² (better explanatory power)

This suggests that the Random Forest model outperforms the linear model in this case - likely because it better captures nonlinear interactions between size, room counts and parking type.

🧭 From here, you can continue by:

  • Plotting predicted vs actual values to check residual patterns
  • Extracting feature importance (vip::vip())
  • Comparing models formally side-by-side

Feature Importance with vip

After fitting the final Random Forest model, we can inspect which features had the most influence on predictions using the vip package.

Code
library(vip)
fitted_model <- final_fit %>% extract_fit_parsnip()
vi(fitted_model)
# A tibble: 13 × 2
   Variable                  Importance
   <chr>                          <dbl>
 1 bathrooms_X4           907563122507.
 2 bathrooms_X3           786614035194.
 3 size_sqm               615502431684.
 4 bathrooms_X2           411326320113.
 5 parking_garage         234847851651.
 6 bedrooms_X3            103209980890.
 7 parking_none           100656170257.
 8 bedrooms_X2             88526625053.
 9 bedrooms_X4             79732129716.
10 bedrooms_X5.            57699549675.
11 parking_secure_parking   5923213096.
12 parking_street            866817662.
13 parking_off_street       -300441112.
  • The higher the importance score, the more the variable contributes to reducing error across trees.
  • In this model, number of bathrooms and size (sqm) are the dominant predictors of price.
  • Parking types and bedroom counts have moderate influence.

Because we set importance = "permutation" in the ranger engine, these scores are permutation importances - how much prediction error increases when a feature’s values are shuffled. You can also plot them directly, customizing the display:

Code
vip(fitted_model, num_features = 20, geom = "col", 
    aesthetics = list(fill = "steelblue")) +
  theme_minimal()

This chart shows the top 20 most influential features, helping you interpret which variables drive predictions in your model.

2.4 Bayesian Optimisation

2.4.1 🔧 Step 5: Model Specification – Boosted Trees (XGBoost)

In this step, we define a more complex machine learning model using XGBoost - one of the most powerful and widely-used gradient boosting algorithms. It performs well on a wide range of structured (tabular) prediction tasks.

XGBoost (Extreme Gradient Boosting) builds an ensemble of decision trees sequentially, where each tree attempts to correct the mistakes of the previous ones. It supports:

  • Fine control over model complexity
  • Regularization to prevent overfitting
  • Fast computation, even with large datasets

Here we define a tunable boosted tree model using parsnip::boost_tree() and the xgboost engine.

Code
model_spec <- boost_tree(
  trees = tune(),           # Total number of trees (iterations)
  tree_depth = tune(),      # Maximum depth of each tree
  learn_rate = tune(),      # Learning rate (shrinkage)
  loss_reduction = tune(),  # Minimum loss reduction required to make a further split
  sample_size = tune(),     # Subsample ratio of rows
  mtry = tune()             # Number of predictors sampled at each split (colsample_bytree)
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")
  • What we’re tuning:
| Parameter | Meaning | Regularization? | Notes |
|-----------|---------|-----------------|-------|
| tree_depth | Max depth of a tree | ✅ Yes | Controls complexity; deeper trees fit more flexibly |
| learn_rate | Shrinkage step size | ✅ Yes | Lower = more robust but slower |
| loss_reduction | Minimum gain to split | ✅ Yes | Corresponds to gamma in xgboost; prunes weak splits |
| sample_size | Row subsampling rate (0–1) | ✅ Yes | Helps prevent overfitting |
| mtry | Number of predictors randomly selected per tree | ✅ Yes | Reduces correlation between trees |
| trees | Total number of boosting rounds | No | Tune later if needed for the final model |

This setup gives XGBoost the flexibility to learn nonlinear boundaries while controlling for overfitting using multiple regularization mechanisms.

📌 You’ll tune this model using tune_bayes() in upcoming steps - be mindful of runtime, since the search space has high dimensionality.

2.4.2 ⚙ Step 6: Building the Workflow

Code
(model_workflow <- workflow() %>% 
   add_recipe(housing_rec) %>% 
   add_model(model_spec))
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)

Main Arguments:
  mtry = tune()
  trees = tune()
  tree_depth = tune()
  learn_rate = tune()
  loss_reduction = tune()
  sample_size = tune()

Computational engine: xgboost 

2.4.3 🔁 Step 7: Cross-Validation

Code
(cv_folds <- vfold_cv(train_data, v = 5, repeats = 5))
#  5-fold cross-validation repeated 5 times 
# A tibble: 25 × 3
   splits              id      id2  
   <list>              <chr>   <chr>
 1 <split [4820/1205]> Repeat1 Fold1
 2 <split [4820/1205]> Repeat1 Fold2
 3 <split [4820/1205]> Repeat1 Fold3
 4 <split [4820/1205]> Repeat1 Fold4
 5 <split [4820/1205]> Repeat1 Fold5
 6 <split [4820/1205]> Repeat2 Fold1
 7 <split [4820/1205]> Repeat2 Fold2
 8 <split [4820/1205]> Repeat2 Fold3
 9 <split [4820/1205]> Repeat2 Fold4
10 <split [4820/1205]> Repeat2 Fold5
# ℹ 15 more rows

2.4.4 🔍 Step 8: Hyperparameter Tuning

Now that we’ve specified a complex XGBoost model, we use Bayesian optimization to efficiently explore the hyperparameter space.

Why Bayesian Optimization?

  • Traditional grid search becomes inefficient with many parameters
  • Random search is better, but doesn’t learn from past results
  • Bayesian optimization builds a probabilistic model of the error surface and chooses new parameter combinations based on past performance

This leads to faster convergence toward good-performing combinations, especially useful when tuning 4+ hyperparameters.

Code
param_final <- model_workflow %>% 
  extract_parameter_set_dials() %>% 
  finalize(train_data)

⚡ Parallel Processing Setup

Given the number of hyperparameters and resampling iterations in our Bayesian tuning, it’s efficient to run the process in parallel across multiple CPU cores. This can significantly reduce training time.

Why use parallelism?

Each fold x parameter combination is an independent model fit, so it’s ideal for parallel execution. The tune package supports this via the doFuture and future ecosystems.

Code
library(doFuture)         # Backend for parallel model training
registerDoFuture()        # Registers the parallel backend

plan(multisession)        # Launches a separate R session per core

💡 multisession is safe and works across platforms (Windows, macOS, Linux).
On Linux/macOS, plan(multicore) is slightly faster but not supported on Windows.

Once this setup is active, all tune_bayes(), tune_grid(), and fit_resamples() calls will automatically distribute work across available CPU cores.

You can check how many are being used with:

Code
future::availableCores()
system 
    16 

Or manually set the number of workers:

Code
plan(multisession, workers = 4)  # Use 4 cores explicitly

Now you’re ready to run large-scale tuning efficiently!

Code
model_bayes <- tune_bayes(
  model_workflow,
  resamples = cv_folds,               # 5-fold CV folds
  param_info = param_final,           # Parameter space (use finalize())
  metrics = metric_set(rmse),         # Regression metric
  initial = 10,                       # Start with 10 random configurations
  iter = 20,                          # Then refine with 20 adaptive steps
  control = control_bayes(
    verbose = TRUE, 
    time_limit = 5,                   # 5 minutes
    verbose_iter = TRUE,
    no_improve = 5                    # Early stopping after 5 bad iterations
  )
)
| Argument | Purpose |
|----------|---------|
| initial = 10 | Try 10 random combinations first to “seed” the surrogate model |
| iter = 20 | Then perform up to 20 smart iterations based on model feedback |
| no_improve = 5 | Stop early if there is no improvement in the last 5 iterations |
| param_info | The parameter ranges to explore; created with extract_parameter_set_dials() and refined with finalize() |
| metrics | Evaluation metric; here we minimize RMSE |

📦 tune_bayes() is part of the tune package and requires that your parameter set (param_final) be created with extract_parameter_set_dials() or refined with finalize().

Code
model_bayes %>% select_best(metric = "rmse")

The table below shows the optimal combination of hyperparameters selected by Bayesian optimization:

| mtry | trees | tree_depth | learn_rate | loss_reduction | sample_size | .config |
|------|-------|------------|------------|----------------|-------------|---------|
| 2 | 1700 | 14 | 0.0130 | 0.00125 | 0.355 | Iter1 |

Interpretation of parameters

| Parameter | Value | Meaning |
|-----------|-------|---------|
| mtry | 2 | Randomly sample 2 predictors at each tree split |
| trees | 1700 | Model was trained over 1700 boosting rounds |
| tree_depth | 14 | Each tree can grow up to 14 levels deep |
| learn_rate | 0.013 | Low learning rate (slow but steady improvements) |
| loss_reduction | 0.00125 | Minimum gain required to split; a small value allows deeper exploration |
| sample_size | 0.355 | Only ~35% of training data rows were used in each boosting iteration |
| .config | Iter1 | Configuration ID in the tuning process |

This model favors many deep trees with a low learning rate, which is a typical signature of a strong gradient boosting model.

Next, you’ll finalize the model using these values and fit it to your full training data.
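
A hedged sketch of that final step, reusing model_bayes, model_workflow, train_data and test_data from above (not evaluated here):

Code
# Lock in the best parameters found by Bayesian optimisation and refit on all training data
best_xgb <- model_bayes %>% select_best(metric = "rmse")

final_xgb_fit <- model_workflow %>%
  finalize_workflow(best_xgb) %>%
  fit(data = train_data)

# Evaluate on the held-out test set, as with the earlier models
predict(final_xgb_fit, test_data) %>%
  bind_cols(test_data) %>%
  metrics(truth = price, estimate = .pred)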


2.5 Multiple Models

So far we’ve trained models one by one. Now let’s take it a step further and test multiple models in parallel using the workflowsets package. This lets us compare performance across different modeling approaches in a clean and organized way.

We’ll compare:

  • Ordinary Least Squares (OLS) - no regularization
  • Elastic Net - regularized linear model (with tuning)
  • Random Forest - nonlinear tree ensemble
  • XGBoost - boosted trees, fixed at the hyperparameters found via Bayesian optimization above

2.5.1 🔧 Step 5: Model Spec

We first declare our model blueprints using parsnip. These are not fitted yet.

Code
ols_spec <- linear_reg() %>%
  set_engine("lm")                    

elastic_spec <- linear_reg(
  penalty = tune(), 
  mixture = tune()
) %>%
  set_engine("glmnet")              

rf_spec <- rand_forest(
  mtry = 8,                          
  min_n = 10,
  trees = 100
) %>%
  set_engine("ranger", num.threads = 10) %>%
  set_mode("regression")

xgboost_spec <- boost_tree(
  trees = 1700,        
  tree_depth = 14,     
  learn_rate = 0.0130, 
  loss_reduction = 0.00125,
  sample_size = 0.355,     
  mtry = 2             
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

We’re using a baseline OLS model, a fully tunable Elastic Net, and fixed Random Forest and XGBoost specifications (no tuning - they use the hyperparameter values found earlier).

2.5.2 ⚙ Step 6: Building the workflow_set()

We now create a single object containing all model–recipe combinations:

Code
(
  wf_set <- workflow_set(
    preproc = list(housing = housing_rec),   # one shared recipe
    models = list(
      ols = ols_spec,
      RF = rf_spec,
      xgboost = xgboost_spec,
      elastic_spec = elastic_spec            # named explicitly → wflow_id becomes housing_elastic_spec
    )
  )
)
# A workflow set/tibble: 4 × 4
  wflow_id             info             option    result    
  <chr>                <list>           <list>    <list>    
1 housing_ols          <tibble [1 × 4]> <opts[0]> <list [0]>
2 housing_RF           <tibble [1 × 4]> <opts[0]> <list [0]>
3 housing_xgboost      <tibble [1 × 4]> <opts[0]> <list [0]>
4 housing_elastic_spec <tibble [1 × 4]> <opts[0]> <list [0]>
  • 🔧 Each row in wf_set represents a unique combination of recipe + model.

This is a powerful way to streamline model comparison, parallel tuning and ranking based on performance.

2.5.3 🔁 Step 7: Cross-Validation

Next, we’ll tune the models (where needed) and compare their performance using cross-validation.

Code
(cv_folds <- vfold_cv(train_data, v = 5,  repeats = 5))
#  5-fold cross-validation repeated 5 times 
# A tibble: 25 × 3
   splits              id      id2  
   <list>              <chr>   <chr>
 1 <split [4820/1205]> Repeat1 Fold1
 2 <split [4820/1205]> Repeat1 Fold2
 3 <split [4820/1205]> Repeat1 Fold3
 4 <split [4820/1205]> Repeat1 Fold4
 5 <split [4820/1205]> Repeat1 Fold5
 6 <split [4820/1205]> Repeat2 Fold1
 7 <split [4820/1205]> Repeat2 Fold2
 8 <split [4820/1205]> Repeat2 Fold3
 9 <split [4820/1205]> Repeat2 Fold4
10 <split [4820/1205]> Repeat2 Fold5
# ℹ 15 more rows

2.5.4 🔍 Step 8: Tune Multiple Models

Now that we’ve bundled all models into a single workflow_set, we can tune and evaluate them all at once using workflow_map().

This function:

  • Automatically handles tuning for models with tunable parameters (like Elastic Net)
  • Skips tuning for models with no tunable parameters (e.g. OLS, and the Random Forest and XGBoost specs, which are already finalized)
  • Evaluates all models using the same cross-validation folds
  • Returns metrics, predictions, and timing info in a single tidy object
Code
set.seed(345)

results <- wf_set %>%
  workflow_map(
    seed = 456,
    resamples = cv_folds,
    # grid = NA,  # Default grid size (autotune if needed)
    metrics = metric_set(rmse, rsq, ccc),  # Concordance correlation optional
    control = control_grid(save_pred = TRUE)
  )

workflow_map() detects whether each model needs tuning or not, applies the same cv_folds resampling strategy and stores all the results in one object.

This approach simplifies tuning across multiple models, and ensures fair comparisons under the same validation setup.

Next, we’ll rank and visualize model performance.

2.5.5 🧾 Step 9: Final Fit and Evaluation

In this step, we compare the predictive performance of all models using R² - the proportion of variance in the target variable explained by the model.

  • R² (coefficient of determination) tells us how much of the outcome’s variation the model can explain.
  • A higher R² means better goodness-of-fit.
  • Unlike RMSE or MAE, R² is unitless and typically lies between 0 and 1 (though it can be negative if the model performs worse than predicting the mean).
Code
(
rsq_summary <- collect_metrics(results) %>%
  filter(.metric == "rsq") %>%
  group_by(wflow_id) %>%
  summarize(
    mean_rsq = mean(mean),
    std_err = mean(std_err)
  ) %>%
  arrange(desc(mean_rsq))
)
# A tibble: 4 × 3
  wflow_id             mean_rsq std_err
  <chr>                   <dbl>   <dbl>
1 housing_RF              0.516 0.00395
2 housing_xgboost         0.516 0.00372
3 housing_elastic_spec    0.459 0.00397
4 housing_ols             0.459 0.00397
Code
ggplot(rsq_summary, aes(x = reorder(wflow_id, mean_rsq), y = mean_rsq)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(aes(ymin = mean_rsq - std_err, ymax = mean_rsq + std_err),
                width = 0.2, color = "darkgray") +
  labs(
    title = "Model Comparison by R²",
    x = "Model",
    y = "Mean R²"
  ) +
  theme_minimal()

  • A model with higher R² and narrow error bars indicates both strong and stable performance across folds.

Use this visualization alongside RMSE/MAE comparisons to determine: - Which model fits the data best overall - Whether gains in predictive power justify model complexity

  • Let’s also check RMSE
Code
(
rmse_summary <- collect_metrics(results) %>%
  filter(.metric == "rmse") %>%
  group_by(wflow_id) %>%
  summarize(
    mean_rmse = mean(mean),
    std_err = mean(std_err)
  ) %>%
  arrange(desc(mean_rmse))
)
# A tibble: 4 × 3
  wflow_id             mean_rmse std_err
  <chr>                    <dbl>   <dbl>
1 housing_ols           1123539.   4615.
2 housing_elastic_spec  1123492.   4619.
3 housing_xgboost       1063401.   4383.
4 housing_RF            1063302.   4986.
Code
ggplot(rmse_summary, aes(x = reorder(wflow_id, mean_rmse), y = mean_rmse)) +
  geom_point(size = 3, color = "steelblue") +
  geom_errorbar(aes(ymin = mean_rmse - std_err, ymax = mean_rmse + std_err),
                width = 0.2, color = "darkgray") +
  labs(
    title = "Model Comparison by RMSE",
    x = "Model",
    y = "Mean RMSE"
  ) +
  theme_minimal()
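
To carry the winner forward, one option is to pull the top-ranked workflow out of the set and evaluate it once on the original train/test split. A sketch, assuming (as here) that the winning workflow has no unresolved tuning parameters:

Code
# Identify the top-ranked workflow by RMSE and pull it out of the set
best_id <- rank_results(results, rank_metric = "rmse") %>%
  slice(1) %>%
  pull(wflow_id)

best_wf <- extract_workflow(results, id = best_id)

# The RF workflow has fixed hyperparameters, so it can go straight to last_fit();
# a tuned workflow (e.g. the Elastic Net) would first need finalize_workflow()
final_res <- last_fit(best_wf, split = split)
collect_metrics(final_res)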

And that is why I love RF so much