Practical

1 libraries

1.1 Basic setup

Code

library(tidyverse)
library(knitr)
library(glue)
library(bertheme)
library(tidymodels)
library(modeltime)

knitr::opts_chunk$set(
  echo = TRUE,
  eval = FALSE,
  message = FALSE,
  warning = FALSE,
  fig.width = 6,
  fig.height = 5,
  fig.align='center',
  cache = FALSE
  )

1.2 Functions

Code

# Plotting function for a single split
plot_split <- function(split, expand_y_axis = TRUE, alpha = 1, size = 1, base_size = 14) {
    
    # Manipulate data
    train_tbl <- training(split) %>%
        add_column(key = "training") 
    
    test_tbl  <- testing(split) %>%
        add_column(key = "testing") 
    
    data_manipulated <- bind_rows(train_tbl, test_tbl) %>%
        as_tbl_time(index = date) %>%
        mutate(key = fct_relevel(key, "training", "testing"))
        
    # Collect attributes
    train_time_summary <- train_tbl %>%
        tk_index() %>%
        tk_get_timeseries_summary()
    
    test_time_summary <- test_tbl %>%
        tk_index() %>%
        tk_get_timeseries_summary()
    
    # Visualize
    g <- data_manipulated %>%
        ggplot(aes(x = date, y = pce, color = key)) +
        geom_line(size = size, alpha = alpha) +
        theme_tq(base_size = base_size) +
        scale_color_tq() +
        labs(
            title    = glue("Split: {split$id}"),
            subtitle = glue("{train_time_summary$start} to {test_time_summary$end}"),
            y = "", x = ""
        ) +
        theme(legend.position = "none") 
    
    return(g)
}

1.3 Keras Package

Keras is a high-level API designed to make deep learning fast, flexible and user-friendly - ideal for both research and production.

Runs on CPU or GPU: Same code runs seamlessly on either.
User-friendly: Intuitive syntax for rapid model development and experimentation.
Built-in support: Provides ready-to-use components for convolutional networks (CNNs), recurrent networks (RNNs) or combinations of both.
Flexible architectures: Supports complex designs such as:
- Multi-input and multi-output models
- Shared layers
- Composable models like memory networks or neural Turing machines

Keras enables quick transitions from ideas to working prototypes, making it a powerful tool for deep learning workflows.

Code

install.packages("keras3", Ncpus = 5)
library(keras3)

1.4 Basic Example (Multilayer Perceptron (MLP))

Forecasting personal consumption expenditure (PCE) with Keras

In this section, we build a deep learning model using the keras3 package in R to forecast Personal Consumption Expenditure (PCE) based on macroeconomic indicators. We will use a feedforward neural network (fully connected layers) and evaluate its ability to learn nonlinear relationships in the data.

Our features include: - pop: Total population - psavert: Personal savings rate - uempmed: Median duration of unemployment - unemploy: Number of unemployed individuals

The goal is to predict the value of pce at each point in time. We use a time-aware split: the first 80% of the series for training, and the last 20% for testing.

To simulate a 1-step-ahead forecast, we shift the target variable (pce) forward. This means:

The features (pop, psavert, uempmed, unemploy) are taken at time t
The target is pce_{t+1} - the PCE at the next time step

This setup allows us to use a standard feedforward neural network for forecasting, not just fitting. This transforms the original time series into a supervised learning dataset, where the last row in the dataset corresponds to a genuine forecast, since it uses known data at time T to predict pce_{T+1}.

Code

library(keras3)
library(tidyverse)

data <- ggplot2::economics %>%
  mutate(pce_lead = lead(pce)) %>% 
  drop_na()

1.4.1 Prepare data

Code

features <- data %>%
  select(pop, psavert, uempmed, unemploy) %>%
  scale() %>%
  as.matrix()

target <- data$pce

1.4.2 Define model

Code

model <- keras_model_sequential()  %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = ncol(features)) %>% 
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.3)  %>% 
  layer_dense(units = 1)  # regression output (no activation)

Why 256 units?
- This is a heuristic choice — enough capacity to model nonlinear interactions without being too large for small datasets.
- It gives the model the ability to learn complex feature interactions.
- You might start with 64, 128, or 256 and tune from there.
Why ReLU?
- It introduces nonlinearity while being fast to compute.
- Avoids saturation issues seen in sigmoid/tanh for deep nets.
- ReLU is the default hidden activation in modern deep learning.
- This second hidden layer (layer_dense(units = 128, activation = 'relu') + layer_dropout(rate = 0.3)) allows the network to model deeper nonlinear relationships. Using fewer units (128 < 256) follows a common practice of layer tapering, i.e., narrowing the network as you go deeper.

Using fewer units (128 < 256) follows a common practice of layer tapering, i.e., narrowing the network as you go deeper.

Why Dropout?
- Dropout randomly turns off neurons during training to prevent overfitting.
- Particularly useful when your dataset is small relative to the network’s capacity - like your 574 observations.
- A rate between 0.2–0.5 is typical.
- 0.4 is slightly aggressive, which can be useful if the model starts overfitting early.
Output = layer_dense(units = 1)
This is the output layer for regression: (1) Only one unit since you’re predicting a single continuous value (PCE). (2) No activation function, this allows the output to take any real value, which is appropriate for unbounded regression targets.
Inspect

Code

summary(model)

1.4.3 Compile and Fit

Before training the neural network, we must compile it. This step defines:

The loss function - what the model tries to minimize
The optimizer - the algorithm that updates weights during training
Evaluation metrics - how we track performance during training and validation

Code

model %>%  compile(
  loss = 'mse',
  optimizer = optimizer_adam(),
  metrics = list('mae')
)

Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.

Once the model is compiled, we train it on the data using the fit() function. This is where the model learns the relationship between inputs (x) and the target (y) by adjusting its internal weights over multiple passes through the data.

x = features, y = target: These are the input and output data used for training.
epochs = 50: The number of full passes through the training data. More epochs allow the model to learn more, but risk overfitting if too high.
batch_size = 16: The number of samples the model processes before updating weights. Smaller batches provide more frequent updates, but may be noisier.
validation_split = 0.2: 20% of the training data is held out for validation. This helps monitor the model’s performance on unseen data during training, useful for diagnosing overfitting or underfitting.

The fit() function returns a history object that stores the training and validation loss/metrics for each epoch, which can be visualised later.

Code

history <- model  %>% fit(
  x = features,
  y = target,
  epochs = 50,
  batch_size = 16,
  validation_split = 0.2
)
plot(history)

Evaluate the models performance on the test data:

Code

model %>% predict()

2 LSTM

For more features of modeltime and friends: https://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html and http://datasideoflife.com/?p=1171

Code

# Core Tidyverse
library(tidyverse)
library(glue)
library(forcats)

# Time Series
library(timetk)
library(tidyquant)
library(tibbletime)

# Visualization
library(cowplot)

# Preprocessing
library(recipes)

# Sampling / Accuracy
library(rsample)
library(yardstick) 

# Modeling
library(keras3)

Code

rolling_origin()

2.1 Data setup

574 observations

Code

data <- ggplot2::economics %>%
  select(date, pce, pop, psavert, uempmed, unemploy) %>%
  drop_na()

Use 10 years of historical data (120 months) as the training window
Predict 1 month ahead as the validation target
Roll the window forward one step at a time until the end of the dataset
Use cumulative = TRUE - so each new window includes all prior observations (simulating a growing dataset as time progresses)
Use no skip, meaning each resample moves forward month by month

Code

# 10 years of training
periods_train <- 12 * 10   
periods_test  <- 1

rolling_origin_resamples <- sliding_period(
  data,
  # Date Column to Use
  date,
  "month",
  lookback = 120,
  assess_stop = 1, 
  complete = TRUE,
  step = 1L,
  skip = 0L
)

# Sliding window resampling 
# Sliding period resampling 
# A tibble: 453 × 2
#    splits          id      
#    <list>          <chr>   
#  1 <split [121/1]> Slice001
#  2 <split [121/1]> Slice002
#  3 <split [121/1]> Slice003
#  4 <split [121/1]> Slice004
#  5 <split [121/1]> Slice005
#  6 <split [121/1]> Slice006
#  7 <split [121/1]> Slice007
#  8 <split [121/1]> Slice008
#  9 <split [121/1]> Slice009
# 10 <split [121/1]> Slice010

Code

analysis(rolling_origin_resamples$splits[[1]]) %>% tail
assessment(rolling_origin_resamples$splits[[1]])

# > analysis(rolling_origin_resamples$splits[[1]]) %>% tail
# # A tibble: 6 × 6
#   date         pce    pop psavert uempmed unemploy
#   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
# 1 1977-02-01 1231. 219344     9.3     7.2     7443
# 2 1977-03-01 1238. 219504    10.5     7.2     7307
# 3 1977-04-01 1247. 219684    10.5     7.3     7059
# 4 1977-05-01 1257. 219859    10.3     7.9     6911
# 5 1977-06-01 1264. 220046    10.6     6.2     7134
# 6 1977-07-01 1280. 220239    10.5     7.1     6829
# > assessment(rolling_origin_resamples$splits[[1]])
# # A tibble: 1 × 6
#   date         pce    pop psavert uempmed unemploy
#   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
# 1 1977-08-01 1286. 220458    10.9       7     6925

Code

split    <- rolling_origin_resamples$splits[[11]]
split_id <- rolling_origin_resamples$id[[11]]

plot_split(split, expand_y_axis = FALSE, size = 0.5) +
    theme(legend.position = "bottom") +
    ggtitle(glue("Split: {split_id}"))

Code

df_trn <- training(split)
df_tst <- testing(split)

df <- bind_rows(
    df_trn %>% add_column(key = "training"),
    df_tst %>% add_column(key = "testing")) %>% 
    as_tbl_time(index = date)

df

2.2 Preprocess

LSTM networks are sensitive to the scale and distribution of input data. To ensure stable and efficient training, we apply the following preprocessing steps using the recipes package:

step_sqrt(): Applies a square-root transformation to all numeric predictors. This helps to reduce the effect of large outliers and skewed distributions.
step_center() and step_scale(): Standardize the predictors by subtracting the mean and dividing by the standard deviation. This ensures that all inputs are on a similar scale - essential for neural network training.

We apply these transformations to the predictors only (not the outcome variable pce) and prepare the data with prep() and bake():

Code

rec_obj <- recipe(pce ~ ., df) %>%
    step_sqrt(all_numeric_predictors()) %>%
    step_center(all_numeric_predictors()) %>%
    step_scale(all_numeric_predictors()) %>%
    prep()

df_processed_tbl <- bake(rec_obj, df)

df_processed_tbl

If we want to invert the scaling of model predictions later (e.g., return forecasts to the original PCE scale), we need to extract the mean and standard deviation used for the outcome variable:

Code

center_history <- rec_obj$steps[[2]]$means["value"]
scale_history  <- rec_obj$steps[[3]]$sds["value"]

c("center" = center_history, "scale" = scale_history)

2.3 Model building

2.4 LSTM Modeling Plan

Before building an LSTM model, it’s important to understand how the data must be structured and what architectural choices affect training stability and performance. Below are a few key considerations when working with LSTM models in Keras:

Tensor Format

Predictors (X) must be a 3D array with dimensions:
[samples, timesteps, features]
where:
- samples: number of observations (rows of training data)
- timesteps: number of lags used per sample
- features: number of input variables
Targets (y) must be a 2D array with dimensions:
[samples, timesteps] or [samples, 1]
depending on whether you predict a single value or a full sequence.

Training/Testing Size Compatibility

The length of the training and testing sets should be evenly divisible, especially when using stateful LSTMs or batching.
This helps ensure clean reshaping and batch alignment during training.

Batch Size

Batch size defines how many training examples are processed before updating weights.
To ensure compatibility:
- training size / batch size and
- testing size / batch size
  must both be whole numbers (i.e., no remainder).

Time Steps

A time step is the number of lags used in each input sequence.
In our setup, we use 1 time step, which means we are using one lagged value to predict the next observation.

Epochs

The number of epochs controls how many times the model will iterate over the entire training set.
More epochs typically improve learning, but excessive training can lead to overfitting, visible when the validation loss stops improving.

Model Design Choices

Based on the guidelines above, we define the following modeling plan:

Training window:
Each resample uses 121 months of past data (10 years) for training. This is a moving window that shifts forward one month at a time.
Forecast horizon:
We perform 1-step-ahead forecasts using a single month as the assessment set in each resample. This provides a detailed view of model accuracy across time.
Time steps:
We set time steps = 1, which means the LSTM uses only the most recent observation (1 lag) from each time step. This is a simplified but effective structure for 1-step forecasting.
Resample count:
With 574 months of data and a 121-month window, this results in 453 rolling resamples, providing robust evaluation coverage across the entire dataset.
Batch size:
We choose a batch size of 11, which evenly divides into the 121 training rows (121 / 11 = 11). This ensures consistency and avoids batching issues in Keras.
Epochs:
We start with 300 epochs, allowing the model to sufficiently learn from the training window. This can be tuned or reduced using early stopping if validation performance plateaus.

This plan provides a balance between accuracy, model complexity, and computational efficiency.

Code

# Model configuration for sliding LSTM forecast
lag_setting  <- 1       # Number of lags (time steps) used as input
batch_size   <- 11      # Divides evenly into 121 training observations
train_length <- 121     # Length of each training window (10 years)
tsteps       <- 1       # Number of time steps fed into the LSTM
epochs       <- 300     # Number of training iterations

2.4.1 2D And 3D Train/Test Arrays

Code

# Define predictors to include
predictor_vars <- c("pop", "psavert", "uempmed", "unemploy")

# Create lagged predictors
df_lagged <- df_processed_tbl %>%
  group_by(key) %>%
  mutate(across(all_of(predictor_vars), ~ lag(.x, n = lag_setting), .names = "lag_{.col}")) %>%
  drop_na() %>%
  ungroup()

# Split into training and testing sets
train_data <- df_lagged %>% filter(key == "training")
test_data  <- df_lagged %>% filter(key == "testing")

# Convert predictors to arrays
x_train_arr <- train_data %>%
  select(starts_with("lag_")) %>%
  as.matrix() %>%
  array(dim = c(nrow(.), 1, length(predictor_vars)))

x_test_arr <- test_data %>%
  select(starts_with("lag_")) %>%
  as.matrix() %>%
  array(dim = c(nrow(.), 1, length(predictor_vars)))

# Convert target to arrays
y_train_arr <- train_data %>%
  pull(pce) %>%
  array(dim = c(length(.), 1))

y_test_arr <- test_data %>%
  pull(pce) %>%
  array(dim = c(length(.), 1))

2.5 Model

Code

model <- keras_model_sequential() %>%
  layer_lstm(units = 50, input_shape = c(1, length(predictor_vars))) %>%
  layer_dense(units = 1)


model %>% compile(loss = 'mae', optimizer = 'adam')

summary(model)

history <- model %>% fit(
  x = x_train_arr,
  y = y_train_arr,
  epochs = 50,
  batch_size = 11,
  validation_split = 0.2
)

predictions <- model %>% predict(x_test_arr)