Forecasting personal consumption expenditure (PCE) with Keras
In this section, we build a deep learning model using the keras3 package in R to forecast Personal Consumption Expenditure (PCE) based on macroeconomic indicators. We will use a feedforward neural network (fully connected layers) and evaluate its ability to learn nonlinear relationships in the data.
Our features include:

- pop: Total population
- psavert: Personal savings rate
- uempmed: Median duration of unemployment
- unemploy: Number of unemployed individuals
The goal is to predict the value of pce at each point in time. We use a time-aware split: the first 80% of the series for training, and the last 20% for testing.
To simulate a 1-step-ahead forecast, we shift the target variable (pce) forward. This means:
The features (pop, psavert, uempmed, unemploy) are taken at time t
The target is pce_{t+1} - the PCE at the next time step
This shift transforms the original time series into a supervised learning dataset and lets us use a standard feedforward neural network for genuine forecasting rather than mere in-sample fitting: the final row pairs known data at time T with the unobserved pce_{T+1}, so predicting it is a true forecast.
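As a minimal sketch of this setup, assuming the data is the economics tibble from ggplot2 (which contains these variables as 574 monthly observations) and using illustrative object names such as supervised and pce_next:

```r
library(dplyr)

# Features at time t; the target is pce at t + 1
supervised <- ggplot2::economics %>%
  mutate(pce_next = lead(pce)) %>%                    # pce_{t+1}
  select(pop, psavert, uempmed, unemploy, pce_next)

# The final row holds features at time T with an unknown pce_{T+1}:
# it is the input for the genuine forecast
forecast_input <- filter(supervised, is.na(pce_next))
supervised     <- filter(supervised, !is.na(pce_next))

# Time-aware split: first 80% of the series for training, last 20% for testing
n_train   <- floor(0.8 * nrow(supervised))
train_tbl <- supervised[seq_len(n_train), ]
test_tbl  <- supervised[(n_train + 1):nrow(supervised), ]
```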
Why 256 units?

This is a heuristic choice: enough capacity to model nonlinear interactions without being too large for small datasets.
It gives the model the ability to learn complex feature interactions.
You might start with 64, 128, or 256 and tune from there.
Why ReLU?
It introduces nonlinearity while being fast to compute.
Avoids saturation issues seen in sigmoid/tanh for deep nets.
ReLU is the default hidden activation in modern deep learning.
This second hidden layer (layer_dense(units = 128, activation = 'relu') + layer_dropout(rate = 0.3)) allows the network to model deeper nonlinear relationships. Using fewer units (128 < 256) follows a common practice of layer tapering, i.e., narrowing the network as you go deeper.
Why Dropout?
Dropout randomly turns off neurons during training to prevent overfitting.
Particularly useful when the dataset is small relative to the network’s capacity, as with our 574 observations.
A rate between 0.2–0.5 is typical.
0.4 is slightly aggressive, which can be useful if the model starts overfitting early.
Output layer: layer_dense(units = 1)

This is the output layer for regression:

- Only one unit, since we are predicting a single continuous value (PCE).
- No activation function, so the output can take any real value, which is appropriate for unbounded regression targets.
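Putting these pieces together, the architecture described above can be sketched in keras3 roughly as follows; the four-feature input shape is an assumption based on our predictor set, and the 0.4 dropout after the first layer is the rate discussed above:

```r
library(keras3)

model <- keras_model_sequential(input_shape = 4) %>%  # 4 predictors: pop, psavert, uempmed, unemploy
  layer_dense(units = 256, activation = 'relu') %>%   # wide first hidden layer for nonlinear interactions
  layer_dropout(rate = 0.4) %>%                       # slightly aggressive dropout for a small dataset
  layer_dense(units = 128, activation = 'relu') %>%   # tapered second hidden layer
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1)                              # single linear output for the regression target
```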
Inspect
Code
summary(model)
1.4.3 Compile and Fit
Before training the neural network, we must compile it. This step defines:
The loss function - what the model tries to minimize
The optimizer - the algorithm that updates weights during training
Evaluation metrics - how we track performance during training and validation
Code
```r
model %>% compile(
  loss      = 'mse',
  optimizer = optimizer_adam(),
  metrics   = list('mae')
)
```
Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments.
Once the model is compiled, we train it on the data using the fit() function. This is where the model learns the relationship between inputs (x) and the target (y) by adjusting its internal weights over multiple passes through the data.
x = features, y = target: These are the input and output data used for training.
epochs = 50: The number of full passes through the training data. More epochs allow the model to learn more, but risk overfitting if too high.
batch_size = 16: The number of samples the model processes before updating weights. Smaller batches provide more frequent updates, but may be noisier.
validation_split = 0.2: 20% of the training data is held out for validation. This helps monitor the model’s performance on unseen data during training, useful for diagnosing overfitting or underfitting.
The fit() function returns a history object that stores the training and validation loss/metrics for each epoch, which can be visualised later.
Code
```r
history <- model %>% fit(
  x = features,
  y = target,
  epochs = 50,
  batch_size = 16,
  validation_split = 0.2
)

plot(history)
```
We now move from the feedforward model to an LSTM. LSTM networks are sensitive to the scale and distribution of input data. To ensure stable and efficient training, we apply the following preprocessing steps using the recipes package:
step_sqrt(): Applies a square-root transformation to all numeric predictors. This helps to reduce the effect of large outliers and skewed distributions.
step_center() and step_scale(): Standardize the predictors by subtracting the mean and dividing by the standard deviation. This ensures that all inputs are on a similar scale - essential for neural network training.
We apply these transformations to the predictors only (not the outcome variable pce) and prepare the data with prep() and bake():
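A sketch of this preprocessing, assuming train_data and test_data hold the training and testing splits with the four predictors and the outcome pce (object names are illustrative):

```r
library(recipes)

rec_obj <- recipe(pce ~ pop + psavert + uempmed + unemploy, data = train_data) %>%
  step_sqrt(all_numeric_predictors()) %>%    # dampen skew and large values
  step_center(all_numeric_predictors()) %>%  # subtract the training mean
  step_scale(all_numeric_predictors()) %>%   # divide by the training standard deviation
  prep()

train_processed <- bake(rec_obj, new_data = train_data)
test_processed  <- bake(rec_obj, new_data = test_data)
```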
If we want to invert the scaling of model predictions later (e.g., return forecasts to the original PCE scale), we need to extract the mean and standard deviation used for the outcome variable:
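One simple way to obtain these values, assuming pce is normalized with statistics computed on the training window (names are illustrative):

```r
# Centering and scaling statistics for the outcome, taken from the training window
center_pce <- mean(train_data$pce)
scale_pce  <- sd(train_data$pce)

# A normalized prediction can later be returned to the original PCE scale with:
# prediction_original <- prediction_scaled * scale_pce + center_pce
```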
Before building an LSTM model, it’s important to understand how the data must be structured and what architectural choices affect training stability and performance. Below are a few key considerations when working with LSTM models in Keras:
Tensor Format
Predictors (X) must be a 3D array with dimensions: [samples, timesteps, features]
where:
samples: number of observations (rows of training data)
timesteps: number of lags used per sample
features: number of input variables
Targets (y) must be a 2D array with dimensions: [samples, timesteps] or [samples, 1]
depending on whether you predict a single value or a full sequence.
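For example, with 1 time step and a single feature (a lagged value of pce), the arrays could be built like this; the vectors here are placeholders standing in for one 121-row training window:

```r
# Placeholder vectors standing in for one processed 121-row training window
x_train_vec <- rnorm(121)   # lagged (input) values
y_train_vec <- rnorm(121)   # target values

# Predictors: [samples, timesteps, features]
x_train_arr <- array(x_train_vec, dim = c(length(x_train_vec), 1, 1))

# Targets: [samples, 1]
y_train_arr <- array(y_train_vec, dim = c(length(y_train_vec), 1))

dim(x_train_arr)  # 121 1 1
dim(y_train_arr)  # 121 1
```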
Training/Testing Size Compatibility
The lengths of the training and testing sets should be evenly divisible (e.g., training length / testing length is a whole number), especially when using stateful LSTMs or batching.
This helps ensure clean reshaping and batch alignment during training.
Batch Size
Batch size defines how many training examples are processed before updating weights.
To ensure compatibility, both training size / batch size and testing size / batch size must be whole numbers (i.e., no remainder).
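A quick sanity check along these lines (the testing size of 11 is purely illustrative):

```r
train_length <- 121  # training window size
test_length  <- 11   # illustrative testing size
batch_size   <- 11

train_length %% batch_size == 0  # TRUE: 121 = 11 * 11
test_length  %% batch_size == 0  # TRUE: 11  = 11 * 1
```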
Time Steps
A time step is the number of lags used in each input sequence.
In our setup, we use 1 time step, which means we are using one lagged value to predict the next observation.
Epochs
The number of epochs controls how many times the model will iterate over the entire training set.
More epochs typically improve learning, but excessive training can lead to overfitting, visible when the validation loss stops improving.
Model Design Choices
Based on the guidelines above, we define the following modeling plan:
Training window:
Each resample uses 121 months of past data (10 years) for training. This is a moving window that shifts forward one month at a time.
Forecast horizon:
We perform 1-step-ahead forecasts using a single month as the assessment set in each resample. This provides a detailed view of model accuracy across time.
Time steps:
We set time steps = 1, which means the LSTM uses only the most recent observation (1 lag) from each time step. This is a simplified but effective structure for 1-step forecasting.
Resample count:
With 574 months of data and a 121-month window, this results in 453 rolling resamples, providing robust evaluation coverage across the entire dataset.
Batch size:
We choose a batch size of 11, which evenly divides into the 121 training rows (121 / 11 = 11). This ensures consistency and avoids batching issues in Keras.
Epochs:
We start with 300 epochs, allowing the model to sufficiently learn from the training window. This can be tuned or reduced using early stopping if validation performance plateaus.
This plan provides a balance between accuracy, model complexity, and computational efficiency.
Code
```r
# Model configuration for sliding LSTM forecast
lag_setting  <- 1    # Number of lags (time steps) used as input
batch_size   <- 11   # Divides evenly into 121 training observations
train_length <- 121  # Length of each training window (10 years)
tsteps       <- 1    # Number of time steps fed into the LSTM
epochs       <- 300  # Number of training iterations
```
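With these settings in place, the 453 rolling resamples described above could, for instance, be generated with rsample's rolling_origin(), assuming the underlying data is the 574-row economics tibble from ggplot2:

```r
library(rsample)

rolling_splits <- rolling_origin(
  ggplot2::economics,
  initial    = train_length,  # 121-month training window
  assess     = 1,             # 1-month assessment set (the 1-step-ahead forecast)
  cumulative = FALSE          # sliding window that moves forward one month at a time
)

nrow(rolling_splits)  # 453 resamples (574 - 121 - 1 + 1)
```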