US Treasury Yield prediction with Deep Learning


Using multivariate multistep LSTM based Encoder-Decoder architecture with the Attention mechanism.

Sample comparison of actual yield curve vs predicted yield curve

Table of Contents

  1. Motivation
  2. Prior Knowledge
  3. Data & Exploration
  4. Modelling
  5. Results
  6. Sample Predictions
  7. Conclusion & Next Steps


Shifts in the shape and slope of the yield curve are thought to reflect investor expectations for the economy and interest rates. In normal times, short term rates are lower than long term rates. It is hypothesised, though not firmly established, that around recessions and other adverse events this trend flattens or even reverses: investors rush to go long on Bonds with longer maturity, driving long term yields down until they fall below short term rates.

Knowing how this curve evolves over time is therefore always an advantage, as many economists believe that the US Treasury Yield can predict the movements of other financial markets such as stocks, futures and options markets. Trading in US Treasury issuances has a huge influence on the global economy as well.

Most financial prediction models, which are essentially multidimensional time-series models, focus only on predicting stock prices. Since the U.S. Treasury Yield can be viewed as a multidimensional time series as well, the objective here is to apply such models to predict the term structure.

In this article, I will describe how I used a Long Short Term Memory (LSTM) based Encoder-Decoder Network with the Attention mechanism to predict the evolution of Term Structure over the next 10 days by looking at the previous 30 days' rates.

Prior Knowledge

Recurrent Neural Networks are well suited for learning relationships across time, but they cannot maintain long term temporal dependencies because of the vanishing gradient problem: the gradient of the loss function decays exponentially with time. The LSTM is a type of recurrent neural network that is good at remembering long term temporal dependencies through a set of special gates, as shown below.

Single LSTM Cell Structure
Single LSTM Cell Structure. The image is taken from David Foster’s Generative Deep Learning Book by O’Reilly Media.
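In PyTorch (the library used later in this article), a minimal sketch of an LSTM processing a 30-day window looks like this. The dimensions are illustrative only: 11 features per day (one per maturity) and a small hidden size of 8.

```python
import torch
import torch.nn as nn

# A single-layer LSTM over a toy sequence: 11 input features per
# timestep (one per maturity), hidden size 8 for illustration.
lstm = nn.LSTM(input_size=11, hidden_size=8, batch_first=True)

x = torch.randn(1, 30, 11)      # (batch, timesteps, features)
outputs, (h_n, c_n) = lstm(x)   # outputs: hidden state at every timestep

print(outputs.shape)  # torch.Size([1, 30, 8])
print(h_n.shape)      # torch.Size([1, 1, 8]) -- final hidden state only
```

Note that `outputs` keeps the hidden state for every timestep, while `h_n` holds only the final one; this distinction matters for the Attention mechanism discussed next.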

Encoder-Decoder architectures are used for the sequence to sequence modelling and are fairly common in Deep Learning based Natural Language Processing tasks such as Question Answering, Machine Translation, Language Modelling, etc. Here, they are being used for multivariate multistep time-series forecasting since this is a sequence to sequence modelling task as well.

In normal Encoder-Decoder architectures based on Recurrent Neural Networks, only the final timestep's hidden state after encoding is used as the feature representation vector. The Attention mechanism instead builds this representation as a weighted sum of the hidden states from all timesteps.
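The weighted sum at the heart of Attention can be sketched in a few lines. This is a minimal stand-in, not the exact scoring function used later: a fixed random vector plays the role of the learned scoring parameters.

```python
import torch
import torch.nn.functional as F

def attention_context(encoder_states: torch.Tensor) -> torch.Tensor:
    """Weighted sum of encoder hidden states (minimal attention sketch).

    encoder_states: (batch, timesteps, hidden)
    returns:        (batch, hidden) context vector
    """
    hidden = encoder_states.size(-1)
    score_vec = torch.randn(hidden)        # stand-in for learned weights
    scores = encoder_states @ score_vec    # one score per timestep
    weights = F.softmax(scores, dim=1)     # attention weights sum to 1
    # Weighted sum across the time dimension
    return (weights.unsqueeze(-1) * encoder_states).sum(dim=1)

context = attention_context(torch.randn(2, 30, 8))
print(context.shape)  # torch.Size([2, 8])
```

In a trained model the scores come from a learned layer, so the network decides which timesteps matter most for the forecast.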

Explaining the concepts behind the methodology any further is out of the scope of this article. To follow the gist, one needs a working knowledge of two domains: Financial Markets and Deep Learning.

Now let’s dive into the analysis…

Data & Exploration

The dataset is downloaded from the US Treasury website.
It has 13 columns as shown below. The first is the date and the rest are the term yields of various maturity periods for a particular date.

A sample snapshot of the dataset.
Descriptive statistics of the dataset.

The 2 Month rate is excluded from the analysis because it has very few observations: the US Government paused the issuance of 2 Month Treasury Bills for a long stretch within this period.

Below is a sample yield curve plot for the date 06/29/1992.

Sample Yield Curve

Additionally, there are missing values where no treasury rate data is available. Missing values during periods of no issuance are replaced by the corresponding mean rates; the rest are handled by interpolating from the adjacent term rates.

Sample Yield Curve after filling in missing values
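In pandas, this fill strategy can be sketched as follows. The column names and values here are hypothetical, and interpolating row-wise across adjacent maturities with a column-mean fallback is my reading of the approach described above, not the original code.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking one day's yield curve with gaps (names hypothetical).
curve = pd.DataFrame(
    {"1 Mo": [3.1], "3 Mo": [np.nan], "6 Mo": [3.4],
     "1 Yr": [np.nan], "2 Yr": [3.9]}
)

# Interpolate across adjacent maturities (row-wise), then fall back to
# the column mean for anything still missing (e.g. no-issuance stretches).
filled = curve.interpolate(axis=1, limit_direction="both")
filled = filled.fillna(curve.mean())

print(filled.loc[0, "3 Mo"])  # 3.25 -- midway between 3.1 and 3.4
```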

The horizon considered for the analysis is from 1990 to 2022. This timeline contains several adverse events, such as the Great Financial Crisis of 2008, the Covid Pandemic of 2020, and the Dot-Com Bubble of 2000. During these highly uncertain periods, the yield curve deviates from its normal upward-sloping shape.

A visualization is presented below which shows that, roughly around these events, interest rates shifted drastically, appearing as bottlenecks in the flow. During bad times, the short term interest rates are, most of the time, higher than the long term interest rates. But this cannot be considered a certain trend, as there are a plethora of other variables affecting the term structure.

Evolution of the term rates from 1990 to 2022


Data Transformation:
For building the model, we need to create the dataset: for each prediction, the model looks at the last 30 days of data to predict the next 10 days' rates. The following code block implements this windowing. The look_back and look_ahead variables control the two timeframes under consideration.

Dataset Generation for modelling
Sample X and y
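A minimal sliding-window implementation of this dataset generation might look like the following; the function name and exact slicing are my own sketch of the approach described above.

```python
import numpy as np

def make_windows(rates: np.ndarray, look_back: int = 30, look_ahead: int = 10):
    """Slice a (days, maturities) array into supervised (X, y) pairs.

    X: (samples, look_back, maturities)   previous 30 days of the curve
    y: (samples, look_ahead, maturities)  next 10 days to predict
    """
    X, y = [], []
    for start in range(len(rates) - look_back - look_ahead + 1):
        X.append(rates[start : start + look_back])
        y.append(rates[start + look_back : start + look_back + look_ahead])
    return np.array(X), np.array(y)

# Example: 100 days of 11 maturities yields 100 - 30 - 10 + 1 = 61 samples.
rates = np.random.rand(100, 11)
X, y = make_windows(rates)
print(X.shape, y.shape)  # (61, 30, 11) (61, 10, 11)
```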

Model Architecture:
The architecture I used for the Encoder-Decoder model with Attention is inspired by this snapshot.

Encoder Decoder Network with Attention
Encoder-Decoder Network with Attention. The image is taken from David Foster’s Generative Deep Learning Book by O’Reilly Media.

The Encoder and Decoder networks are identical in structure, each an LSTM 4 layers deep with 300 hidden units per cell, except for the Attention mechanism at the end of the Encoder. The Encoder's input dimension is 11 (one per maturity), which is the same as the output dimension of the Decoder.

I used Google Colab and PyTorch for building the models and training them on GPUs for parallel processing resulting in faster training times.


Code for Encoder Network with Attention
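A PyTorch sketch of such an encoder follows. The dimensions match the text (input 11, hidden 300, 4 layers), but the attention scoring layer is a plausible stand-in, not the original code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """LSTM encoder returning an attention-weighted context vector."""

    def __init__(self, input_dim=11, hidden_dim=300, num_layers=4):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers,
                            batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)  # one score per timestep

    def forward(self, x):
        outputs, (h_n, c_n) = self.lstm(x)                  # (B, T, H)
        weights = torch.softmax(self.attn(outputs), dim=1)  # (B, T, 1)
        context = (weights * outputs).sum(dim=1)            # (B, H)
        return context, (h_n, c_n)

context, state = Encoder()(torch.randn(2, 30, 11))
print(context.shape)  # torch.Size([2, 300])
```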


Code for Decoder Network
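The decoder can be sketched as below, mirroring the encoder's 300 units and 4 layers and emitting an 11-dimensional yield curve per future timestep. Feeding the context vector as every step's input is one common choice, not necessarily the author's exact design.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """LSTM decoder unrolled over look_ahead future timesteps."""

    def __init__(self, output_dim=11, hidden_dim=300, num_layers=4):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers,
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, context, state, look_ahead=10):
        # Repeat the context vector as the input at every future timestep.
        inputs = context.unsqueeze(1).repeat(1, look_ahead, 1)  # (B, T', H)
        outputs, state = self.lstm(inputs, state)
        return self.out(outputs)                                # (B, T', 11)

h0 = torch.zeros(4, 2, 300)  # (num_layers, batch, hidden)
preds = Decoder()(torch.randn(2, 300), (h0, h0.clone()))
print(preds.shape)  # torch.Size([2, 10, 11])
```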

Encoder-Decoder Wrapper:

Wrapper Code for combining Encoder and Decoder
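A self-contained sketch of such a wrapper is below. The `(context, state)` interface between the two halves is an assumption, and the tiny stand-in modules exist only to show the wiring end to end.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encode the 30-day history, then decode the 10-day forecast."""

    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x):                     # x: (batch, look_back, 11)
        context, state = self.encoder(x)      # summarise the input window
        return self.decoder(context, state)   # (batch, look_ahead, 11)

# Tiny stand-in modules just to demonstrate the wiring.
class ToyEncoder(nn.Module):
    def __init__(self, dim=11, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)

    def forward(self, x):
        out, state = self.lstm(x)
        return out.mean(dim=1), state         # mean-pooled "context"

class ToyDecoder(nn.Module):
    def __init__(self, dim=11, hidden=16, steps=10):
        super().__init__()
        self.steps = steps
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, context, state):
        inp = context.unsqueeze(1).repeat(1, self.steps, 1)
        out, _ = self.lstm(inp, state)
        return self.out(out)

model = EncoderDecoder(ToyEncoder(), ToyDecoder())
preds = model(torch.randn(2, 30, 11))
print(preds.shape)  # torch.Size([2, 10, 11])
```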

Model Parameters:
After tuning, the following final hyperparameters have been used for building the model.

Above hyperparameters were chosen for Modelling

Huber Loss is used as the objective function during training because it is less sensitive to outliers than squared error, and the delta parameter lets us control where it switches behaviour. It is a combination of both L1 and L2 loss: quadratic for small errors, linear for large ones. The Huber Loss function is given by:

Huber Loss function (Wikipedia)
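For readers viewing this without the image, the standard definition (as on Wikipedia), with residual $a = y - \hat{y}$, is:

```latex
L_\delta(a) =
\begin{cases}
\tfrac{1}{2} a^2 & \text{if } |a| \le \delta, \\
\delta \left( |a| - \tfrac{1}{2} \delta \right) & \text{otherwise.}
\end{cases}
```

Below the threshold $\delta$ it behaves like L2 loss; above it, the penalty grows only linearly, like L1 loss, which is what limits the influence of outliers.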

Here we used the Adam optimiser to update the weights from the backpropagated gradients, with a learning rate of 1e-5. Adam is the default choice in training deep neural networks nowadays.
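One training step with this loss and optimiser can be sketched as follows. The model here is a stand-in for the full seq2seq network, and `delta=1.0` is an assumed value (the tuned delta is not stated above).

```python
import torch
import torch.nn as nn

model = nn.Linear(11, 11)             # stand-in for the seq2seq model
criterion = nn.HuberLoss(delta=1.0)   # delta=1.0 is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

x, target = torch.randn(8, 11), torch.randn(8, 11)
optimizer.zero_grad()
loss = criterion(model(x), target)
loss.backward()    # backpropagate gradients of the Huber loss
optimizer.step()   # Adam weight update at lr=1e-5
```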

The training is done for 150 epochs, and the resulting training and validation losses are shown below. The training loss starts out higher than the validation loss, but as the epochs progress, both losses stabilize and converge. One likely reason for this phenomenon is the dropout regularization, which is applied only while training the model.

Training vs Validation Loss Evolution with Epochs


The Mean Absolute Error for each predicted day into the future, computed over all the data points, is presented below.

Mean Absolute Error metric for predicted days

It can be seen that the MAE, except for the first two days, keeps increasing up to the last day. This implies that the predictions weaken as we move further into the future, because of the higher degree of uncertainty involved.
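The per-day error can be computed by averaging over samples and maturities for each forecast horizon; a minimal sketch (function name my own):

```python
import numpy as np

def mae_per_day(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """MAE per forecast day, averaged over samples and maturities.

    y_true, y_pred: (samples, look_ahead, maturities)
    returns:        (look_ahead,) one error per day into the future
    """
    return np.abs(y_true - y_pred).mean(axis=(0, 2))

# Toy check: predictions off by exactly 1.0 everywhere.
errs = mae_per_day(np.zeros((2, 3, 2)), np.ones((2, 3, 2)))
print(errs)  # [1. 1. 1.]
```

Averaging over `axis=(0, 1)` instead gives the per-maturity errors reported in the next table.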

Mean Absolute Error metric for predicted term rates

The above table describes the mean absolute error for each term rate. Based on these error metrics, the model predicts the 6 Month treasury rate better than the other rates, while the 5 Year rate has the highest MAE.

Sample Predictions

The following samples demonstrate how our model is able to approximate the yield curve, sometimes even replicating the variance present in the actual curve.

Conclusion & Next Steps

In this article, we predicted the term structure of interest rates for (t+10) days based on the yield curves for (t-30) days. This was performed using a multivariate multistep LSTM based Encoder-Decoder network with the Attention mechanism. Huber Loss and Adam Optimizer have been used in training the model.

The loss results presented have an increasing trend with the predictions into the future, implying an increasing degree of uncertainty. The overall Mean Absolute Error is around 0.11 percentage points, or about 11 basis points.

This work can be further adapted to include macroeconomic variables which have sequential time-series data such as Consumer Price Index, Unemployment Rate, Gross Domestic Product and Imports & Exports data.

One ambitious step ahead would be to use Bayesian Inference based neural networks, which learn a distribution over the model's parameters rather than point estimates. Such a model could predict a distribution for each term rate on the yield curve rather than a deterministic value, which might prove to be an even better function approximator than the current model.