Creating a diversified stock portfolio using unsupervised learning


In this article I showcase the results of performing clustering analysis on S&P 500 historical daily price change data to create a diversified portfolio of stocks and back test its performance against the index. I use K-Means clustering based on Euclidean distances to understand how different parameters affect stock performance.

You can find a detailed summary of the analysis and code here:

Project Details
To perform clustering, feature vectors were created using financial ratios calculated from historical price movements. The following features were calculated from 10 years of daily historical data for all the stocks in the S&P 500 index:

  • Correlation with S&P 500 index value
  • Beta with S&P 500 index value
  • Annualized Return on equity (daily returns)
  • Annualized Volatility on equity (daily returns)
  • Sharpe Ratio
  • Daily Change in price
  • Daily Variation in price
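
As a rough illustration of how the return-based features in this list could be derived, here is a minimal sketch using only the Python standard library. It is not the code used for the analysis. The function names are hypothetical, and it assumes 252 trading days per year, simple daily returns, a risk-free rate of zero, and that correlation and beta are measured on daily returns rather than on price levels. The two price-change features are left out because they need daily open, high and low prices.

```python
import math
from statistics import mean, stdev

TRADING_DAYS = 252  # assumption: trading days per year used for annualizing


def daily_returns(prices):
    """Simple daily returns from a list of closing prices."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def covariance(xs, ys):
    """Sample covariance of two equally long series."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def stock_features(stock_prices, index_prices, risk_free_rate=0.0):
    """Build one feature vector for a stock, measured against the index."""
    r_stock = daily_returns(stock_prices)
    r_index = daily_returns(index_prices)

    cov = covariance(r_stock, r_index)
    vol_stock, vol_index = stdev(r_stock), stdev(r_index)

    ann_return = mean(r_stock) * TRADING_DAYS        # arithmetic annualization
    ann_vol = vol_stock * math.sqrt(TRADING_DAYS)    # square-root-of-time rule

    return {
        "correlation": cov / (vol_stock * vol_index),
        "beta": cov / vol_index ** 2,
        "annualized_return": ann_return,
        "annualized_volatility": ann_vol,
        "sharpe_ratio": (ann_return - risk_free_rate) / ann_vol,
    }
```

Calling `stock_features` once per stock with the same index series would give one row per stock, which is the shape a clustering step expects.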

More time-series financial data or ratios could be included to improve the clustering process.

Exploring intra feature correlation matrix using correlogram

The following plots show the correlogram built from the correlation matrix of the feature vectors. Most variables are not strongly correlated, except for annualized volatility and annualized daily variation (which is essentially the annualized mean of the change between the daily low and high prices).

Correlogram for feature vectors
Correlogram values for feature vectors
Snapshot of the data frame prepared for clustering
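
The numbers behind a correlogram like the one described above are pairwise Pearson correlations between feature columns. The sketch below shows one way to compute them and to flag strongly related pairs. The function names and the 0.8 cut-off are illustrative assumptions, not values taken from the analysis.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def correlation_matrix(features):
    """features maps a feature name to its values across all stocks."""
    names = list(features)
    return {a: {b: pearson(features[a], features[b]) for b in names} for a in names}


def strongly_correlated_pairs(matrix, threshold=0.8):
    """List each feature pair once if |correlation| reaches the threshold."""
    names = list(matrix)
    return [
        (a, b, matrix[a][b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(matrix[a][b]) >= threshold
    ]
```

A pair flagged this way, such as volatility and daily variation, carries largely overlapping information, so one of the two could be dropped before clustering if that overlap is unwanted.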

K-Means Clustering

The following results show how the optimal value of K was chosen using a scree plot, and the convex hulls of the clusters formed with K = 4. Each stock's symbol marks its relative position within its cluster.

Scree / Elbow plot. Clearly the elbow is formed at K = 4
K-Means clusters reduced to two dimensional space for visualizations
Cluster wise annualized returns vs annualized volatility
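
For readers who want to see the mechanics behind the elbow plot, the following is a small, dependency-free sketch of K-Means (Lloyd's algorithm with Euclidean distance) together with the within-cluster sum of squares for a range of K values. It makes several assumptions that the article does not confirm: features are standardized to z-scores first, and centroids are seeded with a deterministic farthest-first rule instead of the random or k-means++ seeding a library would normally use. All function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale every column to zero mean and unit variance (z-scores)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(rows, k):
    """Deterministic seeding: start at the first row, then repeatedly add
    the row farthest from all centroids chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        nxt = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(rows, k, n_iter=100):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    centroids = farthest_first(rows, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [mean(c) for c in zip(*members)]
    inertia = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, inertia


def elbow_curve(rows, ks):
    """Within-cluster sum of squares for each candidate K."""
    return {k: kmeans(rows, k)[2] for k in ks}
```

Plotting the values from `elbow_curve` against K gives a scree plot; the elbow is the point after which adding another cluster no longer reduces the sum of squares by much.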

From the above plots, we observe vital information regarding the four clusters:

  1. Cluster 4 has the best performance in the market, although that performance is variable because of certain outliers. Its stocks can be regarded as high return, low volatility.
  2. Cluster 1 has lower volatility but also lower returns. This cluster is a bit more stable and has returns and risks comparable to cluster 3. However, its stocks collectively offer lower returns for the same or lower volatility than cluster 2. The stocks in this cluster are low return, low volatility.
  3. Cluster 2 shows high risk with low returns, which discourages investors who want to be a little aggressive to make some money in the market.
  4. Cluster 3 has the highest average annualized returns and volatility. The stocks in this cluster are high risk, high return.

Back testing results
To validate the use of clustering for creating a diversified portfolio, we back tested its performance on the test/validation data. The clustering was performed on the first 7 years of data, and the remaining 3 years were used to validate the results of our portfolio. For this, two portfolios of 20 stocks each were created:

  1. Portfolio created using the top five stocks (by Sharpe ratio) from each cluster [RED]
  2. Portfolio created using the top 20 stocks out of all 500, by Sharpe ratio over the 7-year historical performance [ORANGE]
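
The two selection rules above can be sketched in a few lines. This is only an illustration: the record layout (dictionaries with `symbol`, `cluster` and `sharpe` keys) and the function names are assumptions, and any symbols passed in would be placeholders instead of the actual portfolio constituents.

```python
def top_by_sharpe(stocks, n):
    """Return the n symbols with the highest Sharpe ratio."""
    ranked = sorted(stocks, key=lambda s: s["sharpe"], reverse=True)
    return [s["symbol"] for s in ranked[:n]]


def cluster_portfolio(stocks, per_cluster=5):
    """Pick the top stocks by Sharpe ratio separately inside each cluster."""
    by_cluster = {}
    for s in stocks:
        by_cluster.setdefault(s["cluster"], []).append(s)
    portfolio = []
    for cluster in sorted(by_cluster):
        portfolio.extend(top_by_sharpe(by_cluster[cluster], per_cluster))
    return portfolio
```

With four clusters, `cluster_portfolio(stocks, per_cluster=5)` yields the 20-stock clustered portfolio, while `top_by_sharpe(stocks, 20)` yields the benchmark that ignores cluster membership. The first rule forces every cluster to be represented; the second can end up concentrated in a single cluster.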

The K-Means Portfolio has the following stocks:

The K-Means Portfolio holds stocks from the following sectors: Basic Materials, Communication Services, Consumer Cyclical, Consumer Defensive, Energy, Financial Services, Healthcare, Industrials, Real Estate, Technology.

  • We see that the orange portfolio, which is a collection of the stocks with the highest Sharpe ratios, outperformed the S&P 500 index. The portfolio formed using K-Means clustering (red line) performed better still.
  • This indicates that K-Means clustering successfully created a portfolio that is diversified across all the features used during clustering, and that outperformed not only the S&P 500 index but also a collection of the stocks with the best historical performance.
  • The back-testing results indicate that the K-Means portfolio was correlated with the index during COVID-19 and recovered more slowly than the orange portfolio. However, because it was highly diversified, the K-Means portfolio had far better long-term performance than the orange portfolio.
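
A comparison of this kind can be reproduced in outline with a simple out-of-sample back test. The sketch below assumes an equal-weighted, buy-and-hold portfolio with no rebalancing and no transaction costs, and a chronological 70/30 split to mirror the 7-year and 3-year windows. The weighting scheme is not stated in the article, so treat it as an assumption, along with the hypothetical function names.

```python
def chronological_split(prices, train_fraction=0.7):
    """Split a price series into an earlier training part and a later test part."""
    cut = int(len(prices) * train_fraction)
    return prices[:cut], prices[cut:]


def equal_weight_growth(price_series):
    """Growth of one unit of capital split equally across stocks and held.

    price_series maps a symbol to its closing prices over the test window;
    all series are assumed to cover the same dates.
    """
    series = list(price_series.values())
    n_days = len(series[0])
    return [
        sum(p[t] / p[0] for p in series) / len(series)
        for t in range(n_days)
    ]


def total_return(growth):
    """Total return over the window from a growth curve."""
    return growth[-1] / growth[0] - 1.0
```

Running `equal_weight_growth` on the test-window prices of each portfolio, and on the index itself, gives curves that can be plotted together for a comparison like the one discussed above.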

Final Observations

  • Clustering can be used to create diversified portfolios.
  • The portfolio created using the top 5 stocks from each cluster outperformed the index. To validate this, we compared its performance against the top 20 stocks by historical Sharpe ratio.
  • More time-series financial data or ratios could be included to improve the clustering process.

Improvements / Future Scope

  • This study can be implemented on all publicly traded companies across exchanges and countries.
  • There are about 2000 listed stocks in the US alone
  • There are approximately 41,000 listed companies in the world, with a combined market value of more than USD 80 trillion