Alternative Data in Financial Arbitrage



Recent advances in machine learning algorithms have made it possible to use more data and more features to predict the movement of financial markets. This paper discusses statistical arbitrage with alternative data.


Unlike traditional data, alternative data gives the analyst access to a broader range of real-time sources, such as overnight tweets and blogs. Traditional data, in contrast, is often backward-looking and infrequent, such as annual report disclosures. There is growing evidence that alternative data has a strong influence on trading. For example, Elon Musk’s tweets have caused a momentum effect on Tesla’s stock, and website reviews can predict the market’s expectations about a company. Thus, if we capture the correlation between alternative data and stock movements, we may be able to predict the trend ahead of time and earn abnormal returns.

Scenario 1

In the research paper A Portfolio Strategy Using Glassdoor’s Business Outlook Ratings, Professor Snow discusses the impact of business outlook adjustments on portfolio management. The study is based on Glassdoor datasets, which provide employees’ perception of a company’s outlook in real time, and it examines whether employees’ “insider” information can be used to form profitable investment strategies.


First, Professor Snow classified the data by the direction of the change in outlook, positive or negative, and split the long-short portfolio into improving outlook firms and worsening outlook firms. He then compared their abnormal returns with a benchmark portfolio to see whether the adjustments are timely information for investors. The data was collected daily from Glassdoor’s reviews of companies in the 2015 Fortune 1000 list, because large companies have enough employee input to make the metric less noisy. For robustness, Professor Snow then applied an overlapping rebalancing method to construct the portfolio’s constituents and compared returns by the number of days relative to the outlook adjustment, to observe whether employees take advantage of “insider” information.
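The classification step can be illustrated with a minimal sketch. The tickers, outlook scores, and returns below are hypothetical, and the sketch ignores the overlapping rebalancing and the benchmark comparison described above; it only shows how firms might be split by the direction of the outlook change and combined into an equal-weighted long-short return.

```python
from statistics import mean

# Hypothetical rows: (ticker, previous outlook, current outlook, next-period return).
# None of these values come from the paper.
observations = [
    ("AAA", 0.40, 0.55, 0.012),
    ("BBB", 0.60, 0.45, -0.008),
    ("CCC", 0.30, 0.50, 0.004),
    ("DDD", 0.70, 0.65, -0.002),
]

def long_short_return(rows):
    """Equal-weighted return of long improving firms minus short worsening firms."""
    improving = [ret for _, prev, cur, ret in rows if cur > prev]
    worsening = [ret for _, prev, cur, ret in rows if cur < prev]
    if not improving or not worsening:
        raise ValueError("need at least one firm on each side")
    return mean(improving) - mean(worsening)

spread = long_short_return(observations)
print(f"long-short return: {spread:.4f}")
```

Firms whose outlook did not change fall into neither leg here, which is one possible design choice rather than the paper's stated rule.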

The empirical results showed that, after the adjustment, the negative excess return of the improving outlook portfolio was about seven times smaller in magnitude than that of the worsening outlook portfolio. He then tested the robustness of the results under both value-weighted and equal-weighted portfolios in the descriptive statistics tables. The raw returns and t-statistics indicated that good news often reaches the market faster, while the dissemination of bad news is often delayed.


The effect also shows in the beta and alpha for rebalancing before the business outlook adjustment (BOA).

Also, the sub-portfolio of large firms outperforms both the whole sample and the subsample of small firms, which supports the earlier point that large firms are more informationally efficient (Snow, 2016). The trend in the Portfolio Value figures supports the hypothesis that going long the improving outlook portfolio and shorting the worsening outlook portfolio gains value.

To further test the portfolio’s performance, Professor Snow employed the CAPM to measure risk-adjusted returns over the eight days around the business outlook adjustment date, so that the result is free of market sensitivity. From the OLS regression results, Professor Snow concluded that the CAPM coefficient for the worsening portfolio is larger than that for the improving portfolio, which suggests that downward outlook adjustments cause a stronger reaction than upward ones (Snow, 2016).
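As an illustration of the regression step, the sketch below fits a one-factor CAPM by ordinary least squares, where alpha is the intercept (the risk-adjusted return) and beta is the slope (the market sensitivity). The return series are made up, and the function name capm_ols is an assumption for this example, not something taken from the paper.

```python
from statistics import mean

def capm_ols(portfolio_excess, market_excess):
    """Return (alpha, beta) from a single-regressor OLS fit.

    beta  = cov(market, portfolio) / var(market)
    alpha = mean(portfolio) - beta * mean(market)
    """
    mx, my = mean(market_excess), mean(portfolio_excess)
    sxx = sum((x - mx) ** 2 for x in market_excess)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(market_excess, portfolio_excess))
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta

# Hypothetical daily excess returns; the portfolio is built with
# alpha = 0.001 and beta = 1.2 so the fit can be checked by eye.
market = [0.010, -0.005, 0.007, 0.002, -0.003]
portfolio = [0.001 + 1.2 * m for m in market]

alpha, beta = capm_ols(portfolio, market)
print(f"alpha={alpha:.4f}, beta={beta:.2f}")
```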

After conducting multiple robustness tests, the paper concludes that profitable long-short opportunities exist in employees’ reviews on Glassdoor. However, no concrete relationship pattern between stock returns and reviews could be defined (Snow, 2016). One possible reason is a data limitation: the improving outlook data may be unreliable and skewed because of manipulation by employers (Snow, 2016). Another possible reason is the sample size, since Glassdoor holds reviews for over 540,000 firms. Including more firms, however, raises further problems, because not all firms have enough reviews to produce outlook adjustments. Despite these limitations, the study shows that alternative data can be significant for market prediction.

More research is needed to solve the challenges of working with alternative data, such as incorrect data input and data sparsity, especially after one-hot encoding of categorical features. Fortunately, there is a possible solution to one of these challenges. During my internship as a data analyst at Tencent Technology Co., Ltd, I had the chance to study a family of machine learning algorithms called factorization machines (FM). They are widely applied in recommender systems for targeted advertising because they allow parameter estimation under data sparsity.
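To make the sparsity problem concrete, here is a small sketch of one-hot encoding a single categorical feature. The sector labels are hypothetical; the point is only that each row ends up with a single non-zero entry, so the share of zeros grows with the number of categories.

```python
def one_hot(values):
    """One-hot encode a list of category labels into 0/1 rows."""
    categories = sorted(set(values))
    position = {c: j for j, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[position[v]] = 1  # exactly one active column per row
        rows.append(row)
    return categories, rows

# Hypothetical categorical feature.
sectors = ["tech", "energy", "retail", "tech", "health"]
categories, rows = one_hot(sectors)

# Fraction of cells in the encoded matrix that are zero.
zero_share = 1 - sum(map(sum, rows)) / (len(rows) * len(categories))
print(categories, f"zero share: {zero_share:.2f}")
```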

Factorization Machine

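Building on the linear regression model, the factorization machine adds a cross-feature (interaction) term:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j$$

The first two terms, w0 and the sum of wi xi, form the linear model (equivalent to an SVM with a linear kernel).

The last part of the equation uses matrix factorization to decompose the interaction weight wij into the dot product <vi, vj>. This breaks the independence of the interaction weights, so the model can estimate an interaction even when xi and xj are rarely or never non-zero together after one-hot encoding. In this respect it outperforms a Support Vector Machine (SVM), which needs both features to be non-zero in the same observation to learn their interaction.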

The number of columns k in the matrix V, that is, the length of each factor vector vi, is also an essential factor, because it controls how complex the modeled interactions can be. These characteristics allow the model to learn correlations between features more efficiently when predicting the target, and thus expand the potential applications of alternative data.
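A minimal sketch of a second-order FM prediction may help here: a bias w0, linear weights w, and pairwise interactions weighted by dot products of k-dimensional factor vectors (the rows of V). The parameter values are arbitrary and untrained. The sketch computes the interaction term twice, once with the direct double loop and once with the linear-time reformulation from Rendle (2010), to show that they agree.

```python
def fm_predict_naive(x, w0, w, V):
    """Second-order FM: direct sum over all feature pairs i < j."""
    n = len(x)
    out = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(V[i], V[j]))
            out += dot * x[i] * x[j]
    return out

def fm_predict(x, w0, w, V):
    """Same prediction in O(k*n): 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    n, k = len(x), len(V[0])
    out = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        out += 0.5 * (s * s - s_sq)
    return out

# Arbitrary illustrative parameters: n = 4 features, k = 2 factors.
w0 = 0.1
w = [0.2, -0.1, 0.3, 0.0]
V = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4], [0.0, 0.2]]
x = [1, 0, 1, 0]  # sparse input, e.g. two active one-hot columns

y_fast = fm_predict(x, w0, w, V)
y_naive = fm_predict_naive(x, w0, w, V)
print(y_fast, y_naive)
```

Only features 0 and 2 are active, so the single interaction that contributes is the dot product of their factor vectors; a larger k gives each vector more room to express interactions.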


For instance, when we analyze the relationship between company reviews and the stock market, we expect that positive reviews usually reflect an upward outlook and cause a momentum effect on the stock, while negative reviews do the opposite. It would also be meaningful to uncover the interrelationships between the features of positive reviews and the companies’ outlook.

Scenario 2

Inspired by the factorization machine, I wondered whether its advantages could be applied to finance. The article Financial Market Predictions with Factorization Machines: Trading The Opening Hour Based On Overnight Social Media Data discusses how FMs predict stock prices using tweet data. The paper shares similar aspects with Professor Snow’s paper, but it uses overnight Twitter data and different models.


First, the research data contains 10 million tweets about S&P 500 companies from January 2014 to December 2015, together with the associated minute-by-minute prices. The authors then divide the data into 473 overlapping study periods.

Each period covers 30 formation days and one subsequent trading day. The goal of the formation period is to determine the most suitable stocks for the trading day. To do so, the authors extract high-quality information from the tweets and transform it into a document-term matrix X, with one row per tweet and one column per stemmed term. Each tweet is then matched with the respective future return within 15 minutes after the stock market opens, which forms the target vector y. The data is fed to four models: SVM, second-order FMs, third-order FMs, and adaptive-order FMs. The stocks with the minimal error between predicted and actual returns are picked for the trading day.

The SVM only describes the naive relationship between individual terms and future returns. The FMs, in contrast, learn interactions among terms, and a higher order captures interactions among more terms: the third-order FM, for example, describes interactions among three terms. Searching for the most suitable FM order therefore becomes crucial for performance, which is why adaptive-order FMs (AFM) are introduced. The AFM works in consecutive steps: it finds the order d with the lowest error under 10-fold cross-validation, selects the best hyperparameter k for interaction complexity, and optimizes the model parameters with the Markov Chain Monte Carlo (MCMC) method, which avoids searching for additional hyperparameters such as the learning rate.

Then, on the trading day, the authors calculate the overnight return y0 of every picked stock from 4 pm to 9:30 am and subtract it from the average predicted return per tweet m from 4 pm to 9:45 am, which gives the expected return within the 15 minutes after the market opens.
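The document-term matrix described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the tweets are invented, tokens are produced by lowercasing and splitting on whitespace, and no stemming or quality filtering is applied, unlike in the paper.

```python
def document_term_matrix(tweets):
    """Build a count matrix with one row per tweet and one column per term."""
    tokenized = [t.lower().split() for t in tweets]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in tweets]
    for i, toks in enumerate(tokenized):
        for tok in toks:
            X[i][index[tok]] += 1  # term frequency within tweet i
    return vocab, X

# Hypothetical tweets; the target vector y would hold one return per row.
tweets = ["Strong earnings beat", "Earnings miss weak guidance"]
vocab, X = document_term_matrix(tweets)

print(vocab)
for row in X:
    print(row)
```

With a realistic vocabulary most entries in each row are zero, which is the sparse setting the factorization machine is designed for.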

Computing this opening-window return allows the authors to capture the effect of tweets on future stock returns. Lastly, the authors compare the return with the transaction cost and risk factors. If the expected return is larger than the transaction cost, a long order is executed, and a short order in the opposite case. For robustness, the authors test the profitability of the different models under market frictions, different time frames, and risk components.
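The trading rule can be sketched as follows. This is one reading of the description, not the authors' code: it assumes the expected 15-minute return is the predicted 4 pm to 9:45 am return minus the realized 4 pm to 9:30 am overnight return (a simple difference that ignores compounding), and it assumes a symmetric threshold for short orders. The function name and all numbers are hypothetical.

```python
def opening_signal(predicted_to_0945, overnight_to_0930, cost):
    """Return 'long', 'short', or 'flat' for the first 15 minutes of trading."""
    # Assumed definition of the expected opening-window return.
    expected = predicted_to_0945 - overnight_to_0930
    if expected > cost:
        return "long"
    if expected < -cost:   # assumed mirror-image rule for the short side
        return "short"
    return "flat"          # expected move does not cover the transaction cost

# Hypothetical inputs, expressed as decimal returns.
print(opening_signal(0.006, 0.002, 0.001))   # expected +0.004
print(opening_signal(0.001, 0.004, 0.001))   # expected -0.003
print(opening_signal(0.0025, 0.002, 0.001))  # expected +0.0005
```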


All the tests show that the AFM performs more robustly than the other models. Its average daily return surpasses that of the best of one million random bootstrap trading runs by 0.04 percent (Stubinger et al., 2018). The authors thus demonstrate the efficiency of their strategy and observe that increasing the complexity of the FM leads to a higher return.


Even though the AFM outperforms the other models, its accuracy is only 61.76%, which means there is still room for applications of alternative data to improve. It will therefore be exciting to see more factors considered in machine learning innovations with alternative data.


· Snow, D. (n.d.). B. alternative data. Retrieved April 18, 2022, from

· Snow, D. (2020, January 21). A portfolio strategy using Glassdoor’s business outlook ratings. SSRN. Retrieved April 18, 2022, from

· Rendle, S. (2010). Factorization machines. Retrieved April 18, 2022, from

· Stubinger, J., Walter, D., & Knoll, J. (2018). Financial market predictions with factorization machines: Trading the opening hour based on overnight social media data. Retrieved April 18, 2022, from