Data Collection

Financial data was collected in the form of forex indices and primary stock market indices from the 17 countries identified in the Asia Pacific region over the 10-year period from 2010 to 2019. The data was obtained at daily frequency in OHLCV (open, high, low, close, volume) format, consistently across the selected time period.

Social media data was collected from the popular social media platforms Twitter and Reddit. On Twitter, hashtags for the names of the countries, their capitals and their world leaders were scraped across the 10-year period. On Reddit, the subreddits for the names of the countries and their capitals were scraped.

Data Preprocessing and Feature Engineering

Financial data was preprocessed by deriving new variables from the collected data, generating additional features that might be useful for prediction.

Social media data was passed through a BERT-based sentiment analysis network to generate a sentiment score for each individual post. These scores were aggregated into daily statistics to be used in the modelling approach.
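
The daily aggregation of per-post sentiment scores could be sketched as below. This is a minimal illustration, not the project's actual code: the function name `aggregate_daily`, the choice of statistics (count, mean, standard deviation, share of positive posts) and the score range of -1 to 1 are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def aggregate_daily(posts):
    """Group (day, score) pairs by day and compute summary statistics."""
    by_day = defaultdict(list)
    for day, score in posts:
        by_day[day].append(score)
    return {
        day: {
            "count": len(scores),
            "mean": mean(scores),
            "std": pstdev(scores),  # population std; 0.0 for a single post
            "share_positive": sum(s > 0 for s in scores) / len(scores),
        }
        for day, scores in sorted(by_day.items())
    }


# Hypothetical per-post scores; real scores would come from the sentiment model.
posts = [
    (date(2019, 1, 2), 0.5),
    (date(2019, 1, 2), -0.5),
    (date(2019, 1, 3), 1.0),
]
daily = aggregate_daily(posts)
```

Each day then contributes one row of sentiment features that can be joined to the financial data by date.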

Exploratory Data Analysis

Exploratory data analysis included tests for seasonality and stationarity. Because the raw values were not stationary, the variables were converted to their returns for use in the machine learning models. Periodic seasonality was exploited in the time series models.
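
As an illustration of the conversion from raw values to returns, the sketch below computes both simple and logarithmic returns from a series of closing prices. The text does not say which definition was used, so both are shown as assumptions, and the function names are hypothetical.

```python
import math


def simple_returns(prices):
    """Fractional change between consecutive prices: (p_t - p_{t-1}) / p_{t-1}."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def log_returns(prices):
    """Logarithmic change between consecutive prices: ln(p_t / p_{t-1})."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical closing prices for three consecutive days.
closes = [100.0, 110.0, 99.0]
simple = simple_returns(closes)
logs = log_returns(closes)
```

Either series has one fewer element than the price series, since the first day has no previous value to compare against.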

Correlation analysis was carried out on each series against itself and against other currencies and markets. The international indices with the strongest correlation scores were used as cross-domain features to support the aim of the project.
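
One way to select cross-domain features by correlation strength is sketched below. It assumes Pearson correlation on return series and ranks candidates by absolute value. The report does not state which correlation measure or ranking rule was used, and the names `pearson`, `top_correlated` and the sample series are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def top_correlated(target, candidates, k):
    """Names of the k candidate series most strongly correlated with target."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]


# Hypothetical daily returns for a target index and three foreign indices.
target = [0.01, -0.02, 0.03, -0.01]
candidates = {
    "index_a": [0.02, -0.04, 0.06, -0.02],   # moves with the target
    "index_b": [0.01, 0.01, -0.01, 0.0],     # weaker relationship
    "index_c": [-0.01, 0.02, -0.03, 0.01],   # moves against the target
}
selected = top_correlated(target, candidates, k=2)
```

Ranking by absolute value keeps strongly negative relationships, which can be as informative to a model as positive ones.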

Experimentation

Experiments were conducted using different types of machine learning and classical time series models to regress the values of the returns. Because regression did not give satisfactory results, the focus shifted to classification, first with binned returns and then on a binary scale, where satisfactory results were seen.

The experimentation phase led to important results regarding the modelling approach. The indices with better results, along with their cross-domain features, were selected for optimization. PCA proved helpful in reducing dimensionality, while the sentiment scores did not provide any new information to help predictions.
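
The two classification targets tried in the experiments, binned returns and a binary up/down label, could be derived from a return series roughly as follows. The bin edges and the rule that a zero return counts as "down" are assumptions for illustration, as are the function names.

```python
from bisect import bisect_right


def binned_labels(returns, edges):
    """Index of the bin each return falls in; edges must be sorted ascending."""
    return [bisect_right(edges, r) for r in returns]


def binary_labels(returns):
    """1 for a positive return, 0 otherwise."""
    return [1 if r > 0 else 0 for r in returns]


# Hypothetical daily returns and bin edges at -1% and +1%.
returns = [-0.02, 0.001, 0.015, 0.0]
bins = binned_labels(returns, edges=[-0.01, 0.01])
updown = binary_labels(returns)
```

With two edges the binned target has three classes (down, flat, up), while the binary target collapses these into a single direction call.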

Optimization

The optimization phase involved selecting the best hyperparameters for the selected currencies and models. The candidate hyperparameter values were iterated over, and the best-performing combination was chosen.
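
Iterating over an array of hyperparameters amounts to an exhaustive grid search, sketched below. The parameter names, the candidate values and the `toy_score` function are placeholders; in practice the score would be a validation metric from fitting a model with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for validation accuracy; peaks at max_depth=4, learning_rate=0.1."""
    return 1.0 - 0.1 * abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(grid, toy_score)
```

The number of evaluations grows as the product of the list lengths, so the grid is best kept small for each model.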

A walk-forward approach gave better results than the traditional train-test split. In addition, ensemble voting was applied over the best models for each index to refine the results further and achieve higher accuracy.
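
A minimal sketch of walk-forward splitting and hard (majority) voting follows. It assumes a rolling training window of fixed size and an odd number of binary classifiers. The report does not specify the window scheme or the voting rule, and the function names are hypothetical.

```python
from collections import Counter


def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train, test) index ranges, sliding forward by test_size each step."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


def majority_vote(predictions):
    """Combine several models' label lists into one list by majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


splits = [(list(tr), list(te)) for tr, te in walk_forward_splits(10, 4, 2)]

# Hypothetical predictions from three models for three days.
model_preds = [[1, 0, 0], [1, 1, 0], [0, 1, 0]]
ensemble = majority_vote(model_preds)
```

Unlike a single train-test split, every test window here is predicted by a model trained only on the observations immediately before it, which respects the time ordering of the data.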

Backtesting

Several backtesting algorithms were considered, and the pros and cons of each were weighed to determine an appropriate trading strategy for testing our predictions against the market. An intraday trading strategy was selected for this approach.

Under this strategy, both a buy and a sell are made every day. The position is long or short depending on the prediction for the following day. Across the selected indices, consistently good performance with positive equity returns was seen over the backtesting period.
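
The daily long-or-short strategy can be illustrated with a simplified equity-curve calculation. This sketch assumes the whole equity is invested each day, that the position is opened and closed within the day, and that there are no transaction costs or slippage. None of these details are stated in the report, and `backtest` is a hypothetical helper.

```python
def backtest(predictions, day_returns, initial_equity=1.0):
    """Return the equity curve from going long (1) or short (0) each day."""
    equity = initial_equity
    curve = [equity]
    for pred, ret in zip(predictions, day_returns):
        position = 1 if pred == 1 else -1  # long gains on up days, short on down days
        equity *= 1 + position * ret
        curve.append(equity)
    return curve


# Hypothetical predictions and realised returns for three trading days.
preds = [1, 0, 1]
rets = [0.01, -0.02, -0.01]
curve = backtest(preds, rets)
total_return = curve[-1] / curve[0] - 1
```

In this toy run the first two predictions are correct and the third is wrong, so the curve rises twice and then gives back part of the gain.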