In this project we used an encoder-decoder Transformer architecture to predict the Bitcoin value over a chosen future horizon. Our raw data holds almost one year of per-minute Bitcoin prices (closing, opening, etc.). We extracted additional statistics from the data using FinTA, a library of common financial technical indicators, while making sure the added features have low correlation with one another. We then fed the data to the model and trained it to predict the chosen future horizon from past values. We optimized the model's hyperparameters using Optuna.
Bitcoin prediction using RNN:
https://www.kaggle.com/muharremyasar/btc-historical-with-rnn
IBM stock price prediction using Transformer-Encoder:
Short term stock price prediction using LSTM with a simple trading bot:
https://github.com/roeeben/Stock-Price-Prediction-With-a-Bot/blob/main/README.md
We are using Bitcoin historical one-minute records from (UTC+8): 2021-01-01 00:00:00 - 2021-12-05 23:59:00, containing 488,160 records from Okex Exchange. We got it from: https://www.kaggle.com/aipeli/btcusdt and it can also be found in our repository here.
Apart from the timestamp, the data contains 5 features: the opening price, highest price, lowest price, closing price, and transaction volume per minute. We calculated the correlations between the features and noticed that the first 4 (the prices) are highly correlated with one another, so we wanted to add more meaningful features before handing the data to the model. For that, we used FinTA, which implements common financial technical indicators in Pandas. We kept only the features that have low correlation with all the others and made sure they all use only past samples (so we won't accidentally use the future). After choosing them, we removed the rows containing NaNs and ended up with a total of 34 features and 488,029 samples (the first 131 samples were lost).
After that we split the data into train (80%), validation (10%), and test (10%), in chronological order as can be seen here:
The train data is then scaled, and the validation and test sets are scaled with the same scaler.
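The split and scaling steps can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code: it assumes the scaler statistics come from the train set only and uses min-max scaling, and the helper names `chronological_split`, `fit_minmax`, and `apply_minmax` are hypothetical. In practice a library scaler such as scikit-learn's `MinMaxScaler` or `StandardScaler` would typically fill this role.

```python
def chronological_split(rows, train_frac=0.8, val_frac=0.1):
    """Split rows in time order into train / validation / test lists."""
    n = len(rows)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def fit_minmax(train_rows):
    """Per-feature minimum and maximum, computed on the training rows only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_minmax(rows, mins, maxs):
    """Scale rows with previously fitted statistics (values may leave [0, 1])."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Toy series with two features per time step (illustrative values only).
rows = [[float(t), 2.0 * t + 1.0] for t in range(20)]
train, val, test = chronological_split(rows)
mins, maxs = fit_minmax(train)          # statistics from the train set only
train_s = apply_minmax(train, mins, maxs)
val_s = apply_minmax(val, mins, maxs)   # reuse the train statistics
test_s = apply_minmax(test, mins, maxs)
print(len(train), len(val), len(test))  # 16 2 2
```

Because the later splits are scaled with train statistics, their values can fall outside [0, 1] when the price moves beyond the range seen during training.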
Finally, we divided the train set into tensors of large sequential batches. During training, we sample from each batch a sequence of length `bptt_src` to use as the source and a sequence of length `bptt_tgt` to use as the target. To create more diverse data, we can start sampling from a random start point in each epoch by setting the flag `random_start_point` to True.
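One way to picture the random start point is as an offset applied to the window positions of an epoch. The sketch below is an assumption about how this could work, not the notebook's implementation: the function name `epoch_start_indices` is hypothetical, and it assumes consecutive source windows are `bptt_src` steps apart.

```python
import random


def epoch_start_indices(num_steps, bptt_src, bptt_tgt, overlap,
                        random_start_point=False, rng=random):
    """Return the first index of every source window for one epoch.

    Assumption: consecutive source windows are `bptt_src` steps apart.
    """
    # A random offset within the first bptt_src samples shifts every window.
    offset = rng.randrange(bptt_src) if random_start_point else 0
    span = bptt_src + bptt_tgt - overlap  # steps covered by source + target
    return list(range(offset, num_steps - span + 1, bptt_src))


print(epoch_start_indices(50, bptt_src=10, bptt_tgt=2, overlap=1))
# [0, 10, 20, 30]
```

With `random_start_point=True` the windows keep the same spacing but begin at a different offset each epoch, so the model sees differently aligned sequences.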
We used PyTorch nn.Transformer as the basis of our model. We placed a time embedding layer before both the encoder and the decoder, and a linear layer at the output of the decoder.
In the time embedding layer, we implement a version of Time2Vec. We add more features to the data in 2 ways:
- Periodic features, implemented as a linear layer followed by a sine activation - a total of `periodic_features` features.
- Linear features, implemented as a linear layer.

Both kinds of features are concatenated to the existing ones, creating a total of `out_features` at the output.

The linear layer before the output is used to output the same number of features as the target - `in_features`.
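The time embedding described above could look roughly like the following PyTorch module. This is a sketch under assumptions, not the project's actual layer: the class name `TimeEmbedding` is hypothetical, the number of linear features is inferred as `out_features - in_features - periodic_features`, and `periodic_features=10` is an illustrative value.

```python
import torch
import torch.nn as nn


class TimeEmbedding(nn.Module):
    """Time2Vec-style embedding: keeps the input features and appends
    learned periodic (sine) and linear features."""

    def __init__(self, in_features: int, periodic_features: int, out_features: int):
        super().__init__()
        # Assumption: the remaining output slots are filled by linear features.
        linear_features = out_features - in_features - periodic_features
        if linear_features <= 0:
            raise ValueError("out_features must exceed in_features + periodic_features")
        self.periodic = nn.Linear(in_features, periodic_features)
        self.linear = nn.Linear(in_features, linear_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        periodic = torch.sin(self.periodic(x))  # linear layer followed by sine
        linear = self.linear(x)                 # plain linear layer
        return torch.cat([x, periodic, linear], dim=-1)


embed = TimeEmbedding(in_features=34, periodic_features=10, out_features=60)
x = torch.randn(10, 32, 34)  # (sequence length, batch size, features)
print(embed(x).shape)        # torch.Size([10, 32, 60])
```

The output width matches `out_features`, which is the model dimension the transformer then operates on.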
The model structure:
- `num_features` = int, number of features to choose from the full set (1 - 34)
- `scaler` = str, the kind of scaler used to scale the data (Standard Scaler - 'standard', Min Max Scaler - 'minmax')
- `train_batch_size` = int, size of a train batch
- `eval_batch_size` = int, size of a validation/test batch
- `epochs` = int, number of epochs to run the training
- `bptt_src` = int, the length of the source sequence
- `bptt_tgt` = int, the length of the target sequence
- `overlap` = int, number of overlapping samples between the source and the target
- `num_encoder_layers` = int, number of encoder layers in the transformer
- `num_decoder_layers` = int, number of decoder layers in the transformer
- `periodic_features` = int, number of periodic features to add in the time embedding layer
- `out_features` = int, number of output features after the time embedding layer (> `in_features + periodic_features`)
- `nhead` = int, number of heads in the multi-head attention layers in the transformer (both encoder and decoder; must be a divisor of `out_features`)
- `dim_feedforward` = int, dimension of the feed-forward layers in the transformer (both encoder and decoder)
- `dropout` = float, the dropout probability of the dropout layers in the model (0.0 - 1.0)
- `activation` = str, activation function to use in the transformer (ReLU - 'relu', GELU - 'gelu')
- `random_start_point` = bool, start each epoch from a random start point within the first `bptt_src` samples
- `clip_param` = float, the max norm of the gradients used for gradient clipping (clip_grad_norm)
- `lr` = float, starting learning rate
- `gamma` = float, multiplicative factor of learning rate decay (0.0 - 1.0)
- `step_size` = int, period of learning rate decay in epochs
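Several of the parameters above constrain one another. A small check like the one below can catch inconsistent settings before training starts. It is only a sketch: the function name `check_hyperparameters` is hypothetical, and `periodic_features=10` and the `nhead` values in the usage lines are illustrative.

```python
def check_hyperparameters(in_features, periodic_features, out_features,
                          nhead, dropout):
    """Return a list of violated constraints (empty when settings are consistent)."""
    problems = []
    if out_features <= in_features + periodic_features:
        problems.append("out_features must be > in_features + periodic_features")
    if out_features % nhead != 0:
        problems.append("nhead must divide out_features")
    if not 0.0 <= dropout <= 1.0:
        problems.append("dropout must lie in [0.0, 1.0]")
    return problems


print(check_hyperparameters(34, 10, 60, nhead=15, dropout=0.0))  # []
print(check_hyperparameters(34, 10, 60, nhead=8, dropout=0.0))
# ['nhead must divide out_features']
```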
The most crucial thing to understand here is the relation between `bptt_src`, `bptt_tgt`, and `overlap`. We use `bptt_src` past samples to predict the following `bptt_tgt - overlap` samples.
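The relation between the three parameters can be illustrated by slicing a toy series. This sketch reflects our reading of the description: the target window starts `overlap` samples before the end of the source. The helper name `make_windows` is hypothetical.

```python
def make_windows(series, start, bptt_src, bptt_tgt, overlap):
    """Slice one (source, target) pair beginning at `start`."""
    source = series[start:start + bptt_src]
    # The target begins `overlap` samples before the source ends.
    tgt_start = start + bptt_src - overlap
    target = series[tgt_start:tgt_start + bptt_tgt]
    return source, target


series = list(range(100, 120))  # stand-in for 20 consecutive minutes
source, target = make_windows(series, start=0, bptt_src=10, bptt_tgt=2, overlap=1)
print(source[-1], target)  # 109 [109, 110]
```

Here the first target sample (109) is shared with the end of the source, and the remaining `bptt_tgt - overlap = 1` sample (110) is the value to be predicted.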
We used Optuna to find the optimal hyperparameters in terms of the validation loss.
We fixed or constrained some of the hyperparameters by using the knowledge we gained during the manual tuning, to make runtime more reasonable:
Hyperparameter | Value |
---|---|
`num_features` | 34 |
`train_batch_size` | 32 |
`eval_batch_size` | 32 |
`epochs` | 50 |
`overlap` | 1 |
`num_decoder_layers` | `num_encoder_layers` |
`periodic_features` | (out_features - num_features // 10) x 4 + 2 |
`nhead` | out_features / 4 |
`step_size` | 1 |
`lr` | 0.5 |
`gamma` | 0.95 |
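To get a feel for the fixed `lr`, `gamma`, and `step_size` values, the learning rate per epoch can be computed directly. The sketch below assumes a step decay of the kind PyTorch's `StepLR` scheduler applies (the learning rate is multiplied by `gamma` every `step_size` epochs). The scheduler actually used in the notebook is not stated here, so treat this as an assumption.

```python
def learning_rate(epoch, lr=0.5, gamma=0.95, step_size=1):
    """Step decay: multiply `lr` by `gamma` once every `step_size` epochs."""
    return lr * gamma ** (epoch // step_size)


for epoch in (0, 1, 10, 49):
    print(epoch, round(learning_rate(epoch), 4))
```

With `step_size` = 1 the rate shrinks by 5% after every epoch, so it ends the 50 epochs far below its starting value.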
For the other hyperparameters, we chose the range of possible values to optimize over.
These are the hyperparameters that were chosen:
Hyperparameter | Value |
---|---|
`scaler` | 'minmax' |
`bptt_src` | 10 |
`bptt_tgt` | 6 |
`num_encoder_layers` | 4 |
`out_features` | 60 |
`dim_feedforward` | 384 |
`dropout` | 0.0 |
`random_start_point` | False |
`clip_param` | 0.75 |
The impact of these hyperparameters on the loss is visualized here:
The most important one is the `scaler`.
We also saw that `bptt_src` and `bptt_tgt` were not as important as we thought they would be.
After our final fine-tuning, we only changed `bptt_tgt` from 6, as suggested by Optuna, to 2.
The full analysis by Optuna can be found in bitcoin_price_prediction_optuna.ipynb
We trained the model with the hyperparameters above.
Model statistics:
After this training we checked the real-time performance of the model on the test set: we fed the first 10 samples as the source (`bptt_src` = 10) and the 10th sample of this sequence as the target (`overlap` = 1), and predicted the next value (`bptt_tgt` - `overlap` = 1). We then shifted the source samples by one and predicted the next value in the same way. We repeated the process until we had a prediction for every possible minute in the test set. You can see the result here:
We can see that the general trend of the prediction is similar to the real one. That is not surprising: we are looking at a large span of 48,802 minutes, while each prediction is based only on the last 10 samples, so at this scale we expect the real and predicted values to stay close to each other. For a better analysis we need to look closer, so here is a zoomed-in view:
Here we can see the differences between the real and predicted values. The trends are still somewhat similar, but sometimes the model predicts a rise or fall before it happens, for example the rise around minute 42,560 or the small fall at minute 42,580, and sometimes it just follows the existing trend, as around minute 42,600.
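The rolling evaluation on the test set can be sketched as a simple loop. This is an illustration only: `rolling_forecast` and `predict_next` are hypothetical names, and the `persistence` function is a stand-in that repeats the last observed value instead of calling the trained transformer.

```python
def rolling_forecast(series, predict_next, bptt_src=10):
    """Predict every step after the first `bptt_src` samples, one at a time."""
    predictions = []
    for end in range(bptt_src, len(series)):
        source = series[end - bptt_src:end]  # last bptt_src real samples
        decoder_input = source[-1:]          # overlap = 1: last source sample
        predictions.append(predict_next(source, decoder_input))
    return predictions


def persistence(source, decoder_input):
    """Stand-in model: repeat the last observed value."""
    return decoder_input[-1]


prices = [100.0 + 0.5 * t for t in range(15)]
preds = rolling_forecast(prices, persistence)
print(len(preds), preds[0])  # 5 104.5
```

Each prediction uses only real past samples, never earlier predictions, which matches the shift-by-one procedure described above. Comparing the model against a persistence baseline like this one is also a quick way to check whether it does more than follow the existing trend.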
To retrain the model, run bitcoin_price_prediction.ipynb after choosing your hyperparameters in the first cell. Setting the flag `plot_data_process` to False hides all the data processing images.
If you would like to do further hyperparameter tuning using Optuna, run bitcoin_price_prediction_optuna.ipynb. In the `define_model` function we declared the values of the fixed or constrained hyperparameters, and in the `objective` function we declared the hyperparameters we want to tune along with their ranges.
Folder | File name | Purpose |
---|---|---|
code | bitcoin_price_prediction.ipynb | Notebook which includes all data processing, training, and inference |
code | bitcoin_price_prediction_optuna.ipynb | Optuna hyperparameter tuning |
data | okex_btcusdt_kline_1m.csv.zip | Zip file containing the data we used in this project |
images | Data_Separation.png | Image that shows our train-validation-test split |
images | Model_Structure.png | Image that shows our model architecture |
images | Optuna_Result.jpeg | Image that shows the importance of the hyperparameters produced by Optuna |
images | Test_Prediction.png | Image that shows our result on the test set |
images | Test_Presiction_Zoom_In.png | Image that shows our result on the test set - zoomed-in |
images | presentation_preview.gif | Gif showing a preview of the project presentation |
The work we presented here achieved good results, but there are definitely aspects to improve and examine, such as:
- Try running the model on a different asset, for example a stock.
- Examine the feature extraction process and check which features are the most helpful.
- Tune the hyperparameters further and relax the constraints we put on some of them.
- Check the performance in real-time trading (better to start with a trading bot on the test set).
Hope this was helpful and please let us know if you have any comments on this work: