
engdeep / cmpt-419


This project is a fork of alik604/wallstreetbets_lstm.


Undergraduate section of CMPT 419/726 Machine Learning: theoretical justification for, and practical application of, machine learning algorithms.

Home Page: https://coursys.sfu.ca/2020sp-cmpt-726-x1/pages/

Jupyter Notebook 100.00%

cmpt-419's Introduction


About The Project

The simultaneous rise of social media and discount brokerages has spurred the proliferation of online communities dedicated to investing and trading on the stock market. The most popular of these communities is the subreddit r/wallstreetbets (WSB). With over 1 million active subscribers, it is the fourth most popular subreddit at this time. Bloomberg Businessweek recently featured the subreddit in an article describing the message board's ability to reshape the options market and spark stock rallies. With such a large following and arguable influence, the question arises as to whether sentiment on the site can be used as a predictor of stock performance.

Data

Stock Data: Wharton Research Data Services (WRDS), a division of the Wharton School of the University of Pennsylvania, is a highly regarded source for academic research and has a dedicated quality-control analyst at the NYSE. Once granted access to their database, we created an hourly stock price dataset from their consolidated Millisecond Trade and Quote (TAQ) dataset. Our dataset consists of hourly stock prices for Amazon, Boeing, SPY (the S&P 500 ETF), and Tesla from Jan-01-2017 to Nov-30-2019. Tickers were selected by the number of WSB mentions in 2019, with Amazon and SPY mentioned far more often than the others.

r/wallstreetbets data from Reddit: Pushshift is the largest publicly available Reddit dataset, containing both comments and submissions. We use the Pushshift comments dataset for r/wallstreetbets from Jan-01-2017 to Nov-30-2019.

GloVe - Global Vectors for Word Representation: GloVe is an unsupervised learning algorithm for obtaining word vector representations from a corpus. The GloVe project provides several publicly available pre-trained word vector datasets. We chose to use the Magnitude versions of GloVe Twitter (2 billion tweets containing 27 billion tokens, with a vocabulary size of 1.2 million) for 25-dimensional word embeddings and GloVe Common Crawl (a web archive containing 840 billion tokens, with a vocabulary size of 2.2 million) for 300-dimensional word embeddings. Link: https://nlp.stanford.edu/projects/glove/

Methods

We embedded each comment by averaging the word vectors of its constituent words to form a document vector. This produced two embeddings per comment, a 25-dimensional embedding and a 300-dimensional embedding, using the respective GloVe datasets.
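As a rough sketch of this averaging step (the Magnitude file name and the simple whitespace tokenizer are assumptions, not the exact code used in the notebooks):

import numpy as np
from pymagnitude import Magnitude  # pre-trained GloVe vectors in Magnitude format

glove = Magnitude("glove.twitter.27B.25d.magnitude")  # assumed file name, 25-d GloVe Twitter vectors

def embed_comment(text, dim=25):
    """Average the word vectors of a comment to form a document vector."""
    tokens = text.lower().split()  # simple whitespace tokenizer
    if not tokens:
        return np.zeros(dim)
    # Magnitude returns a vector for every token, including out-of-vocabulary words.
    return np.mean(glove.query(tokens), axis=0)

doc_vec = embed_comment("TSLA calls printing today")  # 25-dimensional document vector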

This CPU-bound task alone takes several days on a single machine for one successful run. To speed up the computation, we parallelized the task across ~60 "idle" Computing Science Instructional Laboratory (CSIL) machines with 12 CPU cores each. In addition to the standard document vectors, we appended the score, gildings, and word count of each post as additional features.
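A minimal sketch of how the embedding job might be sharded across machines and how the metadata features could be appended (the machine count comes from the description above; the file layout, column names, and round-robin split are assumptions):

import sys
import pandas as pd

from embed import embed_comment  # the averaging function from the previous sketch (hypothetical module layout)

N_MACHINES = 60  # approximate number of CSIL machines used

def process_shard(comments: pd.DataFrame, machine_id: int) -> pd.DataFrame:
    """Embed the slice of comments assigned to one machine and append metadata features."""
    shard = comments.iloc[machine_id::N_MACHINES]  # round-robin split across machines
    features = pd.DataFrame([embed_comment(t) for t in shard["body"]], index=shard.index)
    features["score"] = shard["score"]            # post score
    features["gildings"] = shard["gildings"]      # number of awards
    features["word_count"] = shard["body"].str.split().str.len()
    return features

if __name__ == "__main__":
    machine_id = int(sys.argv[1])                           # 0..59, one per machine
    comments = pd.read_csv("wallstreetbets_comments.csv")   # assumed input file
    process_shard(comments, machine_id).to_csv(f"shard_{machine_id:02d}.csv")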

The document vectors were then aggregated over hours and over days separately, creating two final GloVe datasets, which were then joined with our stock ticker prices.
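The aggregation and join can be sketched with pandas (file and column names are assumptions):

import pandas as pd

# Embedded comments: a timestamp column plus the document-vector and metadata features.
comments = pd.read_csv("embedded_comments.csv", parse_dates=["created_utc"]).set_index("created_utc")

# Average all comment features within each hour and within each day.
hourly = comments.resample("1H").mean()
daily = comments.resample("1D").mean()

# Hourly stock prices built from the WRDS TAQ data, indexed by timestamp.
prices = pd.read_csv("hourly_prices.csv", parse_dates=["timestamp"], index_col="timestamp")

# Join sentiment features with prices on the shared time index.
hourly_dataset = hourly.join(prices, how="inner")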

Results

From our investigation, we found that time series forecasting is aided by increased granularity of the data. The LSTM architectures we tested benefited both from finer-grained stock price aggregation and from a shorter sequence length. The latter is particularly interesting because many technical indicators use a sequence length of about 30 time steps. As a baseline comparison, we found our LSTM to be competitive against the non-neural-network models we evaluated.
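For reference, a minimal Keras sketch of the kind of LSTM regressor described here, with the sequence length exposed as the knob that mattered most (the layer sizes, feature count, and training setup are illustrative assumptions, not the exact configuration used):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_LEN = 8      # shorter sequences worked better than the ~30 steps common in technical analysis
N_FEATURES = 29  # e.g. 25-d document vector + score, gildings, word count, price

model = Sequential([
    LSTM(64, input_shape=(SEQ_LEN, N_FEATURES)),
    Dense(1),    # next-period price (or return) regression target
])
model.compile(optimizer="adam", loss="mse")

# X: (n_samples, SEQ_LEN, N_FEATURES) sliding windows; y: (n_samples,) next-step targets.
X = np.random.rand(128, SEQ_LEN, N_FEATURES).astype("float32")
y = np.random.rand(128).astype("float32")
model.fit(X, y, epochs=2, batch_size=32)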

Future work

In terms of additional improvements that could be made directly to our work, there is benefit to be gained from both more tuning and more data. On the tuning side, more hyperparameter optimization could be conducted, and a more powerful model such as an LSTM with attention could be employed. On the data side, additional features such as technical indicators, namely the relative strength index, as well as key security features such as dividends, volume, and business sector, could all be beneficial.
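For example, the relative strength index could be added as a feature with a few lines of pandas (this is a simple moving-average variant of RSI; the 14-period window is the conventional choice, and the price column name is an assumption):

import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative strength index: RSI = 100 - 100 / (1 + average gain / average loss)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    return 100 - 100 / (1 + gain / loss)

# hourly_dataset["rsi"] = rsi(hourly_dataset["close"])  # assuming a "close" price column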

Averaging the constituent word vectors of a document to form a document vector is a primitive approach that loses information about word order and importance. Future work should explore a trainable neural document-vector model such as Doc2Vec, which extends word vectors to documents in a more sophisticated fashion.
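A minimal gensim sketch of the Doc2Vec alternative (the toy corpus and hyperparameters are illustrative only):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

comments = ["TSLA calls printing today", "SPY puts are free money"]  # toy examples
corpus = [TaggedDocument(words=c.lower().split(), tags=[i]) for i, c in enumerate(comments)]

model = Doc2Vec(vector_size=300, window=5, min_count=1, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

doc_vec = model.infer_vector("BA to the moon".lower().split())  # 300-d document vector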

A future document model could be trained specifically on finance-related news and social media and could incorporate special handling of stock tickers such as “TSLA.” Such a finance-specialized model could provide better results.

Built With

Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

The notebooks depend on the common Python machine-learning packages.

  • Install dependencies (common ML packages; an example is shown below)
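For example (the exact package list is an assumption based on the methods described above):

pip install numpy pandas scikit-learn tensorflow pymagnitude jupyter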

Running

  1. Clone the repo
git clone https://github.com/alik604/wallstreetbets_lstm.git
  2. Open the notebook
jupyter notebook

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Khizr Ali Pardhan - Email - @alik604 LinkedIn

David Pham - Email

Winfield Chen

