
christopherliew / crypto-uncertainty-index


End-to-end natural language processing, machine learning, and data engineering pipelines for a cryptocurrency uncertainty index built from social media text.

Python 97.70% Makefile 1.37% SQL 0.30% Erlang 0.63%
cryptocurrency data-engineering econometrics elasticsearch kibana machine-learning natural-language-processing transformer-models

crypto-uncertainty-index's Introduction

Hey there, I'm Christopher 👨🏻‍💻

I am a software engineer whose interests and passions lie in databases, distributed systems, big data processing, and natural language processing. You can find my passion, learning, and school projects messily dumped here (cos that's what repositories are for, right?), though they're due for some committed progress and refactoring (tech debt is real 😳). There are also messily strewn, half-finished notes here on tech-related readings and exploits.

crypto-uncertainty-index's People

Contributors

christopherliew

crypto-uncertainty-index's Issues

Refactor Pipelines

Naming conventions and functional abstraction need to be tidied up:

  • Refactor dataclass to pydantic
  • Refactor data engineering pipelines
  • Add in Rich Tables for Pipelines
  • Add in relevant Utils
  • Abstract where possible
  • Refactor CLI
  • Add in help
  • Fix Poetry integration

Pull reddit data from pushshift.io

  • Code up extraction script and utilities
  • Build & Test Data Extraction and Loading Pipelines
  • Extract and load all crypto data
  • Do up data dictionary
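
As a rough illustration of the extraction utilities listed above, the sketch below splits a date range into daily windows and builds one query URL per window. The endpoint URL, the `subreddit` / `after` / `before` / `size` parameters, the helper names and the example subreddit are all assumptions made for illustration, not taken from the repository; check them against the current pushshift.io service before relying on them.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple
from urllib.parse import urlencode

# Assumed comment-search endpoint; not confirmed by the repository.
BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def daily_windows(start: datetime, end: datetime) -> Iterator[Tuple[int, int]]:
    """Yield (after, before) epoch-second pairs covering [start, end) day by day."""
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=1), end)
        yield int(cursor.timestamp()), int(upper.timestamp())
        cursor = upper


def build_query_url(subreddit: str, after: int, before: int, size: int = 100) -> str:
    """Build a query URL for one time window (hypothetical helper)."""
    params = {"subreddit": subreddit, "after": after, "before": before, "size": size}
    return f"{BASE_URL}?{urlencode(params)}"


start = datetime(2021, 1, 1, tzinfo=timezone.utc)
end = datetime(2021, 1, 4, tzinfo=timezone.utc)
# "CryptoCurrency" is only an example subreddit name.
urls = [build_query_url("CryptoCurrency", a, b) for a, b in daily_windows(start, end)]
```

Keeping the windowing separate from the HTTP call makes a failed window easy to re-run on its own.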

Perform Topic Modelling to create Enhanced Keyword Uncertainty Index

  • Extract raw data from pkl files and Process
  • Training Pipelines / Script for Topic Modelling
  • Test out topic modelling algorithms
  • Build up topics and define lexicon for uncertainty related topics and word distributions
  • Construct new index
  • Clean up noise in text
  • Remodel using LDA with Gensim / Mallet for K = {2, 9} and compute BIC / Bayes factor
  • Run Top2Vec with Doc2Vec and tune for fewer clusters
  • Test out Bigrams (Debugging in Progress / KIV)

Fine Tune / Transfer Learn Hedge Classifier

  • Prepare Training Data Preprocessing Pipeline
  • Prepare model tuning scripts with W&B
  • Transformer
  • Vinzsce's Model
  • Hyperparam Tune with RayTune + W&B
  • Select model and perform inference
  • Build Hedge based Uncertainty Index

Train and Evaluate Forecasting Models

  • Build baseline models and evaluate
  • Rework Time Series analysis
  • Rework / Tune Forecasting models
  • Evaluate using the Diebold-Mariano test
  • Enhance BTC-USD data with more technical analysis indicators
  • Build Dynamic Factor Model to synthesise TA indicators
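
For the Diebold-Mariano evaluation mentioned above, a minimal standard-library sketch of the test statistic might look like the following. It assumes squared-error loss, a rectangular window over the first `horizon - 1` autocovariances, and the asymptotic normal approximation with no small-sample correction. The function name and the example error series are made up for illustration and are not the repository's implementation.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def diebold_mariano(
    errors_a: Sequence[float], errors_b: Sequence[float], horizon: int = 1
) -> Tuple[float, float]:
    """Return (DM statistic, two-sided p-value) under squared-error loss."""
    if len(errors_a) != len(errors_b):
        raise ValueError("error series must have equal length")
    n = len(errors_a)
    # Loss differential: positive values mean model A had the larger loss.
    d = [a * a - b * b for a, b in zip(errors_a, errors_b)]
    d_bar = sum(d) / n

    def autocov(lag: int) -> float:
        # Sample autocovariance of the loss differential at the given lag.
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)) / n

    # Long-run variance using autocovariances up to lag horizon - 1.
    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive variance estimate; cannot form the statistic")
    dm = d_bar / math.sqrt(long_run_var / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(dm)))
    return dm, p_value


# Made-up one-step-ahead forecast errors for two competing models.
errors_a = [1.0, -2.0, 1.5, -1.0, 2.0, -1.5]
errors_b = [0.5, -1.0, 0.5, -0.5, 1.0, -1.0]
dm, p_value = diebold_mariano(errors_a, errors_b)
dm_swapped, _ = diebold_mariano(errors_b, errors_a)
```

A positive statistic indicates that the first model had the larger average loss; swapping the arguments flips the sign.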

Pull Wiki Weasel Data

  1. Explore Wiki Dumps and obtain initial raw data
  2. Process data and construct Wiki Weasel initial dataset
  3. [Good to Have] Explore Edit History to enrich Wiki Weasel tags

Text analysis and lexicon build up

EDA, lexicon enhancement, etc. prior to building our text-based keyword uncertainty index

  • Explore raw data (Using notebooks & Kibana dashboards)
  • Explore emoticons and crypto lexicons to create mappings for processing
  • Reindex existing index to new ES index with custom text analyzer
  • Build up preliminary text preprocessing and relevant pipelines using Elasticsearch / SpaCy / NLTK and others
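
To make the idea of a keyword index concrete, here is a minimal sketch that computes, per day, the share of posts mentioning at least one lexicon term. The source does not state the index formula, so this share-of-posts definition is an assumption, as are the tokenizer, the function name, the sample posts and the three-word lexicon.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

TOKEN_RE = re.compile(r"[a-z']+")  # deliberately simple tokenizer for the sketch


def keyword_index(posts: Iterable[Tuple[str, str]], lexicon: Set[str]) -> Dict[str, float]:
    """Map each day to the fraction of its posts containing any lexicon term."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for day, text in posts:
        tokens = set(TOKEN_RE.findall(text.lower()))
        totals[day] += 1
        if tokens & lexicon:
            hits[day] += 1
    return {day: hits[day] / totals[day] for day in sorted(totals)}


# Hypothetical lexicon and posts, for illustration only.
lexicon = {"uncertain", "unclear", "risk"}
posts = [
    ("2021-01-01", "BTC looks uncertain today"),
    ("2021-01-01", "to the moon"),
    ("2021-01-02", "Unclear what regulators will do"),
]
index = keyword_index(posts, lexicon)
```

In a fuller pipeline the tokenization would more likely come from the Elasticsearch analyzer or SpaCy / NLTK steps listed above.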

Fix Elasticsearch indexing field

The current _id field is generated automatically by ES on bulk insert. To prevent duplication when re-running extraction pipelines (i.e. to make them idempotent), the _id field should use Reddit's own comment / submission id, which is always unique.
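
A minimal sketch of the fix described above: build bulk actions whose `_id` is taken from each record, so that re-running a pipeline overwrites existing documents instead of adding duplicates. The record field name `id`, the index name and the helper name are assumptions for illustration. The action dictionaries follow the shape accepted by the `elasticsearch.helpers.bulk` helper in the Python client.

```python
from typing import Any, Dict, Iterable, Iterator


def to_bulk_actions(
    records: Iterable[Dict[str, Any]], index_name: str
) -> Iterator[Dict[str, Any]]:
    """Yield bulk actions whose _id is the record's own Reddit id."""
    for record in records:
        yield {
            "_op_type": "index",  # replaces the document if the _id already exists
            "_index": index_name,
            "_id": record["id"],  # assumed field holding the comment / submission id
            "_source": record,
        }


# Made-up records; the second simulates a duplicate from a pipeline re-run.
comments = [
    {"id": "abc123", "body": "not sure where BTC goes next"},
    {"id": "abc123", "body": "not sure where BTC goes next"},
    {"id": "def456", "body": "might sell, might hold"},
]
actions = list(to_bulk_actions(comments, "reddit-comments"))
unique_ids = {action["_id"] for action in actions}
```

With a client at hand, the generator could be passed straight to `elasticsearch.helpers.bulk(client, to_bulk_actions(comments, "reddit-comments"))`. The three actions above would then leave only two documents in the index.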
