christopherliew / crypto-uncertainty-index

End to end natural language processing, machine learning and data engineering pipelines for a social media text based cryptocurrency uncertainty index.

Python 97.70% Makefile 1.37% SQL 0.30% Erlang 0.63%

cryptocurrency data-engineering econometrics elasticsearch kibana machine-learning natural-language-processing transformer-models

crypto-uncertainty-index's Introduction

Hey there, I'm Christopher 👨🏻‍💻

I am a software engineer whose interests lie in databases, distributed systems, big data processing and natural language processing. You can find my passion, learning and school projects messily dumped here (cos that's what repositories are for, right?), though they're due for some committed progress and refactoring (tech debt is real 😳), along with other messily strewn, half-finished notes on tech-related readings and exploits.

crypto-uncertainty-index's Issues

Update README once all is done

Update CLI docs + Add Examples
Update Results
Update set up

Build baseline Lucey et al Keyword Uncertainty Index

Develop aggregation logic
Construct relevant pipelines
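
The aggregation logic for a keyword index could be sketched roughly as below. This is a minimal sketch, assuming the baseline is a weekly count of posts that match both an uncertainty term and a crypto term, standardised and re-centred on 100. The keyword sets, the function names and the use of the population standard deviation are all assumptions for illustration, not the lexicon or method actually used in this repository.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev

# Hypothetical keyword sets, for illustration only; the real lexicon may differ.
UNCERTAINTY_TERMS = {"uncertain", "uncertainty"}
CRYPTO_TERMS = {"bitcoin", "btc", "ethereum", "crypto", "cryptocurrency"}


def is_match(text):
    """True when the text contains at least one term from each keyword set."""
    tokens = set(text.lower().split())
    return bool(tokens & UNCERTAINTY_TERMS) and bool(tokens & CRYPTO_TERMS)


def weekly_counts(posts):
    """Count matching posts per ISO (year, week); `posts` yields (date, text) pairs."""
    counts = Counter()
    for day, text in posts:
        year, week, _ = day.isocalendar()
        # Adding 0 still registers the week, so weeks without matches are kept.
        counts[(year, week)] += int(is_match(text))
    return dict(counts)


def standardise(counts):
    """Map weekly counts to an index centred on 100: (N_t - mean) / std + 100."""
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {week: 100.0 for week in counts}
    return {week: (n - mu) / sigma + 100.0 for week, n in counts.items()}


posts = [
    (date(2021, 1, 4), "bitcoin uncertainty is high"),
    (date(2021, 1, 5), "uncertain about btc"),
    (date(2021, 1, 11), "bitcoin to the moon"),
]
index = standardise(weekly_counts(posts))
```

Whitespace tokenisation is used here only to keep the sketch short; a real pipeline would presumably reuse the project's text preprocessing.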

Refactor Pipelines

Naming conventions and functional abstraction needs to be tidied up:

Refactor dataclass to pydantic
Refactor data engineering pipelines

Add in Rich Tables for Pipelines

Add in relevant Utils

Abstract where possible

Refactor CLI

Add in help

Fix Poetry integration

Pull reddit data from pushshift.io

Code up extraction script and utilities
Build & Test Data Extraction and Loading Pipelines
Extract and load all crypto data
Do up data dictionary

Perform Topic Modelling to create Enhanced Keyword Uncertainty Index

Extract raw data from pkl files and Process
Training Pipelines / Script for Topic Modelling
Test out topic modelling algorithms
Build up topics and define lexicon for uncertainty related topics and word distributions
Construct new index
Clean up noise in text
Remodel using LDA with Gensim / Mallet for K = {2, 9} and compute BIC / Bayes factor
Run Top2Vec with Doc2Vec and tune for fewer clusters
Test out Bigrams (Debugging in Progress / KIV)

Set up Repository with Elasticsearch, Kibana and Postgres

Set up local ES, Kibana and Postgres
Set up Helpers and Schemas

Fine Tune / Transfer Learn Hedge Classifier

Prepare Training Data Preprocessing Pipeline
Prepare model tuning scripts with W&B

Transformer

Vinzsce's Model

Hyperparam Tune with RayTune + W&B
Select model and perform inference
Build Hedge based Uncertainty Index

Train and Evaluate Forecasting Models

Build baseline models and evaluate
Rework Time Series analysis
Rework / Tune Forecasting models
Evaluate using Diebold Mariano
Enhance BTC-USD data with more technical analysis indicators
Build Dynamic Factor Model to synthesise TA indicators
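
For the Diebold-Mariano evaluation step, a minimal sketch of the test statistic is shown below. It assumes squared-error loss, a normal approximation for the p-value and a long-run variance built from autocovariances up to lag `horizon - 1`. The function name and signature are hypothetical and are not taken from this repository.

```python
from math import sqrt
from statistics import NormalDist, mean


def diebold_mariano(actual, forecast_a, forecast_b, horizon=1):
    """Return (DM statistic, two-sided p-value) under squared-error loss.

    A negative statistic means forecast_a has the lower average loss.
    """
    # Loss differential d_t = e_a,t^2 - e_b,t^2
    d = [(y - a) ** 2 - (y - b) ** 2 for y, a, b in zip(actual, forecast_a, forecast_b)]
    n = len(d)
    d_bar = mean(d)

    def autocov(lag):
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)) / n

    # Long-run variance of the loss differential, using lags 0 .. horizon - 1
    long_run_var = autocov(0) + 2 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive variance estimate; the test is undefined here")

    stat = d_bar / sqrt(long_run_var / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value


actual = [1.0, 2.0, 3.0, 4.0]
model_a = [1.0, 2.0, 3.0, 4.0]   # toy forecasts, not real results
model_b = [2.0, 2.0, 5.0, 4.0]
stat, p_value = diebold_mariano(actual, model_a, model_b)
```

With so few observations the normal approximation is poor; small-sample corrections exist but are left out of this sketch.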

Pull Wiki Weasel Data

Explore Wiki Dumps and obtain initial raw data
Process data and construct Wiki Weasel initial dataset
[Good to Have] Explore Edit History to enrich Wiki Weasel tags

Text analysis and lexicon build up

EDA, Lexicon enhancement, etc. prior to building our text based keyword uncertainty index

Explore raw data (Using notebooks & Kibana dashboards)
Explore emoticons and crypto lexicons to create mappings for processing
Reindex existing index to new ES index with custom text analyzer
Build up preliminary text preprocessing and relevant pipelines using Elasticsearch / SpaCy / NLTK and others
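
The emoticon and crypto lexicon mappings mentioned above could feed a preprocessing step along these lines. This is only a sketch using the standard library: the mapping entries, the regular expressions and the `normalise` function are hypothetical placeholders, and the actual pipelines may rely on Elasticsearch analyzers, SpaCy or NLTK instead.

```python
import re

# Hypothetical mappings for illustration; the project's real lexicon is not shown here.
EMOTICON_MAP = {":)": "smile", ":(": "frown", "🚀": "rocket"}
SLANG_MAP = {"hodl": "hold", "btc": "bitcoin", "eth": "ethereum"}

_URL = re.compile(r"https?://\S+")
_WORD = re.compile(r"[a-z0-9']+")


def normalise(text):
    """Lowercase, drop URLs, spell out emoticons, then expand crypto slang."""
    text = _URL.sub(" ", text.lower())
    for symbol, word in EMOTICON_MAP.items():
        # Pad with spaces so the replacement becomes its own token.
        text = text.replace(symbol, f" {word} ")
    tokens = _WORD.findall(text)
    return [SLANG_MAP.get(token, token) for token in tokens]


tokens = normalise("HODL my BTC 🚀 :) https://example.com/x")
```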

Perform Exploratory Analysis with Other Asset Indicators

Build pipelines to pull yfinance and other related ticker data
Exploratory time series analysis with Lucey Index

Fix Elasticsearch indexing field

The current _id field is generated automatically by ES on bulk insert. To prevent duplication (i.e. to make re-running the extraction pipelines idempotent), the _id field should use Reddit's own comment / submission id, which is always unique.
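
One way to do this is to set `_id` explicitly on each bulk action. The sketch below only builds the action dictionaries and assumes each record carries Reddit's id under an `"id"` key; the `to_bulk_actions` helper and the index name are hypothetical.

```python
def to_bulk_actions(records, index_name):
    """Yield bulk actions whose _id is the Reddit id, so re-runs overwrite rather than duplicate."""
    for record in records:
        yield {
            "_op_type": "index",    # index = create or overwrite the document with this _id
            "_index": index_name,
            "_id": record["id"],    # assumption: Reddit's id is stored under "id"
            "_source": record,
        }


records = [
    {"id": "abc123", "body": "first pull"},
    {"id": "abc123", "body": "same comment, pulled again"},
]
actions = list(to_bulk_actions(records, "reddit-comments"))

# With the official Python client, the actions could then be sent with, for example:
#   from elasticsearch import helpers
#   helpers.bulk(client, to_bulk_actions(records, "reddit-comments"))
```

Because both actions share the same `_id`, the second one replaces the first instead of creating a duplicate document.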

Explore Kedro to modularize code

If time permits, explore using Kedro to modularize the code. See the docs: https://kedro.readthedocs.io/en/stable/
