NFT_Ethereum_ETL

Overview

The goal of this project was to develop and deploy a scalable pipeline on the cloud using data collected from three separate APIs. The primary dataset this pipeline extracts from is OpenSea's NFT marketplace, which holds data on NFT collections such as transaction price, transaction date, owner address, etc. Data for the highest-valued NFT collections is first extracted and then enriched with data from CryptoCompare and EtherScan to ultimately enable analysis of NFT price/transaction statistics as well as of each owner's Ethereum wallet token balances. The project utilizes solutions such as:

  1. Azure Blob Storage
  2. Azure DataBricks Cluster
  3. Spark
  4. Airflow and DataBricks Orchestrator

Step 1: Data Acquisition

  • Python script ingesting sample data from the OpenSea NFT marketplace and storing the raw data (see the sketch below)
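
For illustration, a minimal acquisition sketch is shown below. The endpoint, parameters, API key, and output file name are placeholder assumptions (modeled on the legacy OpenSea v1 REST API), not the project's actual script:

```python
import json
import requests

# Placeholders modeled on the legacy OpenSea v1 API; adjust to the API
# version and credentials actually in use.
OPENSEA_URL = "https://api.opensea.io/api/v1/collections"
API_KEY = "YOUR_OPENSEA_API_KEY"

def fetch_sample_collections(limit=20, offset=0):
    """Pull one page of collection metadata and return the parsed JSON."""
    resp = requests.get(
        OPENSEA_URL,
        params={"limit": limit, "offset": offset},
        headers={"X-API-KEY": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Persist the raw payload untouched so downstream steps can re-parse it.
    with open("raw_opensea_collections.json", "w") as f:
        json.dump(fetch_sample_collections(), f)
```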

Step 2: Data Exploration

  • Notebook for exploring the raw OpenSea data, as well as for understanding the extraction/enrichment methods and data from EtherScan and CryptoCompare
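
As a hedged illustration of that first-pass exploration, the sketch below profiles a raw payload with pandas; the file name and payload shape are assumptions carried over from the acquisition sketch above:

```python
import json
import pandas as pd

# Placeholder file name, carried over from the acquisition sketch above.
with open("raw_opensea_collections.json") as f:
    payload = json.load(f)

# Flatten nested records; the top-level "collections" key is an assumption
# about the payload shape -- adjust to the response you actually receive.
records = payload["collections"] if isinstance(payload, dict) else payload
df = pd.json_normalize(records)

# First-pass profiling: row/column counts, inferred schema, null counts.
print(df.shape)
print(df.dtypes)
print(df.isna().sum().sort_values(ascending=False).head(10))
```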

Step 3: Pipeline Prototype

  • Pipeline using Airflow to schedule and orchestrate the complete end-to-end steps of extracting and enriching data from all API sources, performing necessary transformations, and loading to persistence layers
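
A skeleton of such a DAG might look like the sketch below; the task callables and schedule are hypothetical stand-ins for the project's actual logic. The two enrichment tasks fan out in parallel after extraction and join before the final load:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical stubs; the real pipeline supplies its own callables.
def extract_opensea(**context): ...
def enrich_etherscan(**context): ...
def enrich_cryptocompare(**context): ...
def transform_and_load(**context): ...

with DAG(
    dag_id="nft_ethereum_etl",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_opensea", python_callable=extract_opensea)
    etherscan = PythonOperator(task_id="enrich_etherscan", python_callable=enrich_etherscan)
    crypto = PythonOperator(task_id="enrich_cryptocompare", python_callable=enrich_cryptocompare)
    load = PythonOperator(task_id="transform_and_load", python_callable=transform_and_load)

    # Enrichment runs in parallel after extraction, then feeds the load step.
    extract >> [etherscan, crypto] >> load
```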

Step 4: Pipeline Deployment

  • Upload data to Azure Blob and deploy scripts to the DataBricks cluster, using the DataBricks built-in orchestration tool to schedule jobs
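
As a hedged sketch of this deployment step, the snippet below registers a scheduled job through the Databricks Jobs API 2.1 using a spark_python_task; the workspace URL, token, cluster ID, DBFS script path, and cron expression are all placeholders:

```python
import requests

# Placeholders: substitute your workspace URL, personal access token,
# cluster ID, and the DBFS path the pipeline script was deployed to.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "YOUR_DATABRICKS_PAT"

job_spec = {
    "name": "nft_ethereum_etl",
    "tasks": [
        {
            "task_key": "run_pipeline",
            "existing_cluster_id": "<cluster-id>",
            "spark_python_task": {"python_file": "dbfs:/mnt/scripts/pipeline.py"},
        }
    ],
    # Example schedule: daily at 06:00 UTC.
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```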

Step 5: Pipeline Monitoring

  • Uses the Ganglia dashboard, a distributed monitoring system for high-performance computing systems, to check metrics related to CPU, memory usage, and network

Diagram

[Pipeline architecture diagram]

Directory Details

  • Cloud_Deployment: Scripts, reference notebooks, and images relating to pipeline cloud deployment using the DataBricks cluster
  • Local_Deployment: Scripts, reference notebooks, and images relating to pipeline local deployment
  • Exploratory_Data_Analysis: Notebooks used to understand source data from the APIs and determine methods of extraction and transformation
  • Sample_Data_Acquisition: Script for acquiring sample data from API
  • Testing: Script for testing and validating each stage of pipeline

Local

  • Dependencies: Spark, Airflow, and source data available in the local environment
  • Run: Run the local Python and Airflow DAG scripts with the correct configurations and paths

Production

  • Dependencies: Microsoft Azure account, with Databricks cluster and Blob container resources set up, and Spark configurations updated with the Microsoft account and Blob information.
  • Set-up: Using either AzCopy or Azure Storage Explorer, populate the Blob container with source data. In addition, to make the data accessible to DataBricks, mount the Blob container onto the DataBricks cluster. After mounting the data, if there are certain modules that need to be accessed by PySpark scripts (e.g. API configs), add the DBFS filepath of each module to the SparkContext using spark.sparkContext.addPyFile('dbfs:/mnt/path/to/module/'). A sketch of these two steps follows this list.
  • Run: Use DataBricks orchestration tool to schedule jobs, add dependencies, and pass in parameters. In this version, we are using the orchestration tool to schedule Python scripts rather than notebooks.
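
A minimal sketch of the set-up described above (the Blob mount followed by addPyFile), assuming it runs inside a Databricks notebook or job where dbutils and spark are predefined; the storage account, container, secret scope, and module path are placeholders:

```python
# Placeholders: replace with your storage account, container, and paths.
storage_account = "<storage-account>"
container = "<container>"
mount_point = "/mnt/nft_etl"

# Mount the Blob container over wasbs, reading the account key from a
# Databricks secret scope (the scope/key names here are examples).
dbutils.fs.mount(
    source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point=mount_point,
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
            dbutils.secrets.get(scope="etl", key="blob-account-key")
    },
)

# Make a shared module (e.g. API configs) importable from PySpark scripts.
spark.sparkContext.addPyFile("dbfs:/mnt/nft_etl/modules/api_config.py")
```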
