This project forked from margueriteblair/big-data-processor

Two minimum viable products that import a ~6 million record .csv file into PostgreSQL. One method uses batch processing; the other uses Stateless Sessions to loop through the data and insert it into the appropriate row and column of a SQL database.

Java 100.00%

big-data-processor's Introduction

Large Dataset Import Microservice

This repo contains two minimum viable products that import a roughly 6 million record .csv file into PostgreSQL. The first method I created uses Stateless Sessions to stringify the data and loop through the data file; the second uses Spring Batch processing.
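As a rough illustration of the loop-and-insert idea, here is a minimal sketch in plain Java. It is not the code in the "session" package: the class name `CsvLoopSketch`, the `load` method, the sample column names, and the `sink` callback (a stand-in for the database write) are all hypothetical, and the naive comma split assumes no quoted fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class CsvLoopSketch {

    /**
     * Walks the CSV lines, skips the header, and hands rows to the sink
     * in groups of at most batchSize. Returns the number of data rows.
     */
    static int load(Iterable<String> lines, int batchSize,
                    Consumer<List<String[]>> sink) {
        List<String[]> batch = new ArrayList<>(batchSize);
        int total = 0;
        boolean header = true;
        for (String line : lines) {
            if (header) {
                header = false; // first line holds column names
                continue;
            }
            if (line.isBlank()) {
                continue; // ignore empty lines
            }
            // Naive split: assumes no quoted fields containing commas.
            batch.add(line.split(",", -1));
            total++;
            if (batch.size() == batchSize) {
                sink.accept(List.copyOf(batch)); // stand-in for a database write
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(List.copyOf(batch)); // flush the final partial batch
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical columns; the real dataset's schema may differ.
        List<String> lines = List.of(
                "step,type,amount",
                "1,PAYMENT,9.5",
                "1,TRANSFER,181.0",
                "2,CASH_OUT,20.0");
        List<Integer> batchSizes = new ArrayList<>();
        int rows = load(lines, 2, b -> batchSizes.add(b.size()));
        System.out.println(rows + " rows in batches of " + batchSizes);
        // prints "3 rows in batches of [2, 1]"
    }
}
```

For a file on disk, the lines could come from `Files.lines(path)`, which streams the file rather than loading all 6 million records into memory at once.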

The average runtime for the batch processor with a ThreadPoolTaskExecutor is 2 minutes 33 seconds; the average runtime for the stateless sessions parser/processor is 40 minutes. Both methods will be improved in the future by incorporating a MultiResourcePartitioner into the Spring Batch configuration file and by splitting the large dataset into smaller sets, so that multiple threads can operate on different files at the same time.
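To make the chunking idea concrete, the sketch below splits rows into fixed-size chunks and writes each chunk on a pool thread. It deliberately uses a plain JDK `ExecutorService` as a stand-in, not Spring Batch or its ThreadPoolTaskExecutor, so it is only an approximation of what the batch configuration does. The names `ChunkedPoolSketch`, `chunk`, `processAll`, and `writeChunk` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedPoolSketch {

    /** Splits rows into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> chunk(List<T> rows, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += chunkSize) {
            chunks.add(rows.subList(i, Math.min(i + chunkSize, rows.size())));
        }
        return chunks;
    }

    /** Stand-in for a database write; returns the number of rows "written". */
    static int writeChunk(List<String> chunk) {
        return chunk.size();
    }

    /** Writes every chunk on a pool thread and returns the total row count. */
    static int processAll(List<String> rows, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> c : chunk(rows, chunkSize)) {
                Callable<Integer> task = () -> writeChunk(c);
                results.add(pool.submit(task));
            }
            int written = 0;
            for (Future<Integer> f : results) {
                written += f.get(); // blocks until that chunk is done
            }
            return written;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add("row-" + i);
        }
        int written = processAll(rows, 4, 3);
        System.out.println(written + " rows written in "
                + chunk(rows, 4).size() + " chunks");
        // prints "10 rows written in 3 chunks"
    }
}
```

Because the chunks are independent, they can finish in any order; the total is only read after every `Future` has completed.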

This project:

  • Is a Spring Boot service that uses Spring Batch with Spring Data JPA and Hibernate.
  • Imports data from a CSV file (about 6 million records) to a PostgreSQL database.
  • Improves batch processing performance by using a ThreadPoolTaskExecutor to process chunks of data on multiple threads.
  • Supplies the data on which a fraud detection model is built using Python machine learning libraries.
  • Is intended to be launched through an API Gateway server (linked below).
    Instructions to Run

      1. Clone this repository to your local machine.
      2. Download the financial data from Kaggle. Add this data to "resource/data" and be sure to include the .csv file in your .gitignore!
      3. Within main/java/com there are two distinct packages, "batch" and "session", which are the batch processor and sessions processor respectively.
      4. Each package has its own main file that can be run.
      5. Once the application launches without issues, open Postman and send a request to the "/load" route on your configured port.
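If you would rather call the route from code than from Postman, the sketch below uses the JDK's built-in `java.net.http.HttpClient`. Two things here are assumptions, not facts from this repo: that the service listens on port 8080 unless you pass another port, and that "/load" accepts a GET request. The class name `LoadRouteSketch` and the `loadRequest` helper are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class LoadRouteSketch {

    /** Builds a request for the "/load" route on the given local port. */
    static HttpRequest loadRequest(int port) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:" + port + "/load"))
                .GET() // assumption: the route accepts GET
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        // 8080 is an assumed default; pass your configured port as the first argument.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8080;
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(loadRequest(port), HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        } catch (IOException e) {
            // Typically means the service is not running on that port.
            System.out.println("Could not reach the service: " + e);
        }
    }
}
```

The call blocks until the service responds, so expect it to take as long as the import itself if the route runs the job synchronously.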

    Technologies Used

  • Java
  • Spring Boot for REST API
  • Spring Batch Processing (Open Source Data Processing Framework)
  • Maven
  • Factory Design Pattern within Batch Processor
  • Hibernate
  • Java Persistence API (JPA)
  • PostgreSQL
  • Gateway Server Communication. Gateway Server can be found here.
    Contributors

  • margueriteblair
