
RDBMS to HDFS data pipeline

img

This project moves data from a relational database system (RDBMS) to the Hadoop Distributed File System (HDFS). Here is the technology stack we will use in this project.

  • PostgreSQL
  • Apache Sqoop
  • Apache Hive
  • Apache Spark

The learning objectives

  1. To create a modified Dockerfile based on your use case.
  2. To learn how to set up components with Docker in both traditional and distributed environments.
  3. To craft a data pipeline that moves data from a traditional system to a distributed system.
  4. To learn how to communicate between containers in the same environment.
  5. To build data quality checks at intermediate steps of the data pipeline.
  6. To become familiar with the basic usage of Sqoop, Hive, and Spark.

Required

  1. Docker desktop
  2. Docker compose

Architecture compatibility

  • linux/amd64

How to run this project

I provide a docker-compose file so that you can run the whole application with the following command.

docker-compose up -d

Then you can access the Airflow webserver UI on port 8080.

Feel free to switch on the DAG toggle for hands_on_test. Its start_date is set to days_ago(1) and it is scheduled to run daily.
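
For reference, here is a minimal sketch of how that scheduling setup might be declared in the DAG file. The dag_id, days_ago(1) start date, daily schedule, and single retry come from this README; the exact argument layout and the retry delay are assumptions, not the repository's actual code.

# Hypothetical sketch of the scheduling described above (Airflow 1.10-style imports assumed).
from datetime import timedelta

from airflow import DAG
from airflow.utils.dates import days_ago

default_args = {
    "owner": "airflow",
    "retries": 1,                        # each task retries once (see DAG overviews below)
    "retry_delay": timedelta(minutes=5), # assumed value, not taken from the repo
}

dag = DAG(
    dag_id="hands_on_test",
    default_args=default_args,
    start_date=days_ago(1),
    schedule_interval="@daily",
)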

Once the pipeline has run to completion, you can verify the results on the following components.

PostgreSQL

# show tables in the database
docker exec postgres-db psql -U postgres -d lineman_wongnai -c '\dt'
# describe table
docker exec postgres-db psql -U postgres -d lineman_wongnai -c "
SELECT 
   table_name, 
   column_name, 
   data_type 
FROM 
   information_schema.columns
WHERE 
   table_name = '<<TARGET TABLE>>';
"
# change <<TARGET TABLE>> to your table name e.g., 'order_detail', 'restaurant_detail'
# sample data
docker exec postgres-db psql -U postgres -d lineman_wongnai -c "SELECT * FROM <<TARGET TABLE>> LIMIT 5;"

# change <<TARGET TABLE>> to your table name e.g., 'order_detail', 'restaurant_detail'

HIVE

# show tables
docker exec hive-server beeline -u jdbc:hive2://localhost:10000/default -e "SHOW TABLES;"
# describe table
docker exec hive-server beeline -u jdbc:hive2://localhost:10000/default -e "SHOW CREATE TABLE <<TARGET TABLE>>;"

# change <<TARGET TABLE>> to your table name e.g., 'order_detail', 'restaurant_detail'
# sample data
docker exec hive-server beeline -u jdbc:hive2://localhost:10000/default -e "SELECT * FROM <<TARGET TABLE>> LIMIT 5;"

# change <<TARGET TABLE>> to your table name e.g., 'order_detail', 'restaurant_detail'
# check partitioned parquet
docker exec hive-server hdfs dfs -ls /user/spark/transformed_order_detail
docker exec hive-server hdfs dfs -ls /user/spark/transformed_restaurant_detail

# check the source of the external tables in the ./airflow/scripts/hql scripts.

For the SQL requirements, the CSV files will be placed in the ./sql_result folder when the DAG is completed.

After you finish the test, you can shut down the whole application with

docker-compose down -v

Troubleshooting

  1. Cannot spin up docker-compose because a port is already in use.
    • Check the host port mapping in the docker-compose.yaml file and change it to another open port, e.g., '10000:10000' to '10001:10000'.
  2. If you run this project on the arm64 architecture, the import_sqoop task will fail. Please switch to the linux/amd64 architecture.

Business requirement

  1. Create two tables in the Postgres database with the given column types.

    • order_detail table using order_detail.csv
    • restaurant_detail table using restaurant_detail.csv
    • Check the COPY command in airflow/dags/script/sql_queries.py
  2. Once we have these two tables in the Postgres DB, ETL the same tables into Hive with the same names and corresponding Hive data types, using the guidelines below.

    • Both tables should be external tables
    • Both tables should use the Parquet file format
    • The restaurant_detail table should be partitioned by a column named dt (type string) with a static value latest
    • The order_detail table should be partitioned by a column named dt (type string) extracted from order_created_timestamp in the format YYYYMMDD
    • Check the DESCRIBE TABLE and SAMPLE DATA commands in the HIVE section above.
  3. After creating the above tables in Hive, create two new tables, order_detail_new and restaurant_detail_new, with their respective columns and partitions, and add one new column to each table as explained below (see the hedged PySpark sketch after this requirements list).

    • discount_no_null - replace all NULL values of the discount column with 0
    • cooking_bin - derived from the estimated_cooking_time column using the logic below
      • I noticed a gap in the estimated cooking times below 10 minutes, so values outside the provided ranges default to NULL.
    • Check the SAMPLE DATA commands in the HIVE section above. You can edit the query to test the requirement, e.g., SELECT COUNT(*) FROM order_detail_new WHERE discount_no_null IS NULL. The expected result is 0.
    estimated_cooking_time    cooking_bin
    10-40                     1
    41-80                     2
    81-120                    3
    > 120                     4
  4. Final column count of each table (including partition column):

    • order_detail = 9
    • restaurant_detail = 7
    • order_detail_new = 10
    • restaurant_detail_new = 8
    • Check DESCRIBE TABLE command in the previous HIVE section.
  5. SQL requirements & CSV output requirements

    • Get the average discount for each category
      • I'm not sure whether the requirement refers to the discount or the discount_no_null column, so I calculate both; they can lead to different business interpretations.
    • Row count for each cooking_bin
    • Check the results in ./sql_result: discount.csv and cooking.csv
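
As a rough illustration of the transformation logic in requirements 2 and 3 above, here is a hedged PySpark sketch. The column names, partition rules, and binning thresholds come from this README; the session setup, source table names, and output paths are assumptions and do not reproduce the repository's actual Spark script (see ./airflow/dags/scripts/spark for that).

# Hypothetical PySpark sketch of the dt / discount_no_null / cooking_bin logic.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = (SparkSession.builder
         .appName("transform_details")
         .enableHiveSupport()
         .getOrCreate())

# order_detail_new: dt partition derived from order_created_timestamp (YYYYMMDD),
# plus discount_no_null that replaces NULL discounts with 0.
order_detail_new = (
    spark.table("order_detail")
    .withColumn("dt", F.date_format("order_created_timestamp", "yyyyMMdd"))
    .withColumn("discount_no_null", F.coalesce(F.col("discount"), F.lit(0)))
)

# restaurant_detail_new: static dt partition 'latest', plus cooking_bin binned
# from estimated_cooking_time; values outside the listed ranges stay NULL.
restaurant_detail_new = (
    spark.table("restaurant_detail")
    .withColumn("dt", F.lit("latest"))
    .withColumn(
        "cooking_bin",
        F.when(F.col("estimated_cooking_time").between(10, 40), 1)
         .when(F.col("estimated_cooking_time").between(41, 80), 2)
         .when(F.col("estimated_cooking_time").between(81, 120), 3)
         .when(F.col("estimated_cooking_time") > 120, 4),
    )
)

# Write partitioned Parquet; the paths mirror the HDFS locations checked in the HIVE
# section, but which tables actually live there is an assumption.
for df, path in [(order_detail_new, "/user/spark/transformed_order_detail"),
                 (restaurant_detail_new, "/user/spark/transformed_restaurant_detail")]:
    df.write.mode("overwrite").partitionBy("dt").parquet(path)

In the actual project, the external Hive tables over these Parquet locations are created by the HQL scripts referenced in the HIVE section above.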

Technical Requirements

  • Use Apache Spark, Apache Sqoop or any other big data frameworks
  • Use a scheduler tool to run the pipeline daily. Airflow is preferred
  • Include a README file that explains how we can deploy your code
  • (bonus) Use Docker or Kubernetes to get the program up and running

DAG

img

DAG overviews

  1. We orchestrate and schedule this project with Airflow. It runs on a daily basis, and each task retries once on failure.
  2. For the Postgres part, we use the PostgresOperator to submit the SQL scripts provided in airflow/scripts/sql_queries.py to create the Postgres tables.
  3. We provide a custom PostgresDataQualityOperator to check that the number of rows in each table matches the source data in the ./data folder (a hedged sketch of this kind of operator appears after this list).
  4. For the second part of the pipeline, we ingest data from the RDBMS into HDFS with Sqoop. Unfortunately, with my current understanding of the Hadoop ecosystem, I could not set up a separate Sqoop container to isolate this operation, so I worked around it by installing the Sqoop component in the hive-server container. I then use the BashOperator to run the ingestion from a shell script. The data is placed in the hive-server HDFS filesystem under the /user/sqoop/ folder. The source code and required components for Sqoop are in ./airflow/othres/sqoop.
  5. For the Hive and Spark components, I had trouble using the SparkSubmitOperator and HiveOperator from the Airflow containers, so I worked around it by modifying the Airflow Docker image to make the docker command available inside the container. You can find the scripts to build that image in the dockerfile and requirements.txt files. I have already pushed the modified image to Docker Hub, so you don't need to build it yourself.
  6. With the docker command available inside the Airflow container, I run the Spark and Hive steps through docker exec, using the BashOperator to execute those scripts. The Spark source code is in ./airflow/dags/scripts/spark; the Hive source code is in ./airflow/dags/scripts/hive.
  7. For both Postgres and Hive tables, I use a drop-and-recreate style. For the Spark output, the default write mode is overwrite.
  8. For the SQL requirements, the CSV files are placed in the ./sql_result folder when the DAG is completed.
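
As an illustration of the row-count check described in step 3, here is a hedged sketch of what such a custom operator could look like. The class name PostgresDataQualityOperator comes from the overview above; the constructor arguments, hook usage, and CSV comparison are assumptions rather than the repository's actual implementation (Airflow 1.10-style imports assumed).

# Hypothetical sketch of a row-count data quality check (not the repo's exact operator).
import pandas as pd

from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class PostgresDataQualityOperator(BaseOperator):
    """Compare a Postgres table's row count against its source CSV in ./data."""

    @apply_defaults
    def __init__(self, postgres_conn_id, table, source_csv, *args, **kwargs):
        super(PostgresDataQualityOperator, self).__init__(*args, **kwargs)
        self.postgres_conn_id = postgres_conn_id
        self.table = table
        self.source_csv = source_csv

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        db_count = hook.get_first("SELECT COUNT(*) FROM {}".format(self.table))[0]
        csv_count = len(pd.read_csv(self.source_csv))
        if db_count != csv_count:
            raise ValueError(
                "Row count mismatch for {}: {} rows in Postgres vs {} rows in {}".format(
                    self.table, db_count, csv_count, self.source_csv))
        self.log.info("Data quality check passed for %s (%s rows)", self.table, db_count)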

Question output

  • Source code
  • Docker, docker-compose, kubernetes files if possible.
  • README of how to test / run

