Films Revenue Data Scraping From Wikipedia

This project extracts, cleans, and transforms Wikipedia data on the highest-grossing films. Python handles extraction and cleaning, and the cleaned data is stored in PostgreSQL, MSSQL Server, and MySQL. The entire process runs in a Docker environment, which keeps it consistent and portable across different systems.

Project Architecture

Objectives

Data Extraction: Extract relevant data on high-revenue films from Wikipedia.
Data Cleaning: Process and clean the extracted data to ensure accuracy and consistency.
Data Loading: Load the cleaned data into PostgreSQL, MSSQL Server, and MySQL databases.
Management: Use Docker to create a consistent and portable environment for the project.
Analysis: Enable further analysis of the films' revenue data across different database systems.
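
The extraction objective can be illustrated with a minimal sketch. It parses an inline HTML sample rather than the live page, so it runs without network access. The sample markup, the column names, and the extract_rows helper are hypothetical and are not taken from the project's code; the real Wikipedia table may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a Wikipedia "wikitable";
# the real page layout and column order may differ.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Rank</th><th>Title</th><th>Worldwide gross</th><th>Year</th></tr>
  <tr><td>1</td><td>Film A</td><td>$2,000,000,000</td><td>2009</td></tr>
  <tr><td>2</td><td>Film B</td><td>$1,500,000,000</td><td>2019</td></tr>
</table>
"""


def extract_rows(html):
    """Return one dict per data row of the first wikitable in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    # The first row is assumed to hold the column headers.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        # Data rows may mix <td> and <th> cells, so both are collected.
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        if len(cells) == len(headers):
            records.append(dict(zip(headers, cells)))
    return records


records = extract_rows(SAMPLE_HTML)
print(records[0])
```

Against the live page, the HTML string would instead come from requests.get(url).text, where url is the Wikipedia page the project targets.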

Skills

Web Scraping: Extracting data from Wikipedia using Python libraries such as BeautifulSoup and requests.
Data Cleaning: Using Python (pandas) to clean and process raw data.
Database Management: Storing and managing data in PostgreSQL, MSSQL Server, and MySQL.
Docker: Creating and managing Docker containers to ensure a consistent development environment.
SQL: Writing efficient SQL queries for data manipulation and retrieval.
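
As an illustration of the data cleaning skill, the sketch below tidies a small table with pandas. The raw values, the column names, and the clean_films helper are hypothetical; the project's actual cleaning rules may differ.

```python
import pandas as pd

# Hypothetical raw rows as they might come out of the scraper.
raw = pd.DataFrame(
    {
        "Title": [" Film A ", "Film B", "Film B"],
        "Worldwide gross": ["$2,000,000,000", "$1,500,000,000[1]", "$1,500,000,000"],
        "Year": ["2009", "2019", "2019"],
    }
)


def clean_films(df):
    """Rename columns, strip text, convert numbers, and drop duplicates."""
    out = df.rename(
        columns={"Title": "title", "Worldwide gross": "worldwide_gross", "Year": "year"}
    )
    out["title"] = out["title"].str.strip()
    out["worldwide_gross"] = (
        out["worldwide_gross"]
        .str.replace(r"\[.*?\]", "", regex=True)  # drop footnote markers like "[1]"
        .str.replace(r"[^\d]", "", regex=True)    # keep digits only
        .astype("int64")
    )
    out["year"] = out["year"].astype("int64")
    # Deduplicate after cleaning so rows that differ only in noise collapse.
    return out.drop_duplicates().reset_index(drop=True)


cleaned = clean_films(raw)
print(cleaned)
```

Deduplicating after the conversions matters here: the second and third raw rows differ only by a footnote marker, so they only become identical once the revenue column has been cleaned.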

Tools

Python: The primary programming language used for web scraping and data cleaning.
BeautifulSoup: For parsing HTML and extracting data from web pages.
Pandas: For data manipulation and cleaning.
MySQL: First target database for storing the cleaned data.
PostgreSQL: Second target database for storing the cleaned data.
MSSQL Server: Third target database for storing the cleaned data.
Docker: Containerization environment, ensuring consistency across different systems.
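
The loading step for the three target databases can be sketched with the same create, insert, and query pattern. To keep the example runnable without a database server, it uses Python's built-in sqlite3 module purely as a stand-in; the films table, its columns, and the sample rows are assumptions, and the project's real schema and drivers are not shown in this README.

```python
import sqlite3

# Hypothetical cleaned rows: (title, worldwide_gross, year).
films = [
    ("Film A", 2000000000, 2009),
    ("Film B", 1500000000, 2019),
]

# sqlite3 is only a stand-in so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE films ("
    "title TEXT PRIMARY KEY, worldwide_gross INTEGER, year INTEGER)"
)

# sqlite3 uses "?" placeholders; other database drivers may use a different style.
conn.executemany(
    "INSERT INTO films (title, worldwide_gross, year) VALUES (?, ?, ?)",
    films,
)
conn.commit()

# Retrieve the highest-grossing film from the loaded data.
top = conn.execute(
    "SELECT title, worldwide_gross FROM films "
    "ORDER BY worldwide_gross DESC LIMIT 1"
).fetchone()
print(top)
conn.close()
```

SQL dialects differ between the targets: LIMIT is accepted by PostgreSQL and MySQL, while MSSQL Server uses TOP for the same purpose, so queries may need small adjustments per database.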

Usage: Docker

Ensure Docker, Docker Compose, and their prerequisites are installed on your system.

Clone the repository

git clone https://github.com/Samuel-Njoroge/films-revenue-data-scrapping

cd films-revenue-data-scrapping

Start the Docker containers defined in docker-compose.yml

docker compose up

Contributing

Contributions to improve the project are welcome!

Please feel free to fork the repository and submit a pull request.
