Film Revenue Data Scraping from Wikipedia

This project extracts, cleans, and transforms Wikipedia data on the highest-grossing films. Python handles extraction and cleaning, and the cleaned data is stored in PostgreSQL, MSSQL Server, and MySQL. The whole pipeline runs in a Docker environment, which keeps it consistent and portable across systems.

Project Architecture

data_scrapping

Objectives

  • Data Extraction: Extract relevant data on high-revenue films from Wikipedia.
  • Data Cleaning: Process and clean the extracted data to ensure accuracy and consistency.
  • Data Loading: Load the cleaned data into PostgreSQL, MSSQL Server, and MySQL databases.
  • Management: Utilize Docker to create a consistent and portable environment for the project.
  • Analysis: Enable further analysis on the films' revenue data across different database systems.
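The extraction objective can be sketched as follows. This is a minimal illustration, not the project's actual scraper: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages, and it parses a hand-written sample table instead of fetching a live page. The class name TableRowParser, the function extract_rows, the column headers, and the film names are all hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside an open cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for a fetched page; not real Wikipedia markup.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Title</th><th>Worldwide gross</th><th>Year</th></tr>
  <tr><td>1</td><td><i>Example Film A</i></td><td>$2,000,000,000</td><td>2009</td></tr>
  <tr><td>2</td><td><i>Example Film B</i></td><td>$1,500,000,000</td><td>2019</td></tr>
</table>
"""


def extract_rows(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableRowParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


films = extract_rows(SAMPLE_HTML)
print(films[0])
```

The values come out as raw strings (for example the gross still carries its dollar sign and commas), which is why a separate cleaning step follows extraction.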

Skills

  • Web Scraping: Extracting data from Wikipedia using Python libraries such as BeautifulSoup and requests.
  • Data Cleaning: Using Python (pandas) to clean and process raw data.
  • Database Management: Storing and managing data in PostgreSQL, MSSQL Server, and MySQL.
  • Docker: Creating and managing Docker containers to ensure a consistent development environment.
  • SQL: Writing efficient SQL queries for data manipulation and retrieval.
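As a rough illustration of the data cleaning skill listed above, the sketch below normalizes scraped revenue and title strings using only the standard library's re module (the project itself names pandas for this step). The input formats are assumptions: whole-dollar amounts with thousands separators, optionally followed by bracketed footnote markers. The helper names clean_revenue and clean_title are hypothetical.

```python
import re

# Matches bracketed footnote markers such as "[1]" or "[a]".
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def clean_revenue(raw):
    """Turn a string such as '$2,000,000,000[1]' into an int.

    Assumes whole-dollar amounts; returns None when no digits remain.
    """
    text = FOOTNOTE.sub("", raw)        # drop footnote markers first
    digits = re.sub(r"\D", "", text)    # keep digits only
    return int(digits) if digits else None


def clean_title(raw):
    """Strip footnote markers and collapse repeated whitespace."""
    text = FOOTNOTE.sub("", raw)
    return " ".join(text.split())


print(clean_revenue("$2,000,000,000[1]"))
print(clean_title("  Example   Film A[a] "))
```

Because every non-digit character is discarded, this approach would misread abbreviated values such as "$2.9 billion"; inputs like that would need their own handling.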

Tools

  1. Python: The primary programming language used for web scraping and data cleaning.
  2. BeautifulSoup: For parsing HTML and extracting data from web pages.
  3. Pandas: For data manipulation and cleaning.
  4. MySQL: First target database for storing the cleaned data.
  5. PostgreSQL: Second target database for storing the cleaned data.
  6. MSSQL Server: Third target database for storing the cleaned data.
  7. Docker: Containerization environment that keeps the project consistent across different systems.
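The loading step into the target databases might look roughly like the sketch below. To stay runnable without a database server, it substitutes the standard library's sqlite3 with an in-memory database for MySQL, PostgreSQL, and MSSQL Server; the table name films, its columns, the sample rows, and the function load_films are all assumptions made for illustration. Placeholder style and upsert syntax vary between database drivers, so the SQL here is specific to SQLite.

```python
import sqlite3

# Hypothetical cleaned rows: (title, worldwide_gross, release_year).
cleaned_films = [
    ("Example Film A", 2000000000, 2009),
    ("Example Film B", 1500000000, 2019),
]


def load_films(conn, rows):
    """Create the table if needed and insert rows, replacing duplicates by title."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS films (
            title TEXT PRIMARY KEY,
            worldwide_gross INTEGER,
            release_year INTEGER
        )
        """
    )
    # Parameterized statements keep scraped text out of the SQL string itself.
    conn.executemany("INSERT OR REPLACE INTO films VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_films(conn, cleaned_films)

# Example analysis query: films ordered by revenue, highest first.
top = conn.execute(
    "SELECT title, worldwide_gross FROM films ORDER BY worldwide_gross DESC"
).fetchall()
print(top)
```

Keying the table on title and replacing on conflict means the load can be re-run without creating duplicate rows, which is convenient when containers are restarted.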

Usage: Docker

Ensure that Docker, Docker Compose, and their requirements are installed on your system.

  1. Clone the repository and change into its directory:
     git clone https://github.com/Samuel-Njoroge/films-revenue-data-scrapping
     cd films-revenue-data-scrapping
  2. Start the Docker containers defined in docker-compose.yml:
     docker compose up

Contributing

Contributions to improve the project are welcome!

Please feel free to fork the repository and submit a pull request.
