Web Scraper for Fashion Marketplace Sites

A Python tool for scraping multiple shopping websites such as Grailed, Depop, GOAT, and StockX (and possibly more).

Table of Contents

  • Introduction
  • Project Plan
  • Installation
  • Usage
  • License

Introduction

This project aims to provide a convenient interface for scraping product listings and related data from various online shopping platforms.

This originated from my AP Computer Science Principles project, which was just a Grailed scraper; I wanted to expand it to more sites, so I created this. The original is here.

Project Plan

To-Do List / Possible Features:

  • Implement logging

  • Implement Depop data extraction and scraping.

  • Figure out how to handle the respective scrapers.

  • Refactor directory structure to the type found here

  • Figure out a way to visualize the data (HTML)

  • Feature to specify how many items we want to scrape (command line and config file)

  • Implement StockX data extraction and scraping.

  • Instead of scraping StockX for market data, use their API (maybe use Go for speed).

  • Options to filter the dataframe by a category

  • Process the output files and filter them, or maybe display them visually

  • Add headless mode and print progress updates to stdout

  • Keep Poetry and requirements.txt synchronized (see the export command after this list)
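
For the last item, Poetry can regenerate requirements.txt from its lock file; on recent Poetry versions this may require the poetry-plugin-export plugin:

# regenerate requirements.txt from the Poetry lock file
poetry export -f requirements.txt --output requirements.txt --without-hashes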

Installation

Install using Poetry (recommended):

# clone repository
git clone https://github.com/peppapig450/FashionCrawler

# switch to directory
cd FashionCrawler

# install dependencies
poetry install

Install using a virtual environment:

# clone repository
git clone https://github.com/peppapig450/FashionCrawler

# switch to directory
cd FashionCrawler

# setup and activate virtual environment
python3 -m venv venv && source venv/bin/activate

# install dependencies
pip install -r requirements.txt
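
With either install method, a quick sanity check is to print the CLI help (this assumes main.py is the entry point, as in the usage example below, and exposes a standard --help flag):

# verify the install by printing the CLI help
poetry run python main.py --help    # Poetry install
python main.py --help               # virtual environment install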

Usage

Below are the available options for running the scraper.

Options:

Site Selection:

  • By default, all supported sites are enabled, or the sites specified in the config.yaml file are used (a minimal example is sketched after this list).
  • --enable-site ENABLE_SITE: Enable specific site(s) by providing a comma-separated list of supported site names.
  • --disable-site DISABLE_SITE: Disable specific site(s) by providing a comma-separated list of supported site names.
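
The snippet below is only an illustrative sketch of what the site section of config.yaml might look like; the key names are assumptions, not the project's confirmed schema, so check the repository's config.yaml for the actual fields:

# hypothetical config.yaml layout (key names are assumptions)
sites:
  grailed: true
  depop: true
  goat: false
  stockx: false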

Search Options:

  • -s SEARCH, --search SEARCH: Specify a search query to scrape for.

Output Options:

  • If no output option is specified, the scraper prints the result as a table on the command line.
  • -j, --json: Output the result as JSON.
  • -c, --csv: Output the result as CSV.
  • -y, --yaml: Output the result as YAML.
  • -o OUTPUT, --output OUTPUT: Specify the output file name (without extension).
  • --output-dir OUTPUT_DIR: Specify the output directory.

Example Usage:

To enable only Grailed and Depop sites, search for "Nike Air Force", and output the result as JSON to a file named "output.json" in the "data" directory, the command would be:

poetry run python main.py --enable-site Grailed,Depop --search "Nike Air Force" -j -o output --output-dir data
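
If the project was installed with the virtual-environment method instead of Poetry, the same command should work without the poetry run prefix:

# same example, run inside the activated virtual environment
python main.py --enable-site Grailed,Depop --search "Nike Air Force" -j -o output --output-dir data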

License

Apache License 2.0


fashioncrawler's Issues

Setting up CI/CD Environment with Poetry

Description:

To improve the development process, I'm setting up a CI/CD environment integrated with Poetry. The pipeline will automate the build, test, and deployment processes, ensuring efficiency and consistency in the development workflow (a minimal workflow sketch follows the task list below). The following tasks need to be addressed:

  1. CI/CD Tool Selection: Research and select a CI/CD tool compatible with Poetry and suited to the project's requirements.
  2. Pipeline Configuration: Define the stages and steps of the CI/CD pipeline, including building, testing, and deploying the project with Poetry.
  3. Integration with GitHub: Integrate the CI/CD pipeline with the GitHub repository so builds are triggered automatically on code changes.
  4. Artifact Management: Establish a process for managing artifacts produced by the CI/CD pipeline, ensuring traceability and reproducibility.
  5. Monitoring and Alerts: Set up monitoring and alerting mechanisms to detect and respond to failures or issues in the CI/CD pipeline.

Tasks:

  • Research and evaluate CI/CD tools compatible with Poetry and suited to the project's requirements.

  • Configure the CI/CD pipeline to automate the build, test, and deployment processes using Poetry.

  • Integrate the CI/CD pipeline with the version control system to trigger builds on code changes.

  • Establish artifact management practices to store and track artifacts produced by the CI/CD pipeline.

  • Implement monitoring and alerting mechanisms to ensure the reliability and stability of the CI/CD pipeline.
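
As a starting point, a GitHub Actions workflow roughly like the following could run the test suite with Poetry. The file path, Python version, and use of pytest are assumptions; none of this is decided yet:

# .github/workflows/ci.yml -- minimal sketch; runner, Python version, and test command are assumptions
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pipx install poetry        # pipx is preinstalled on GitHub-hosted runners
      - run: poetry install
      - run: poetry run pytest          # assumes pytest is (or will be) the test runner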

HTML and PDF output

Add the ability to output the information scraped by the application in both HTML and PDF formats. This feature will enhance the usability of the application by allowing users to view and share scraped data in a more accessible and versatile manner.

Proposed Implementation:

  1. HTML Output:

    • Utilize Jinja2 templating engine to generate HTML output from the scraped data.
    • Design HTML templates to present the scraped information in a structured and visually appealing format. See image below
    • Ensure compatibility with modern web browsers and responsive design principles for optimal viewing experience across devices.
  2. PDF Output:

    • Convert the rendered HTML output to PDF format for offline viewing and sharing.
    • Investigate suitable libraries or tools for converting HTML to PDF, considering factors such as performance and customization options.
    • Implement a solution that integrates seamlessly with the application and produces high-quality PDF output (a minimal sketch follows this list).
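
As a rough illustration of the HTML side, here is a minimal Jinja2 rendering sketch; the template name and the fields on the scraped items are assumptions, not the project's actual code:

# minimal Jinja2 rendering sketch -- template name and item fields are assumptions
from jinja2 import Environment, FileSystemLoader

def render_report(items, template_dir="templates", template_name="report.html.j2"):
    # items is assumed to be a list of dicts, e.g. {"title": ..., "price": ..., "url": ...}
    env = Environment(loader=FileSystemLoader(template_dir), autoescape=True)
    template = env.get_template(template_name)
    return template.render(items=items)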

Use Cases:

  • Users can generate HTML reports containing the scraped data for easy viewing and sharing via web browsers.
  • Users can export the HTML reports to PDF format for offline access or distribution.

Requirements:

  • Ensure compatibility with common web browsers and PDF viewers.
  • Support customization options for the HTML templates, such as styling and layout configurations.
  • Implement error handling mechanisms to gracefully handle edge cases during HTML to PDF conversion.

Dependencies:

  • Investigate and select a suitable library or tool for HTML to PDF conversion (one candidate is sketched after this list).
  • Ensure compatibility with the existing data scraping functionality and data processing pipeline.
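
For example, WeasyPrint is one candidate library that can convert an HTML string to PDF; whether it is the right fit for the project is still to be evaluated:

# one possible HTML-to-PDF option (WeasyPrint), shown only as a candidate
from weasyprint import HTML

def html_to_pdf(html_string, pdf_path="report.pdf"):
    HTML(string=html_string).write_pdf(pdf_path)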
