peppapig450 / fashioncrawler

FashionCrawler is a versatile Python tool designed for scraping data from various online shopping platforms including Grailed, Depop, and more.

License: Apache License 2.0

Python 95.83% Jinja 2.95% CSS 1.22%
data-extraction fashion python web-scraping

Web Scraper for Fashion Marketplace Sites

A Python tool for scraping multiple shopping websites such as Grailed, Depop, GOAT, and StockX (with possibly more to come).

Introduction

This project aims to provide a convenient interface for scraping product listings and related data from various online shopping platforms.

This project grew out of my AP Computer Science Principles project, which was just a Grailed scraper. I wanted to expand it to more sites, so I created this. The original is here.

Project Plan

To-Do List / Possible Features:

  • Implement logging

  • Implement Depop data extraction and scraping.

  • Figure out how we're gonna handle the respective scrapers. Line 10

  • Refactor directory structure to the type found here

  • Figure out a way to visualize the data (HTML)

  • Feature to specify how many items we want to scrape (command line and config file)

  • Implement StockX data extraction and scraping.

  • Instead of scraping StockX for market data, use their API (maybe use Go for speed).

  • Options to filter the dataframe by a category

  • Process the outputted files and filter or maybe display visually

  • Add headless mode and print progress updates to stdout

  • Keep Poetry and requirements.txt synchronized

Installation

Install using Poetry (recommended):

# clone repository
git clone https://github.com/peppapig450/FashionCrawler

# switch to directory
cd FashionCrawler

# install dependencies
poetry install

Install using a virtual environment:

# clone repository
git clone https://github.com/peppapig450/FashionCrawler

# switch to directory
cd FashionCrawler

# setup and activate virtual environment
python3 -m venv venv && source venv/bin/activate

# install dependencies
pip install -r requirements.txt

Usage

Below are the available options for running the scraper.

Options:

Site Selection:

  • By default, all supported sites are enabled, unless sites are specified in the config.yaml file.
  • --enable-site ENABLE_SITE: Enable specific site(s) by providing a comma-separated list of supported site names.
  • --disable-site DISABLE_SITE: Disable specific site(s) by providing a comma-separated list of supported site names.
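
To make the selection rules above concrete, here is a minimal sketch of how two comma-separated lists could be turned into a final list of sites. It is not FashionCrawler's actual code: the names SUPPORTED_SITES, parse_site_list, and resolve_sites are hypothetical, the site list is taken from the sites this README mentions, and the case-insensitive matching is an assumption. The config.yaml fallback is left out.

```python
# Hypothetical list, based on the sites named in this README.
SUPPORTED_SITES = ("grailed", "depop", "goat", "stockx")


def parse_site_list(raw):
    """Split a comma-separated string into lowercase site names."""
    if not raw:
        return []
    return [name.strip().lower() for name in raw.split(",") if name.strip()]


def resolve_sites(enable=None, disable=None, supported=SUPPORTED_SITES):
    """Return the sites to scrape, in the order they appear in `supported`."""
    enabled = parse_site_list(enable)
    disabled = parse_site_list(disable)

    # Reject names that are not in the supported list (assumed behavior).
    unknown = [site for site in enabled + disabled if site not in supported]
    if unknown:
        raise ValueError(f"Unsupported site(s): {', '.join(unknown)}")

    # With no explicit enable list, start from every supported site.
    selected = enabled or list(supported)
    return [site for site in supported if site in selected and site not in disabled]


if __name__ == "__main__":
    print(resolve_sites(enable="Grailed,Depop"))
    print(resolve_sites(disable="StockX"))
```

In this sketch an empty enable list means "all sites", which matches the default described above, and the disable list is applied last so it always wins.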

Search Options:

  • -s SEARCH, --search SEARCH: Specify a search query to scrape for.

Output Options:

  • If no output option is specified, the scraper prints the result as a table on the command line.
  • -j, --json: Output the result as JSON.
  • -c, --csv: Output the result as CSV.
  • -y, --yaml: Output the result as YAML.
  • -o OUTPUT, --output OUTPUT: Specify the output file name (without extension).
  • --output-dir OUTPUT_DIR: Specify the output directory.
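
For readers who want to see how the options listed in this section fit together, the following is a rough argparse sketch that mirrors them. It is an illustration only and may differ from the real main.py: build_parser is a hypothetical name, the help strings are paraphrased from this README, the site flags use the spelling from the example command (--enable-site), and nothing here enforces how the output flags interact.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options documented in this README."""
    parser = argparse.ArgumentParser(
        description="Scrape product listings from fashion marketplace sites."
    )

    sites = parser.add_argument_group("Site Selection")
    sites.add_argument("--enable-site", help="comma-separated sites to enable")
    sites.add_argument("--disable-site", help="comma-separated sites to disable")

    search = parser.add_argument_group("Search Options")
    search.add_argument("-s", "--search", help="search query to scrape for")

    output = parser.add_argument_group("Output Options")
    output.add_argument("-j", "--json", action="store_true", help="output as JSON")
    output.add_argument("-c", "--csv", action="store_true", help="output as CSV")
    output.add_argument("-y", "--yaml", action="store_true", help="output as YAML")
    output.add_argument("-o", "--output", help="output file name (no extension)")
    output.add_argument("--output-dir", help="output directory")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

argparse derives the attribute names from the long flags, so --enable-site is read back as args.enable_site and --output-dir as args.output_dir.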

Example Usage:

To enable only Grailed and Depop sites, search for "Nike Air Force", and output the result as JSON to a file named "output.json" in the "data" directory, the command would be:

poetry run python main.py --enable-site Grailed,Depop --search "Nike Air Force" -j -o output --output-dir data
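
The command above combines a file name without extension (-o output) with a directory (--output-dir data) to produce data/output.json. A minimal sketch of that step, assuming the results are a list of dictionaries, could look like the code below. The function name write_json is hypothetical, and creating the directory when it is missing is an assumption rather than documented behavior.

```python
import json
from pathlib import Path


def write_json(rows, name, output_dir="."):
    """Write `rows` to <output_dir>/<name>.json and return the file path."""
    directory = Path(output_dir)
    # Assumption: create the output directory if it does not exist yet.
    directory.mkdir(parents=True, exist_ok=True)

    path = directory / f"{name}.json"
    path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Made-up sample row, only to show the call shape.
    sample = [{"title": "Example listing", "site": "grailed"}]
    print(write_json(sample, "output", "data"))
```

The extension is appended inside the function, which is why the file name is passed without one.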

License

Apache License 2.0
