victornguli / skraped

A CLI tool for scraping Kenyan job boards for tech jobs and saving them in CSV format for easier aggregation and job search



Skraped!


Skraped is a command-line tool that scrapes and aggregates jobs from several Kenyan job sites. The scraped job posts are saved in a CSV file, allowing a job seeker to conveniently review jobs from several sources.


Getting Started

These instructions will get a copy of Skraped running on your local machine for development and testing purposes.

Dependencies

Skraped requires Python 3.6.0 or later. Other library dependencies are declared in setup.py and will be installed automatically when the project is installed via setup.

Installation

pip install git+https://github.com/Victornguli/Skraped.git
skraper --help

or, for testing purposes, you can first clone the repository and then install skraped via setup.py:

git clone https://github.com/Victornguli/Skraped.git skraped
cd skraped
python setup.py install
skraper --help

Using Skraped

To run a scrape session, initialize the scraper with the keywords argument and an output path (where the scraped data and pickle backups will be saved). NOTE THAT the output directory is relative to the directory you run skraper from, unless explicitly declared as absolute, e.g. /home/{{username}}/jobsearch

skraper -o devjobs -kw "Software developer"

This will create a directory devjobs relative to the current directory, scrape the sites enabled in skraped/config/settings.yaml, and save the scraped data in a CSV file along with pickle-format backups.
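Because the output is plain CSV, the aggregated results can also be reviewed programmatically. The sketch below uses only Python's standard csv module; the read_jobs helper and the file name are illustrative, not part of Skraped's API, and the column names depend on the scraper's actual output:

```python
import csv

def read_jobs(csv_path):
    """Load scraped job rows from a CSV file written by skraper.

    Returns a list of dicts keyed by the CSV header row. Column names
    vary with the scraper's output, so inspect the header rather than
    assuming specific fields.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

For example, read_jobs("devjobs/jobs.csv") would return one dict per scraped job (the actual file name inside the output directory may differ).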

Extra Configuration

To customize your experience you can supply additional arguments when running the scraper, or define them in the settings.yaml file located at skraped/config/settings.yaml. You can override this file by copying the default file, updating it with your own configuration values, and running the scraper with the -s flag pointing to the location of your config file. Example:

skraper -o /home/{{your_username}}/jobsearch -s /home/{{your_username}}/jobsearch/custom_settings.yaml

In the YAML settings file you can configure the following options:

Option       Type          Description
output_path  string        Path where scraped data will be saved
sources      List[string]  List of sources to be scraped. Comment out a source to skip it when running the scraper
keywords     string        The keywords used when running job searches against the sources
delay        boolean       If enabled, adds a reasonable or random delay to the spiders to avoid rate limiting
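Put together, a custom settings file might look like the sketch below. The source names here are placeholders, and the exact keys and values should be checked against the default skraped/config/settings.yaml:

```yaml
output_path: /home/{{your_username}}/jobsearch
keywords: "Software developer"
# Comment out a source to skip it; these names are placeholders
sources:
  - source_one
  # - source_two
delay: true
```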

Local development

If you use the setup.py install method, you will need to re-run python setup.py install after ANY source code change so that the skraper command runs the latest version within your virtual environment.

Good luck in your job search :)

skraped's People

Contributors: dependabot[bot], victornguli

Forkers: toretawanda

skraped's Issues

Job details processing raises TypeError.

Job details that cannot be processed evaluate to None, and when appended to the scraped_data list they cause a TypeError when the merge_scrape_data method tries to extract details such as title from the None object.

Fix
A simple fix is to explicitly check that the returned job details are not None before appending them to the scrape_data list.
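The check described above can be sketched as follows. The function and variable names (process_job_details, scraped_data) are assumptions modelled on the issue text, not the project's actual code:

```python
def collect_job_details(jobs, process_job_details):
    """Append only successfully processed job details.

    process_job_details is assumed to return a dict on success and
    None when a job posting cannot be parsed.
    """
    scraped_data = []
    for job in jobs:
        details = process_job_details(job)
        # Skip unparseable postings instead of appending None, which
        # would later raise a TypeError when merging the results.
        if details is not None:
            scraped_data.append(details)
    return scraped_data
```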
