job-pulse-pure-scrape's Introduction

JobPulse.fyi Open Source Web Scraper

JobPulse.fyi tracks software engineering and product manager openings tailored for students. This repository is part of the JobPulse.fyi project and scrapes job information from company websites, using the Google Custom Search JSON API to find job listing pages.

Features

Job search: Given a query and a website, the scraper searches for job listings that match the query.
Data extraction: The scraper visits each job listing page and extracts relevant data, such as job title, years of experience, company, application link, location, and job description.
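
As a rough illustration of the job search step, the sketch below builds a Google Custom Search JSON API request restricted to a single site. The function names `build_search_url` and `search_jobs` are hypothetical and not taken from this repository, and restricting results with the `site:` operator is an assumption about how the query and site are combined. It uses only the standard library so it runs without extra packages; the project itself lists `requests` as a dependency.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the Google Custom Search JSON API.
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, site, api_key, cx_key, start=1):
    """Build a Custom Search request URL restricted to one site (hypothetical helper)."""
    params = {
        "key": api_key,               # Google API key
        "cx": cx_key,                 # Search Engine ID
        "q": f"{query} site:{site}",  # assumption: site restriction via the site: operator
        "start": start,               # index of the first result to return
    }
    return f"{SEARCH_ENDPOINT}?{urllib.parse.urlencode(params)}"


def search_jobs(query, site, api_key, cx_key):
    """Return (title, link) pairs from the first page of results (hypothetical helper)."""
    url = build_search_url(query, site, api_key, cx_key)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    # The "items" key is absent when the search returns no results.
    return [(item["title"], item["link"]) for item in payload.get("items", [])]
```

Calling `search_jobs("new grad software engineer", "example.com", api_key, cx_key)` with valid keys would return the first page of matching links, which the data extraction step could then visit one by one.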

Getting Started

Prerequisites

Python 3.7 or above
Packages: BeautifulSoup, selenium, pytz, requests
Google API key
OpenAI API key

Installation

Clone this repository:
Install the required packages:
```
pip install -r requirements.txt
```
Get a Google API Key:
- Follow the steps from Google Custom Search JSON API to obtain a Google API key and a Search Engine ID (cx key).
Get an OpenAI API Key:
- Follow the steps from OpenAI to get an API key.
Set the environment variables:
- Copy the .env.example file to a new file named .env and fill in the appropriate keys:
```
GOOGLE_API_KEY=your_google_api_key
CX_KEY=your_cx_key
OPENAI_KEY=your_openai_key
```
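
One possible way to pick these values up at runtime is sketched below. It is not necessarily how this repository loads its configuration: `load_env_file` and `read_api_keys` are hypothetical helpers, and the parser only handles plain `KEY=value` lines like the ones shown above.

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines into os.environ; values already set are kept."""
    with open(path, encoding="utf-8") as env_file:
        for raw_line in env_file:
            line = raw_line.strip()
            # Skip blank lines, comments, and anything without an '='.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            os.environ.setdefault(key.strip(), value.strip())


def read_api_keys():
    """Return the three keys named in .env.example, failing early if any is missing."""
    names = ("GOOGLE_API_KEY", "CX_KEY", "OPENAI_KEY")
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}
```

Failing early on a missing key tends to give a clearer error than a rejected API request later in the run.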

Usage

Modify the query and site variables in the main function as per your requirements.
Run the code:
```
python3 src/main.py --run_pure
```

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Please feel free to contact us if you have any questions about the project.

Join us on Discord: Discord Link

Happy Coding!

This README is subject to updates; please stay tuned for any changes.

jobPosting schema class

Mandatory:

apply_link: str
company: str
date_added: str
title: str

Optional:

description: str
location: str
category: "Software Engineer"
title_correct_by_gpt: True
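
The schema above could be expressed as a Python dataclass along these lines. This is a sketch rather than the project's actual class: the name `JobPosting` and the `None` defaults for optional fields are assumptions, and `category` and `title_correct_by_gpt` are typed from the example values shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobPosting:
    # Mandatory fields: must be supplied when creating a posting.
    apply_link: str
    company: str
    date_added: str
    title: str
    # Optional fields: default to None when not provided (assumed default).
    description: Optional[str] = None
    location: Optional[str] = None
    category: Optional[str] = None              # e.g. "Software Engineer"
    title_correct_by_gpt: Optional[bool] = None  # e.g. True
```

Omitting any mandatory field raises a `TypeError` at construction time, so incomplete postings are caught before they are stored.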

```
import time

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By


# The function name and signature are illustrative; the original snippet
# only shows the function body and assumes an existing Selenium `driver`.
def collect_links(driver):
    links = []
    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(10.1)  # Wait for the dynamic content to load

        # Extract links after the scroll (links seen earlier are appended again)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for a in soup.find_all('a', href=True):
            links.append((a.text, a['href']))

        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # The page did not grow, so stop scrolling
            break
        last_height = new_height

        try:
            # find_element_by_link_text was removed in Selenium 4
            next_button = driver.find_element(By.LINK_TEXT, 'Next')  # Or use an appropriate locator
            next_button.click()
            time.sleep(1)  # Wait for the next page to load
        except Exception as e:
            print(f"Exception while clicking next button: {e}")

    driver.quit()
    return links
```

```
# Print only the links whose text suggests an early-career role
keywords = ("new grad", "early career", "campus", "recent grad", "recent graduate")
for link in links:
    if any(keyword in link[0].lower() for keyword in keywords):
        print(f"Link text: {link[0]}, URL: {link[1]}")
```
