job-pulse-pure-scrape's Introduction

JobPulse.fyi Open Source Web Scraper

JobPulse.fyi is a powerful tool that tracks software engineering and product manager openings tailored for students. This repository is a part of the JobPulse.fyi project and is designed to scrape job information from company websites using Google's API.

Features

  • Job search: Given a query and a website, the scraper searches for job listings that match the query.
  • Data extraction: The scraper visits each job listing page and extracts relevant data, such as job title, years of experience, company, application link, location, and job description.
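The job search step could be sketched roughly as below. This is only an illustration under an assumption: that "Google's API" means the Custom Search JSON API (the `CX_KEY` setting used later suggests a search engine ID). The helper name `build_search_request` and the example query and site are hypothetical and not taken from this repository.

```python
from urllib.parse import urlencode

# Public endpoint of Google's Custom Search JSON API (assumed to be the API in use).
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_request(query, site, api_key, cx_key, start=1):
    """Build the URL for one page of results (hypothetical helper)."""
    params = {
        "key": api_key,               # GOOGLE_API_KEY
        "cx": cx_key,                 # CX_KEY (search engine ID)
        "q": f"{query} site:{site}",  # restrict results to one company site
        "start": start,               # 1-based index of the first result
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_request(
        "software engineer intern",
        "careers.example.com",
        "your_google_api_key",
        "your_cx_key",
    ))
```

The returned URL can then be fetched with `requests.get`; each result link would be passed to the data extraction step.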

Getting Started

Prerequisites

  • Python 3.7 or above
  • Packages: BeautifulSoup, selenium, pytz, requests
  • Google API key
  • OpenAI API key

Installation

  1. Clone this repository:

  2. Install the required packages:

    pip install -r requirements.txt
  3. Get a Google API Key:

  4. Get an OpenAI API Key:

    • Follow the steps from OpenAI to get an API key.
  5. Set the environment variables:

    • Copy the .env.example file to a new file named .env and fill in the appropriate keys:

      GOOGLE_API_KEY=your_google_api_key
      CX_KEY=your_cx_key
      OPENAI_KEY=your_openai_key
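A minimal sketch of how the three keys might be read and validated at startup. It assumes the values from `.env` have already been loaded into the process environment; how this repository does that is not shown here. The function name `load_keys` is hypothetical.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "CX_KEY", "OPENAI_KEY")


def load_keys(env=None):
    """Return the required keys as a dict; raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_KEYS}
```

Failing early with a list of the missing names is usually easier to debug than an authentication error from the first API call.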

Usage

  1. Modify the query and site variables in the main function as per your requirements.

  2. Run the code:

    python3 src/main.py --run_pure
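For orientation, a boolean flag such as `--run_pure` is commonly handled with `argparse`, as in the sketch below. This is an assumption about the general shape of the entry point, not the actual contents of `src/main.py`; the help text and the `parse_args` name are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags (illustrative; not the repository's code)."""
    parser = argparse.ArgumentParser(description="Job scraper entry point (sketch)")
    parser.add_argument(
        "--run_pure",
        action="store_true",  # flag is False unless given on the command line
        help="run the scraper",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    if args.run_pure:
        print("Scraper would run here with the query and site set in main().")
```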

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Please feel free to contact us if you have any questions about the project.

Join us on Discord: Discord Link

Happy Coding!

This README is subject to updates; please stay tuned for any changes.

jobPosting schema class

Mandatory:

  • apply_link: str
  • company: str
  • date_added: str
  • title: str

Optional:

  • description: str
  • location: str
  • category: "Software Engineer"
  • title_correct_by_gpt: True
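The schema above could be expressed as a Python dataclass along these lines. This is a sketch, not the repository's actual class: the name `JobPosting`, the `None` defaults, and the example values (including the date format) are assumptions. The types of `category` and `title_correct_by_gpt` are inferred from the example values listed above.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class JobPosting:
    # Mandatory fields
    apply_link: str
    company: str
    date_added: str
    title: str
    # Optional fields
    description: Optional[str] = None
    location: Optional[str] = None
    category: Optional[str] = None                # e.g. "Software Engineer"
    title_correct_by_gpt: Optional[bool] = None   # e.g. True


# Hypothetical example record
posting = JobPosting(
    apply_link="https://example.com/apply",
    company="Example Corp",
    date_added="2023-07-01",
    title="Software Engineer, New Grad",
    category="Software Engineer",
)
record = asdict(posting)  # plain dict, convenient for JSON or database storage
```

Omitting any of the four mandatory fields raises a `TypeError` at construction time, which mirrors the mandatory/optional split.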

job-pulse-pure-scrape's People

Contributors

alex-wengg, an-bluecat

job-pulse-pure-scrape's Issues

Manually scrape company sites for openings

Manual scraping idea: we implemented this, but it doesn't work for all companies. If you have a better idea for implementing it, contact me.
Currently we are trying to write one-size-fits-all code:

```python
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Phrases in the link text that suggest a student or new-graduate opening
KEYWORDS = ("new grad", "early career", "campus", "recent grad", "recent graduate")


def extract_links(url):
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    time.sleep(3)  # Wait for the initial page load

    links = []
    last_height = driver.execute_script("return document.body.scrollHeight")  # Get scroll height

    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(10.1)  # Wait for the dynamic content to load

        # Extract links after scroll
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for a in soup.find_all('a', href=True):
            link = (a.text, a['href'])
            if link not in links:  # Skip links already collected on an earlier pass
                links.append(link)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:  # If the heights are the same, stop scrolling
            break

        last_height = new_height

        try:
            # find_element_by_link_text was removed in Selenium 4; use a By locator
            next_button = driver.find_element(By.LINK_TEXT, 'Next')  # Or use appropriate locator
            next_button.click()
            time.sleep(1)  # Wait for the next page to load
        except Exception as e:
            print(f"Exception while clicking next button: {e}")

    driver.quit()

    return links


def main():
    # Career pages tried so far; only the last assignment takes effect.
    url = "https://nvidia.wd5.myworkdayjobs.com/NVIDIAExternalCareerSite"
    url = "https://careers.google.com/jobs/results/?distance=50&employment_type=FULL_TIME&employment_type=PART_TIME&employment_type=TEMPORARY&jex=ENTRY_LEVEL&q=Software%20Engineer"
    url = "https://www.amazon.jobs/en/business_categories/student-programs?offset=0&result_limit=10&sort=relevant&category%5B%5D=software-development&job_type%5B%5D=Full-Time&distanceType=Mi&radius=24km&latitude=&longitude=&loc_group_id=&loc_query=&base_query=&city=&country=&region=&county=&query_options=&"
    url = "https://www.metacareers.com/careerprograms/students/?p[teams][0]=Internship%20-%20Engineering%2C%20Tech%20%26%20Design&p[teams][1]=Internship%20-%20Business&p[teams][2]=Internship%20-%20PhD&p[teams][3]=University%20Grad%20-%20PhD%20%26%20Postdoc&p[teams][4]=University%20Grad%20-%20Engineering%2C%20Tech%20%26%20Design&p[teams][5]=University%20Grad%20-%20Business&teams[0]=Internship%20-%20Engineering%2C%20Tech%20%26%20Design&teams[1]=Internship%20-%20Business&teams[2]=Internship%20-%20PhD&teams[3]=University%20Grad%20-%20PhD%20%26%20Postdoc&teams[4]=University%20Grad%20-%20Engineering%2C%20Tech%20%26%20Design&teams[5]=University%20Grad%20-%20Business#openpositions"
    url = "https://careers.microsoft.com/students/us/en/search-results"
    url = "https://careers.google.com/jobs/results/"
    url = "https://careers.airbnb.com/"
    url = 'https://www.tesla.com/careers/search/?query=engineer&type=1&region=5'
    url = "https://www.tesla.com/careers/search/?query=engineer&type=1&region=5&site=US"
    url = "https://jobs.apple.com/en-us/search?location=united-states-USA&sort=newest&search=software%20engineer"
    url = "https://tenstorrent.com/careers/"
    links = extract_links(url)

    for text, href in links:
        if any(keyword in text.lower() for keyword in KEYWORDS):
            print(f"Link text: {text}, URL: {href}")


if __name__ == "__main__":
    main()
```

But maybe we should do it company by company.
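One way a company-by-company approach might be organized is a small registry that maps a career-site hostname to its own link filter and falls back to the generic keyword filter otherwise. This is only a sketch of the idea: the registry, the function names, the `jobs.example.com` host, and its `/details/` rule are all hypothetical.

```python
from urllib.parse import urlparse

# Generic keywords, as in the one-size-fits-all code
KEYWORDS = ("new grad", "early career", "campus", "recent grad")

# Maps a hostname to the filter function registered for it
SITE_FILTERS = {}


def register(hostname):
    """Decorator that registers a site-specific filter for one hostname."""
    def decorator(func):
        SITE_FILTERS[hostname] = func
        return func
    return decorator


def generic_filter(links):
    """Fallback: keep links whose text contains one of the keywords."""
    return [(text, url) for text, url in links
            if any(keyword in text.lower() for keyword in KEYWORDS)]


@register("jobs.example.com")
def filter_example_jobs(links):
    """Hypothetical site rule: keep links that point at a job detail page."""
    return [(text, url) for text, url in links if "/details/" in url]


def filter_links(page_url, links):
    """Pick the filter for the page's hostname, else use the generic one."""
    host = urlparse(page_url).netloc
    return SITE_FILTERS.get(host, generic_filter)(links)
```

With this layout, supporting a new company means adding one registered function, and sites without a custom rule keep the current behavior.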
