
resume-categorizer's Introduction

New Plan!

Reason for starting from scratch

As you can see, there's a folder here called "legacy". It contains the logic and code I previously wrote for this open source resume categorizer. Version 1 worked fine: it categorized resumes correctly and generated a CSV for HR to look at. However, as you can tell from its README, it is very complicated to set up, even though technologies that would have made it simple already existed. That was 2017, when I had very little experience with Python. I also didn't know how Docker worked, so containerizing the whole process wasn't an option. I could have tried to learn it, but I probably wouldn't have understood what Docker was. Now it's 2020 and I finally have plenty of experience with both Python and Docker. When I decided to maintain this repository, I realized that just setting the whole thing up would take years, and troubleshooting even longer. So I decided to start from scratch. My goal is to make this repo usable with a few commands and to dockerize it.

So far

  1. Install poppler (for pdf2image)
  2. Install the Python library pdf2image (https://github.com/Belval/pdf2image)
    • Needed because pytesseract accepts images, not PDFs
  3. Install tesseract (for pytesseract)
  4. Install the Python library pillow (for pytesseract)
  5. Install the Python library pytesseract
  6. Install the Python library spacy
  7. Install en_core_web_sm from spacy
  8. Install Django
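
A quick way to confirm that the steps above took effect is to check for the two executables and the Python packages. This is only a sketch: the function name `check_environment` is made up for illustration, and it assumes poppler exposes `pdftoppm` on the PATH (the tool pdf2image calls) and that the packages use their usual import names (pillow is imported as `PIL`).

```python
import importlib.util
import shutil

# Executables: pdftoppm ships with poppler, tesseract is used by pytesseract.
REQUIRED_BINARIES = ("pdftoppm", "tesseract")

# Import names, not pip names (pillow is imported as PIL).
REQUIRED_MODULES = (
    "pdf2image", "PIL", "pytesseract", "spacy", "en_core_web_sm", "django",
)


def check_environment():
    """Return a dict mapping each requirement to True (found) or False."""
    status = {}
    for binary in REQUIRED_BINARIES:
        status[binary] = shutil.which(binary) is not None
    for module in REQUIRED_MODULES:
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{'ok     ' if found else 'MISSING'}  {name}")
```

Anything reported as MISSING points back to the matching install step.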

General Idea

  • The solution needs to be ready for a user to run with a few commands
  • Dockerize!

Process

  1. PDF to JPEG
    • poppler & pdf2image
  2. OCR on JPEG
    • pytesseract
    • This takes quite a while. I could run this work as an async job.
  3. Analyse the resulting text and populate the CSV file with the elements found
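
The three steps above could be wired together roughly as follows. This is a sketch under assumptions, not the project's actual code: the function names are made up, `analyse` is a hypothetical callable that turns text into a dict of found elements, and page images are kept in memory as PIL images instead of being saved as JPEG files. OCR runs in a thread pool as one possible take on the "async job" idea, since each pytesseract call waits on an external tesseract process.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def pdf_to_images(pdf_path):
    """Step 1: render each PDF page as a PIL image (requires poppler)."""
    from pdf2image import convert_from_path  # imported lazily on first use
    return convert_from_path(pdf_path)


def ocr_image(image):
    """Step 2: run Tesseract on one page image and return its text."""
    import pytesseract
    return pytesseract.image_to_string(image)


def resume_text(pdf_path, to_images=pdf_to_images, ocr=ocr_image, workers=4):
    """OCR all pages of one PDF in parallel threads, keeping page order."""
    pages = to_images(pdf_path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return "\n".join(pool.map(ocr, pages))


def categorize(pdf_paths, csv_path, analyse,
               to_images=pdf_to_images, ocr=ocr_image):
    """Step 3: analyse each resume's text and write one CSV row per file.

    `analyse` is a hypothetical callable: text -> dict of found elements.
    """
    rows = []
    for path in pdf_paths:
        row = {"file": path}
        row.update(analyse(resume_text(path, to_images, ocr)))
        rows.append(row)
    # One column per key that any resume produced, after the file name.
    columns = ["file"] + sorted({key for row in rows for key in row} - {"file"})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

For example, `categorize(["resume.pdf"], "resumes.csv", analyse=lambda text: {"chars": len(text)})` would write a CSV with a `file` and a `chars` column. The `to_images` and `ocr` parameters exist so that either step can be swapped out or stubbed in tests.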

Things to think about

  • Find a way to accept files from a GUI
    • Replace the RDBMS. It is not a good solution.
  • Version 1 used Google NLP to identify which text is what. Hopefully there's a way to do this without Google :(
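
One Google-free candidate is the spaCy model already listed under "So far". The sketch below groups the named entities spaCy finds by label (for example PERSON or ORG). The function names are made up for illustration, and whether en_core_web_sm is accurate enough on OCR'd resume text is an open question, not something this sketch establishes.

```python
def load_model(name="en_core_web_sm"):
    """Load a spaCy pipeline (imported lazily so the rest runs without it)."""
    import spacy
    return spacy.load(name)


def label_entities(text, nlp):
    """Group entity strings by label, e.g. {"PERSON": [...], "ORG": [...]}.

    `nlp` is any callable that returns an object with an `ents` sequence
    whose items have `text` and `label_`, which is what a spaCy pipeline
    returns.
    """
    found = {}
    for ent in nlp(text).ents:
        values = found.setdefault(ent.label_, [])
        if ent.text not in values:  # skip repeated mentions
            values.append(ent.text)
    return found
```

Usage would look like `label_entities(resume_text, load_model())`, with the PERSON entries as candidates for the applicant's name.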


resume-categorizer's Issues

Email extraction

What's up with email extraction? In many cases no email is extracted. This needs to be debugged.
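
The cause is not stated here. One plausible culprit with OCR output, offered purely as an assumption, is stray spaces around the "@" sign. The sketch below removes those spaces before applying a deliberately simple regular expression. The pattern is not a full RFC 5322 matcher, and the space removal can produce false positives, for instance when a word is followed by a Twitter-style handle.

```python
import re

# Simple email-like pattern: local part, "@", then a dotted domain.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

# Spaces or tabs around "@" on the same line (assumed OCR artifact).
SPACED_AT_RE = re.compile(r"[ \t]*@[ \t]*")


def extract_emails(text):
    """Return unique email-like strings in order of first appearance."""
    cleaned = SPACED_AT_RE.sub("@", text)
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(cleaned):
        key = match.lower()  # treat addresses as case-insensitive
        if key not in seen:
            seen.add(key)
            emails.append(match)
    return emails
```

Running this over the raw OCR text of resumes that currently come back with no email would show whether spacing is really the problem or whether the address is being lost earlier, during OCR.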

Optimization on locating names

The name extraction function needs to be optimized because it is...

  1. Using too much computing power
    • Google NLP
    • Stanford Core NLP
  2. Using too many lists and loops
    • Possible optimization points

Better way of interacting with users

  • Currently, users can view uploaded files ONLY through their browser.
  • Is there a way for users to download a folder containing the resulting Excel file (or HTML file) along with all the resumes, so they can work with it on their own computer instead of only through a browser?
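
One possible answer is to pack the report and the resumes into a single zip archive that the user downloads. The sketch below uses only the standard library. The function name and the `resumes/` subfolder are made up for illustration, and how the archive is served to the user (for example from a Django view) is left open.

```python
import os
import zipfile


def bundle_results(report_path, resume_paths, zip_path):
    """Write the report and all resumes into one zip; return archived names.

    Resumes are stored by base name under "resumes/", so two files with
    the same base name would produce duplicate entries; callers should
    rename such files first.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(report_path, arcname=os.path.basename(report_path))
        for path in resume_paths:
            archive.write(path, arcname="resumes/" + os.path.basename(path))
        return archive.namelist()
```

If the report is an HTML file, relative links such as `resumes/jane.pdf` would keep working after the user unpacks the archive locally.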

Docker

Need to use Docker for running and setting up Clowder and Stanford Core NLP. Clowder requires too much configuration to be usable. This process needs to be automated.
