jianlai-ng / webscraping_professionalnetworkingsites_companyreviewsites

Web scraping of professional networking sites and company review sites to build company profiles and run online background checks, as part of a fraud detection project that flags shell companies

Jupyter Notebook 100.00%

webscraping_professionalNetworkingSites_companyReviewSites

Company_Check.ipynb contains a technical demo of how a company's background, online presence and network are checked and verified.

The main function of this notebook is written and run as:

cred_check(company_name,n1,n2,n3), which takes a company name as input and returns 'Fail to Authenticate', 'Low', 'Mid', or 'High' as the company's level of credibility.

-'Fail to Authenticate' if the URL of the company's official site is not the same across sources, or the company's CEO cannot be verified with a Google search from site(s) other than the original source
-'Low' if the company passes both authentication checks above
-'Mid' if the 'Low' criteria are fulfilled, the awards stated on professional networking site(s) can be verified with a Google search from site(s) other than the original source, and the size of the network of employees on professional networking site(s) is > n1
-'High' if the 'Mid' criteria are fulfilled, the similarity between company reviews from company review site(s), processed with NLP and measured with the Sørensen–Dice coefficient, is < n2, and the number of job openings stated on professional networking site(s) and company review site(s) is > n3
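
The review-similarity measure in the 'High' tier can be illustrated with a minimal sketch. This is not the notebook's implementation: the tokenisation (lowercased word sets), the averaging over review pairs, and the names `tokens`, `dice_similarity` and `mean_pairwise_dice` are all assumptions made for illustration. It only shows how a Sørensen–Dice coefficient between two reviews could be computed and compared against n2.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercase a review and return its set of word tokens (assumed preprocessing)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def dice_similarity(review_a, review_b):
    """Sørensen–Dice coefficient 2|A ∩ B| / (|A| + |B|) over token sets."""
    a, b = tokens(review_a), tokens(review_b)
    if not a and not b:
        return 1.0  # two empty reviews are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


def mean_pairwise_dice(reviews):
    """Average Dice similarity over all unordered pairs of reviews (hypothetical aggregate)."""
    pairs = list(combinations(reviews, 2))
    if not pairs:
        return 0.0  # fewer than two reviews: nothing to compare
    return sum(dice_similarity(x, y) for x, y in pairs) / len(pairs)
```

For example, `dice_similarity("good pay", "good hours")` shares one token out of two on each side, giving 2 * 1 / (2 + 2) = 0.5, which would sit exactly on the default n2 and so would not satisfy the strict `< n2` condition.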

n1, n2 and n3 default to 100, 0.5 and 20 respectively when they are not passed explicitly. Their final values are to be determined after mass scraping and analysis of trends across the different industries.
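
The tiering rules and default thresholds can be summarised in a short sketch. It assumes the scraping steps have already produced the individual signals; the function name `credibility_level` and its boolean and numeric parameters are hypothetical and are not the signature of `cred_check` in the notebook, which takes only a company name and the three thresholds.

```python
def credibility_level(urls_match, ceo_verified, awards_verified,
                      network_size, review_similarity, job_openings,
                      n1=100, n2=0.5, n3=20):
    """Map already-collected signals to a credibility tier (illustrative only)."""
    # Authentication: official URL consistent across sources and CEO verified.
    if not (urls_match and ceo_verified):
        return "Fail to Authenticate"
    # 'Mid' additionally needs verified awards and an employee network larger than n1.
    if not (awards_verified and network_size > n1):
        return "Low"
    # 'High' additionally needs review similarity below n2 and more than n3 job openings.
    if not (review_similarity < n2 and job_openings > n3):
        return "Mid"
    return "High"
```

With the defaults, `credibility_level(True, True, True, 500, 0.7, 50)` returns 'Mid' because the review similarity of 0.7 is not below n2 = 0.5. All comparisons are strict, following the `>` and `<` in the list above.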

To run the notebook successfully, please make sure that the following criteria are fulfilled.

  1. The packages in requirements.txt are installed (Selenium is used because the sites require logging in with credentials)
  2. chromedriver is within the same directory as Company_Check.ipynb
  3. cred.txt is within the same directory as Company_Check.ipynb and contains:
     - email, //account email for Professional Network Site and Company Review Site
     - password, //account password for Professional Network Site and Company Review Site
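
The setup above could be wired together roughly as follows. The exact layout of cred.txt is not specified here, so this sketch assumes one value per line (email first, password second), with an optional trailing comma and `//` comment on each line. The helper names `load_credentials` and `make_driver` are hypothetical, and `make_driver` assumes Selenium 4 with a chromedriver binary in the working directory.

```python
from pathlib import Path


def load_credentials(path="cred.txt"):
    """Read (email, password) from cred.txt, assuming one value per line."""
    values = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # Drop a trailing '// comment' and an optional trailing comma.
        # Caveat: a password that itself contains '//' would be cut short.
        value = line.split("//", 1)[0].strip().rstrip(",").strip()
        if value:
            values.append(value)
    if len(values) < 2:
        raise ValueError("expected an email line and a password line")
    return values[0], values[1]


def make_driver(driver_path="./chromedriver"):
    """Start Chrome through the local chromedriver binary (Selenium 4 style)."""
    # Imported here so the credential loader works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

Keeping cred.txt out of version control (for example through .gitignore) avoids publishing the account password along with the notebook.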

*The code provided demonstrates a low-overhead method of accessing such information. Please obtain agreement from these sites before running the scraping activities labelled Function A and B in the navigational header within the notebook.
