mrtrkmn / orbi

This repository keeps the files for the IDP at the Dr. Theo Schöller Chair of Technology and Innovation Management up to date.

Home Page: https://orbi.mrturkmen.com

orbi's Introduction

Crawl Data

This is a simple crawler that currently crawls company- and patent-related data from two websites.

  • main branch: possibly in sync with the run-on-github branch.
  • run-on-self-hosted: runs on a self-hosted computer and is updated more frequently than the main branch. Create a PR to main to receive updates.

How to run

./orbi contains the main script used to run the crawler. It contains two classes that crawl data from the websites.

  • Class Crawler: crawls data from the ipo.gov.uk and sec.gov websites. It generates a csv file with the name, city, country and CIK number of each company. Name, city and country are scraped from sec.gov; the CIK numbers are provided in an xlsx file.

  • Class Orbis: crawls data from the Orbis database. It uses batch search to find the companies from the csv file generated by the Crawler class. It also adds/removes columns to enhance the search results and exports the results to an xlsx file.

The whole process is automated with Selenium and ChromeDriver.

On Local Dev Machine

Because the script reads its settings from environment variables on GitHub Actions, the same environment variables must be set on your local machine. Passing all of them on the command line would be tedious, so I have created a config file that the script uses to load the environment variables. See the sample config file for reference.

  • Setup the requirements:
$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
  • The program has three main components: orbi/orbi.py, orbi/crawl.py and utils/visualize.py.

    • orbi/orbi.py is the main script. It starts the batch search on the Orbis database using the csv file generated from the given XLSX file.

    • orbi/crawl.py is the script that crawls data from the sec.gov website.

    • utils/visualize.py is the script that visualizes the data.

Orbi (batch search on orbis database)

This part explains how to run the script on a local machine. To run it remotely, see the On Remote section.

  • After setting up the virtual environment and installing the requirements, run the following command to start the batch search on your local machine. Before starting, make sure the config folder contains a config.yaml file with all required credentials.
$ LOCAL_DEV=True CONFIG_PATH=./config/config.yaml CHECK_ON_SEC=False python orbi/orbi.py
  • The command above launches the Chrome browser in non-headless mode and does not check the companies on the sec.gov website. It takes the cleaned company names, merges them into one column, and feeds that column to the Orbis database to get the results.

  • To add data from the sec.gov website, set CHECK_ON_SEC=True in the command above. (In our experiments this decreased the hit rate of companies, so it is preferably left off. It was added in the early stages of the project and later found to be unnecessary.)

Make sure that the path to the config file is defined correctly.
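
The environment variables in the command above are passed as strings, so a value such as CHECK_ON_SEC=False needs explicit parsing: in Python, bool("False") is True. The following is a minimal sketch of such parsing, not the code used in orbi/orbi.py. The helper names env_flag and load_settings, the set of accepted "true" strings, and the default config path are assumptions made for illustration.

```python
import os

# Strings treated as "true"; this set is an assumption for the sketch.
TRUE_STRINGS = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Read a boolean flag such as LOCAL_DEV=True from the environment."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    # Compare case-insensitively so "True", "true" and "TRUE" all work.
    return raw.strip().lower() in TRUE_STRINGS


def load_settings(environ=None):
    """Collect the three variables used in the run command (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {
        "local_dev": env_flag("LOCAL_DEV", environ=environ),
        "check_on_sec": env_flag("CHECK_ON_SEC", environ=environ),
        # Default path taken from the run command above; assumed, not confirmed.
        "config_path": environ.get("CONFIG_PATH", "./config/config.yaml"),
    }


settings = load_settings(
    {"LOCAL_DEV": "True", "CHECK_ON_SEC": "False", "CONFIG_PATH": "./config/config.yaml"}
)
```

Passing a plain dict instead of os.environ keeps the sketch easy to try without touching the real environment.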

Crawl (scraping data from sec.gov website)

  • The Crawl class scrapes data from the sec.gov website. It sends requests to the API endpoint (https://data.sec.gov/api/xbrl/companyfacts/CIK##########.json), which returns a company's financial figures as a JSON response. The financial figures of every company are then fetched based on the license agreement date and saved to a csv file.

  • Companies that were not found and missing KPI values are stored in a separate file.

  • To run the crawler, use the following command:

$ python orbi/crawl.py 

  Example usage:

    python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --licensee  # searching over licensee information 
    python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --no-licensee # searching over licensor information 
            
  • --source_file: (required) path to the input file. This is the same XLSX file that is used as input for the Orbis batch search.

  • --output_file: (optional) path to the output file. If not provided, the output is saved to the ./data/ folder with the name company_facts_{timestamp}_licensee.csv or company_facts_{timestamp}_licensor.csv.

  • --licensee: (required) boolean flag indicating that the source file is for licensee. Pass either this flag or --no-licensee.

  • --no-licensee: (required) boolean flag indicating that the source file is for licensor. Pass either this flag or --licensee.
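
A paired --licensee / --no-licensee flag like the one described above can be declared with argparse.BooleanOptionalAction (Python 3.9+). The sketch below is an assumption about how such a command line could be defined, not a copy of orbi/crawl.py. The functions build_parser and default_output_file are hypothetical, and the timestamp is passed in as a string because its exact format is not stated here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="crawl.py")
    parser.add_argument("--source_file", required=True, help="path to the input XLSX file")
    parser.add_argument("--output_file", default=None, help="path to the output csv file")
    # BooleanOptionalAction creates both --licensee and --no-licensee;
    # required=True means one of the two has to be given.
    parser.add_argument(
        "--licensee",
        action=argparse.BooleanOptionalAction,
        required=True,
        help="search over licensee (or, with --no-licensee, licensor) information",
    )
    return parser


def default_output_file(licensee, timestamp):
    """Build the documented default file name for a given timestamp string."""
    side = "licensee" if licensee else "licensor"
    return f"./data/company_facts_{timestamp}_{side}.csv"


args = build_parser().parse_args(["--source_file", "sample_data.xlsx", "--no-licensee"])
output_file = args.output_file or default_output_file(args.licensee, "TIMESTAMP")
```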

Example call for licensee field:

python orbi/crawl.py --source_file sample_data.xlsx --output_file company_facts.csv --licensee 
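
The ten # characters in the endpoint mentioned above stand for the CIK number padded with leading zeros to ten digits. The following sketch shows how such a URL could be built from a CIK. The function names and the example CIK 1234 are made up for illustration, and the request is only constructed, not sent. The User-Agent header is included because SEC asks automated clients to identify themselves; the value shown is a placeholder.

```python
from urllib.request import Request

# Endpoint pattern from the text above; {cik:010d} zero-pads to ten digits.
COMPANY_FACTS_URL = "https://data.sec.gov/api/xbrl/companyfacts/CIK{cik:010d}.json"


def company_facts_url(cik):
    """Return the companyfacts URL for a CIK given as an int or a digit string."""
    return COMPANY_FACTS_URL.format(cik=int(cik))


def build_request(cik, user_agent):
    """Prepare (but do not send) a request for one company's facts."""
    return Request(company_facts_url(cik), headers={"User-Agent": user_agent})


# 1234 is an arbitrary example value, not a CIK taken from the input file.
request = build_request("1234", "Example Name example@example.com")
```

Sending the request, for example with urllib.request.urlopen(request), would return the JSON document that the crawler then filters by agreement date.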

On Remote

  • The action can be triggered from the Actions tab on GitHub. On the right side of the page there is a 'Run workflow' button that triggers the action.
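
A workflow that offers the 'Run workflow' button can also be started through the GitHub REST API endpoint for workflow dispatch events. The sketch below only builds such a request and does not send it. It is an illustration and not part of this repository: the workflow file name WORKFLOW_FILE.yml, the branch main and the token TOKEN are placeholders, and build_dispatch_request is a hypothetical helper.

```python
import json
from urllib.request import Request


def build_dispatch_request(owner, repo, workflow_file, ref, token):
    """Prepare a POST request that would trigger a workflow_dispatch run."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    # The API expects the branch or tag to run on in the "ref" field.
    body = json.dumps({"ref": ref}).encode("utf-8")
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder values: replace the workflow file, branch and token with real ones.
dispatch = build_dispatch_request("mrtrkmn", "orbi", "WORKFLOW_FILE.yml", "main", "TOKEN")
```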

To run the Crawler class separately, check out the commented code in the ./orbi/crawl.py file.

Specifically, this line: ./orbi/orbi.py#494

Automation of Orbis database access and batch search on Orbis database

  • orbi.py can access the Orbis database, execute a batch search using the csv file generated by the Crawler class, add/remove columns to enhance the search results, and export the results to a csv file.

  • Currently orbi.py produces the following files. To download them, take the link sent to Slack and append one of the file names below.

Produced files by Orbi class
orbis_aggregated_data_{timestamp}.csv : example --> orbis_aggregated_data_13_01_2023.csv
orbis_aggregated_data_{timestamp}.xlsx : example --> orbis_aggregated_data_13_01_2023.xlsx
orbis_aggregated_data_licensee_{timestamp}.xlsx : example --> orbis_aggregated_data_licensee_14_01_2023.xlsx
orbis_aggregated_data_licensor_{timestamp}.xlsx : example --> orbis_aggregated_data_licensor_14_01_2023.xlsx
orbis_data_licensee_{timestamp}.csv : example --> orbis_data_licensee_14_01_2023.csv
orbis_data_licensee_{timestamp}.xlsx : example --> orbis_data_licensee_14_01_2023.xlsx
orbis_data_licensee_guo_{timestamp}.csv : example --> orbis_data_licensee_guo_14_01_2023.csv
orbis_data_licensee_guo_{timestamp}.xlsx : example --> orbis_data_licensee_guo_14_01_2023.xlsx
orbis_data_licensee_ish_{timestamp}.csv : example --> orbis_data_licensee_ish_14_01_2023.csv
orbis_data_licensee_ish_{timestamp}.xlsx : example --> orbis_data_licensee_ish_14_01_2023.xlsx
orbis_data_licensor_{timestamp}.csv  : example --> orbis_data_licensor_14_01_2023.csv
orbis_data_licensor_{timestamp}.xlsx : example --> orbis_data_licensor_14_01_2023.xlsx
orbis_data_licensor_guo_{timestamp}.csv : example --> orbis_data_licensor_guo_14_01_2023.csv
orbis_data_licensor_guo_{timestamp}.xlsx : example --> orbis_data_licensor_guo_14_01_2023.xlsx
orbis_data_licensor_ish_{timestamp}.csv : example --> orbis_data_licensor_ish_14_01_2023.csv
orbis_data_licensor_ish_{timestamp}.xlsx : example --> orbis_data_licensor_ish_14_01_2023.xlsx
- sample_data.xlsx
  • Data is accessible through: link + file name
Produced files by Crawler class
orbis_aggregated_data_{timestamp}.csv 
orbis_data_licensee_{timestamp}.csv
orbis_data_licensee_guo_{timestamp}.csv
orbis_data_licensee_ish_{timestamp}.csv
orbis_data_licensor_{timestamp}.csv
orbis_data_licensor_guo_{timestamp}.csv
orbis_data_licensor_ish_{timestamp}.csv
  • The XLSX files are generated through the ./orbi/orbi.py by conducting batch search on Orbis database.
Produced XLSX files by Orbi class - END RESULT -
orbis_aggregated_data_{timestamp}.xlsx
orbis_aggregated_data_licensee_{timestamp}.xlsx
orbis_aggregated_data_licensor_{timestamp}.xlsx
orbis_data_licensee_{timestamp}.xlsx
orbis_data_licensee_guo_{timestamp}.xlsx
orbis_data_licensee_ish_{timestamp}.xlsx
orbis_data_licensor_{timestamp}.xlsx
orbis_data_licensor_guo_{timestamp}.xlsx
orbis_data_licensor_ish_{timestamp}.xlsx
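
The file names listed above follow a regular pattern, so the full list for a given run date can be generated rather than typed by hand. The sketch below assumes the timestamp format day_month_year seen in the examples (such as 14_01_2023). The base link is a placeholder for the link sent to Slack, and appending the file name with a single slash is an assumption about how that link is structured.

```python
from datetime import date

ROLES = ("licensee", "licensor")
VARIANTS = ("", "_guo", "_ish")


def expected_files(day):
    """Return the file names listed above for one run date."""
    ts = day.strftime("%d_%m_%Y")
    names = [
        f"orbis_aggregated_data_{ts}.csv",
        f"orbis_aggregated_data_{ts}.xlsx",
    ]
    names += [f"orbis_aggregated_data_{role}_{ts}.xlsx" for role in ROLES]
    for role in ROLES:
        for variant in VARIANTS:
            for ext in ("csv", "xlsx"):
                names.append(f"orbis_data_{role}{variant}_{ts}.{ext}")
    return names


def download_url(base_link, file_name):
    """Append a file name to the base link (link + file name)."""
    return base_link.rstrip("/") + "/" + file_name


files = expected_files(date(2023, 1, 14))
# https://example.com/results/ is a placeholder, not the real Slack link.
urls = [download_url("https://example.com/results/", name) for name in files]
```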

Slack Integration

Currently, action results are uploaded to AWS S3 and are accessible through the link sent to a private Slack channel. The files can be downloaded as described in the Slack channel.

Run orbi from Slack

Orbi can be triggered on GitHub from Slack when you are in the tum-tim.slack.com workspace.

When a user types the following command in the Slack message field and presses 'Enter', Orbi starts the process on GitHub:

/run-orbis-crawler 

Slack responds with a result like the one shown below.

[Image: how-to-run-orbi-from-slack]

After the run is initialized, a message similar to the following is posted to the #idp-data-c channel on Slack:

[Image: Initial Notification]

When the run finishes successfully, a new notification arrives with a link to the data, similar to the following:

[Image: Screenshot 2023-03-01 at 13 47 46]

If the process fails, a notification similar to the following is sent:

[Image: Error notification]

Main Workflow

Besides the main workflow shown below, this repository can be used in other ways.

The workflow is subject to change over time.

Batch Search Flow Chart

The following flow chart shows the batch search process performed by Orbi.

[Image: Flowchart of the batch search functionality of Orbi.]

orbi's People

Contributors

mrtrkmn, github-actions[bot], robotcuk

orbi's Issues

Fix warning message for future dev

/home/thinkpad/Desktop/github-self-hosted/actions-runner/_study/orbi/orbi/orbi/orbi.py:1924: FutureWarning: The default value of regex will change from True to False in a future version.
  df = df.apply(lambda x: x.str.replace(r"\r", ""))
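
One way to silence this warning is to pass the regex argument to Series.str.replace explicitly. The sketch below is a possible fix, not the change that was applied in orbi.py. It uses regex=False together with the actual carriage-return character "\r"; note that the raw string r"\r" would then be matched literally as a backslash followed by the letter r, so the pattern has to change as well. The sample data frame is made up, and the function assumes every column holds strings, as the original line does.

```python
import pandas as pd


def strip_carriage_returns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove carriage returns from every (string) column of the data frame."""
    # "\r" is the real carriage-return character, so a literal match is enough
    # and no regular expression is needed.
    return df.apply(lambda col: col.str.replace("\r", "", regex=False))


# Made-up sample data for illustration.
sample = pd.DataFrame({"name": ["Acme\r Corp", "Foo"], "city": ["Munich\r", "Berlin"]})
cleaned = strip_carriage_returns(sample)
```

Keeping the original pattern and passing regex=True instead, as in x.str.replace(r"\r", "", regex=True), should give the same result, since the regular expression \r also matches a carriage return.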

add new variables to result

  • Company name Latin alphabet
  • Country ISO code
  • BvD ID number
  • Orbis ID number
  • Operating revenue (Turnover) USD
  • Sales USD
  • Gross profit USD
  • Operating P/L [=EBIT] USD
  • P/L before tax USD
  • P/L for period [=Net income] USD
  • Cash flow USD
  • Total assets USD
  • Trade description (English)
  • Currency
     
  • BvD sectors
  • US SIC, primary code(s)
  • US SIC, secondary code(s)
  • Number of employees

run search on ISH field as well

Currently we search by GUO, Company Name, Licensee and Licensor.
It would be nice to also run the batch search on the ISH field.

acquisitions might lead to have different values in financial data

Some companies may have been acquired. Even though we are interested in the company's financial data at the agreement date, before the acquisition, the Orbis database might provide financial figures that belong to the acquiring company.

Double-check this to be sure that the financial figures are not mixed up.

do the batch search for all licensee information provided in the input file

Traverse over licensee information !!

check latest input file

Entry | Licensee | Licensee CIK | Licensee 1_cleaned | Licensee CIK 1_cleaned | Comment | Licensee 2_cleaned | Licensee CIK 2_cleaned | Comment | Licensee 3_cleaned | Licensee CIK 3_cleaned | Comment | Licensee 4_cleaned | Licensee CIK 4_cleaned | Comment | Licensor | Licensor CIK | Licensor 1_cleaned | Licensor CIK 1_cleaned | Comment | Licensor 2_cleaned | Licensor CIK 2_cleaned | Comment | Licensor 3_cleaned | Licensor CIK 3_cleaned | Comment | Licensor 4_cleaned | Licensor CIK 4_cleaned | Comment | Date
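
With up to four cleaned licensee and licensor columns per entry, traversing all of them amounts to looping over the numbered column names in the header above. The sketch below is an illustration of that loop on a plain dictionary row, not the code used in orbi.py. The function name cleaned_names and the sample row are made up, and empty or missing cells are simply skipped.

```python
def cleaned_names(row, role="Licensee", max_slots=4):
    """Collect the non-empty '<role> N_cleaned' values of one input row."""
    names = []
    for slot in range(1, max_slots + 1):
        value = row.get(f"{role} {slot}_cleaned")
        # Skip missing cells and cells that contain only whitespace.
        if value is not None and str(value).strip():
            names.append(str(value).strip())
    return names


# Made-up sample row using the column names from the input file header.
row = {
    "Entry": 1,
    "Licensee 1_cleaned": "Acme Corp",
    "Licensee 2_cleaned": " Foo Ltd ",
    "Licensee 3_cleaned": "",
    "Licensee 4_cleaned": None,
    "Licensor 1_cleaned": "Bar Inc",
}

licensees = cleaned_names(row)
licensors = cleaned_names(row, role="Licensor")
```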

ask orbis about after acquisition data

We have seen that an acquisition leads to different financial information for the year of the license agreement. It would be nice to have the data for the company as it was before the acquisition.

The current financial information belongs to the company that acquired the company we are looking for.

history information about m&a activities

It would be good to have the M&A activities of the companies.

In a brief search, we found that this information is available under

Stock and Earnings Estimates > Annual Stock Data
