epam / osci

Open Source Contributor Index

Home Page: https://opensourceindex.io/

License: GNU General Public License v3.0

Languages: Python 98.91%, HTML 0.95%, Dockerfile 0.15%
Topics: open-source, analytics, pyspark, python, azure-functions

osci's Introduction

OSCI Logo

Open Source Contributor Index (OSCI)

OSCI is an open source project that aims to track and measure open source activity on GitHub by commercial organizations. It allows organizations, communities, analysts and individuals involved in open source to get insights into contribution trends among commercial organizations, by providing access to up-to-date data through an intuitive interface.

OSCI Working_Group

How does OSCI work?

To create this index, the system processes GitHub push events data from GH Archive:

GitHub OSCI Schematic Diagram
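
For illustration, here is a minimal sketch of that first step, assuming only the public GH Archive URL scheme (one gzipped JSON-lines file per hour); the project's real crawler lives in osci/crawlers/github/gharchive.py and is more involved.

    import gzip
    import json
    import urllib.request

    def fetch_push_events(date: str, hour: int):
        """Download one hourly GH Archive file and yield its PushEvent records."""
        # GH Archive publishes e.g. https://data.gharchive.org/2020-01-01-0.json.gz
        url = f"https://data.gharchive.org/{date}-{hour}.json.gz"
        with urllib.request.urlopen(url) as resp:
            for line in gzip.decompress(resp.read()).splitlines():
                event = json.loads(line)
                if event.get("type") == "PushEvent":
                    yield event

    # Author emails and SHAs of push-event commits for the first hour of the day
    for event in fetch_push_events("2020-01-01", 0):
        for commit in event["payload"].get("commits", []):
            print(commit["author"]["email"], commit["sha"])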

OSCI tracks two measures for each organization:

  • Active contributors, the number of people who authored 10 or more commits over a period of time
  • Total community, the number of people who made at least one commit over a period of time
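
In pandas terms (a sketch under assumed column names, not the project's actual job code), these two measures reduce to a couple of group-bys:

    import pandas as pd

    def measures(commits: pd.DataFrame) -> pd.DataFrame:
        """commits: one row per commit, with 'company' and 'author_email'
        columns, already restricted to the period of interest."""
        per_author = commits.groupby(["company", "author_email"]).size()
        active = per_author[per_author >= 10].groupby(level="company").size()
        community = per_author.groupby(level="company").size()
        return (pd.DataFrame({"active_contributors": active,
                              "total_community": community})
                .fillna(0).astype(int))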

How are commit authors linked to commercial organizations?

The system uses the email domain of the commit author to identify the organization. Is your organization missing from the ranking? Feel free to add it to the list.

Note: OSCI does not rank open source activity contributed by universities, research institutions and individual entrepreneurs.
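
A minimal sketch of this matching, mirroring the shape of the company_domain_match_list.yaml entries shown in the submission steps below (exact domain match first, then the optional regex entries); the project's real implementation may differ in detail:

    import re

    COMPANIES = [  # shape mirrors company_domain_match_list.yaml
        {"company": "Facebook",
         "domains": ["fb.com", "facebook.com"],
         "regex": [r"^.*\.fb\.com$", r"^.*\.facebook\.com$"]},
    ]

    def match_company(email: str):
        domain = email.rsplit("@", 1)[-1].lower()
        for entry in COMPANIES:
            if domain in (d.lower() for d in entry["domains"]):
                return entry["company"]
            if any(re.match(p, domain) for p in entry.get("regex") or []):
                return entry["company"]
        return None

    assert match_company("alice@fb.com") == "Facebook"
    assert match_company("bob@corp.facebook.com") == "Facebook"  # via regex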

How can I submit my company for ranking?

  1. Check whether the organization you propose to add matches the OSCI definition:

    • not an educational, governmental, non-profit or research institution;
    • registered, commercial organization;
    • sells goods or services for the purpose of making a profit.
  2. Create a new pull request.

  3. Go to company domain match list (company_domain_match_list.yaml)

  4. Double-check that the organization you want to add is not already listed.

  5. Add the email domain of the company and the company name to the list. For example:

    - company: Facebook
      domains:
        - fb.com
      regex:
  6. If the company has more than one email domain for its employees, add all of them to the domains block (or use regex entries for regular expressions). For example:

    - company: Facebook
      domains:
        - fb.com
        - facebook.com
      regex:
        - ^.*\.fb\.com$
        - ^.*\.facebook\.com$
  7. Select the industry to which your company belongs from the following list:

    • Automotive;
    • Banking, Insurance & Financial Services;
    • Education;
    • Energy & Utilities;
    • Entertainment;
    • Healthcare and Pharma;
    • Professional Services;
    • Public Sector;
    • Retail & Hospitality;
    • Technology;
    • Media & Telecoms;
    • Travel & Transport;
    • Other (please specify);

    For example:

    - company: Facebook
      domains:
        - fb.com
        - facebook.com
      regex:
        - ^.*\.fb\.com$
        - ^.*\.facebook\.com$
      industry: Media & Telecoms

Our team will review your pull request and merge it if everything is correct.

Note: since OSCI processes the data for the previous month, you'll see your organization's rank at the beginning of the next month.

How can I contribute to OSCI?

See CONTRIBUTING.md for details on the contribution process.

QuickStart

OSCI is deployed into an Azure cloud environment using Azure Data Factory, Azure Functions and Azure Databricks. However, the code available on GitHub does not require Azure: you can run the application from the command line using the instructions below.

Installation

  1. Clone the repository
         git clone https://github.com/epam/OSCI.git
  2. Go to the project directory
         cd OSCI
  3. Install the requirements
         pip install -r requirements.txt

Configuration

Create a file local.yml (by default this file is added to .gitignore) in the directory osci/config/files. A sample file default.yml is included; please don't change the values in that file.
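
For reference, a minimal local.yml for a purely local run might look like the sketch below; the keys are copied from the configuration dump in the "Unable to get basic example to run" issue log further down this page, so treat them as illustrative rather than authoritative, and adjust the paths for your machine.

    meta:
      config_source: yaml
    file_system:
      type: local
      base_path: /data          # root folder for the landing/staging/public areas
    areas:
      landing:
        container: landing
      staging:
        container: staging
      public:
        container: public
    bq:
      project: ''
      secret: '{}'
    web:
      fs: local
      base_path: /web
      account_name: ''
      account_key: ''
      container: data
    github:
      token: ''                 # optional GitHub API token
    company:
      default: EPAM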

Sample run

  1. Run the script to download data from the archive (for example, for 01 January 2020)
         python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
  2. Run the script to add the company field (matched by domain) (for example, for 01 January 2020)
         python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
  3. Run the script to generate the daily OSCI rankings (for example, up to 02 January 2020)
         python3 osci-cli.py daily-osci-rankings -td 2020-01-02

OSCI Versioning

For comprehensive OSCI versioning we adopted the scheme <year>.<month>.<patch number>, e.g. 2021.05.0. We expect regular monthly updates, including releases associated with the submission of new companies for ranking.

License

OSCI is licensed under the GNU General Public License v3.0.

Contact Us

For support or help using OSCI, please contact us at [email protected].

osci's People

Contributors

abitrolly, achimnol, ashleywolf, dependabot[bot], embeddalex, irynastr, mike-n-jacobs, nikos912000, nslsrv, patrickstephens1, revfactory, richardlitt, sfermigier, simenandre, uliana2019, vlad-isayko


osci's Issues

Measuring company support for known OSS public projects

While OSCI is primarily a reputation tool, it could actually do some good if it gave companies a concrete way to support and compete on this user story.

What companies can do in particular, if they would like to support poetry in some way, is to give their employees a fixed amount of time they can spend supporting poetry. Having more people who can contribute on a regular basis would help us a lot. They don't necessarily need to start coding: looking into the issue tracker and finding duplicates or outdated tickets, or answering questions there, is important as well and would give us more time to actually fix bugs or implement new features.

python-poetry/poetry#4160 (comment)

Data inconsistency or update issues?

Active contributor : query

Hi,
Say developer X makes 10 commits in Jan 2021 and another 10 the next month, i.e. Feb 2021, and developer Y makes 10 commits in Feb 2021 only.

Does the active contributor number for the organization show 2 or 3?

I assume it is 2.
Please can someone help confirm?
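
Going by the definition near the top of this page (active contributors are the distinct people with 10 or more commits over the period), a window spanning Jan-Feb 2021 should indeed show 2, since X counts once however many months they are active in. A toy check of that reading:

    from collections import Counter

    # Jan: X makes 10 commits; Feb: X makes 10 more and Y makes 10
    commits = ["x@corp.com"] * 10 + ["x@corp.com"] * 10 + ["y@corp.com"] * 10
    per_author = Counter(commits)  # deduplicated by author email
    active = [a for a, n in per_author.items() if n >= 10]
    print(len(active))  # 2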

SmartBear not appearing in the list

As mentioned in #122, I'm puzzled why SmartBear does not appear in the list, now that we're into the month of April.

We were added in v2022.03.0.

I'm not sure if this is user error on my part, so please let me know if there's something I'm misunderstanding, or we need to make some additional configuration.

Here's an example of a commit that I think should have been counted: https://github.com/cucumber/cucumber-js/commits/d98c6deabd39e1adb8e52a1a65324662143108e8

How to run OSCI in 2023?

Dear team,

I am trying to run a local OSCI installation to get some stats. After failing to install the tools on Ubuntu 22.04 LTS, I switched to 20.04 LTS and at least got the dependencies installed using pip and Python 3.8.

Now the first step of the pipeline is failing, i.e. running python3 osci-cli.py get-github-daily-push-events -d 2020-01-02

[2023-07-11 12:09:27,286] [ERROR] Failed to parse json: . Error: Expecting value: line 1 column 1 (char 0)
[2023-07-11 12:09:27,329] [INFO] Save push events commits for 2020-01-02 00:00:00 into file /data/landing/github/events/push/2020/01/02/2020-01-02-0.parquet
Traceback (most recent call last):
  File "osci-cli.py", line 93, in <module>
    cli(standalone_mode=False)
  File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/sten/Desktop/OSCI/osci/actions/base.py", line 59, in execute
    return self._execute(**self._process_params(kwargs))
  File "/home/sten/Desktop/OSCI/osci/actions/load/load.py", line 34, in _execute
    return get_github_daily_push_events(day=day)
  File "/home/sten/Desktop/OSCI/osci/crawlers/github/gharchive.py", line 34, in get_github_daily_push_events
    DataLake().landing.save_push_events_commits(push_event_commits=push_events_commits, date=day)
  File "/home/sten/Desktop/OSCI/osci/datalake/local/landing.py", line 42, in save_push_events_commits
    log.info(f'Push events commits df info {get_pandas_data_frame_info(df)}')
  File "/home/sten/Desktop/OSCI/osci/utils.py", line 46, in get_pandas_data_frame_info
    df.info(buf=buf)
  File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 2497, in info
    mem_usage = self.memory_usage(index=True, deep=deep).sum()
  File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 2590, in memory_usage
    result = Series(self.index.memory_usage(deep=deep), index=["Index"]).append(
  File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/series.py", line 305, in __init__
    data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
  File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/construction.py", line 465, in sanitize_array
    subarr = construct_1d_arraylike_from_scalar(value, len(index), dtype)
  File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1452, in construct_1d_arraylike_from_scalar
    subarr = np.empty(length, dtype=dtype)
TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type

Would Docker be a stable environment to run this in? My aim is to count GitHub contributions based on some email regexps.

Thanks!

Unable to get basic example to run

Hey, folks --

I'm having trouble getting the basic example provided to run. Specifically the failure I'm encountering is at the daily-osci-rankings stage. I have confirmed that I have a functioning local version of Hadoop installed. Running on Ubuntu 20.04 LTS VPS with a fresh install.

I pulled out the two most visible errors from the log below (full log at the bottom of the issue). It's unclear to me whether they are related, though.

Any help pointing me in the right direction would be appreciated!

$ python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py daily-osci-rankings -td 2020-01-02
# failure (see full log below)

# ...

[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
	at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data

# ...

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
  File "osci-cli.py", line 93, in <module>
    cli(standalone_mode=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
    return self._execute(**self._process_params(kwargs))
  File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
    commits = osci_ranking_job.extract(to_date=to_day).cache()
  File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
    commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
  File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
    return self.spark_session.read.load(paths, **options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
    return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from

Full Error Log:
[2022-03-22 18:11:05,996] [INFO] ENV: None
[2022-03-22 18:11:05,997] [DEBUG] Check config file for env local exists
[2022-03-22 18:11:05,997] [DEBUG] Read config from /home/ubuntu/OSCI/osci/config/files/local.yml
[2022-03-22 18:11:06,000] [DEBUG] Prod yml load: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [DEBUG] Prod yml res: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Full config: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Configuration loaded for env: local
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.LocalFileSystemConfig'>
[2022-03-22 18:11:06,000] [DEBUG] {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.Config'>
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.datalake.datalake.DataLake'>
[2022-03-22 18:11:06,113] [INFO] Execute action `daily-osci-rankings`
[2022-03-22 18:11:06,113] [INFO] Action params `{'to_day': '2020-01-02'}`
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.osci_ranking.OSCIRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.commits_ranking.OSCICommitsRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.jobs.session.Session'>
[2022-03-22 18:11:06,115] [DEBUG] Loaded paths for (None 2020-01-02 00:00:00) []
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[2022-03-22 18:11:08,127] [DEBUG] Command to send: A
# ... (py4j DEBUG command/answer exchange during Spark session startup omitted)
22/03/22 18:11:10 WARN DataSource: All paths were ignored:

[Stage 0:>                                                          (0 + 1) / 1]
[2022-03-22 18:11:11,839] [DEBUG] Answer received: !xro26
# ... (further py4j DEBUG exchange omitted)
[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data
Traceback (most recent call last):
  File "osci-cli.py", line 93, in <module>
    cli(standalone_mode=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
    return self._execute(**self._process_params(kwargs))
  File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
    commits = osci_ranking_job.extract(to_date=to_day).cache()
  File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
    commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
  File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
    return self.spark_session.read.load(paths, **options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
    return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

Company missing from the latest report

Hi and thanks for this project :)

We recently raised a PR to add Expedia Group to the list.
Unfortunately we couldn't find the company in the latest report, published today. Have we missed anything in the PR, or can we expect to make it into the next report?

Query on ranking

Hello,
My organization (Societe Generale), as of Feb 2022, seems to have a total community number of 7 (+5) and 0 active contributors. It is also indicated that we are down by 3 places.
I'd like to know why we are placed at 284 and not 276.

When the number of active contributors is equal, what is considered in the ranking?
Please can you help with this query?

Clarification counting method

Am I right in assuming that these are the steps you take to count the open source contributions?

  1. get commits from push event data (GH Archive/BigQuery)
  2. only keep commits to repositories, which do have a license (GitHub API for license info)
  3. match author email domains for selected organizations
  4. use the author email to identify unique contributors and count commits
  5. count total community / active contributors

I just wanted to clarify so I better understand how to interpret the results. Great project.

Bitbucket OSCI

The goal is to create and automate analysis of repos hosted on BitBucket. This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

  1. Solution that crawls data about push events commits (PEC) that should contain the following required fields:
    • event creation date;
    • commit author (email address, name);
    • SHA.
  2. Adapt existing pipeline to process Bitbucket data.

We did a high-level technical analysis of the feasibility of making an OSCI for repos hosted on Bitbucket. This is a summary of our findings:

| Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc) |
| --- | --- | --- |
| Is this site free to use for open source projects? | yes | Seems to be free only for teams under 5 people unless you request a community license. |
| Does it look like this site hosts many open source projects? | unclear | It's not clear that there are large numbers of open source projects hosted. Most public projects seem to be non-commercial (not outsourced by companies). It appears most users do not use company domains - need to investigate further. The pages/repos of many companies appear to be inactive. |
| Size of user base | - | In the order of 5,000,000 users |
| Is there a public API we can query? | yes | |
| API type | REST | |
| API URL | https://api.bitbucket.org/2.0 | |
| Query Limits (if any) | 1,000 per hour / 60,000 per hour | |
| Is there a paid access with more information? | - | (to be investigated) |
| Is it possible to query the project license? | yes | |
| Is it possible to query commit events/commit counts by a user in a time period? | yes | /repositories?before=timestamp&after=timestamp, e.g. https://api.bitbucket.org/2.0/repositories?after=2020-03-01T09%3A37%3A06.254721%2B00%3A00 |
| Is it possible to query email address or else some organization information for the person making a commit? | yes | email address |
| Is there a public archive we can use instead of the public API? | no | |
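
As a rough illustration of step 1, here is a paginated pull of repositories from the public API named above; the endpoint and the after parameter come from the table, while the commit-level field names are assumptions based on Bitbucket's documented 2.0 response shape and would need verification:

    import requests

    API = "https://api.bitbucket.org/2.0"

    def iter_public_repos(after: str):
        """Walk the paginated public repository listing (after=ISO timestamp)."""
        url = f"{API}/repositories?after={after}"
        while url:
            page = requests.get(url).json()
            yield from page.get("values", [])
            url = page.get("next")  # Bitbucket paginates via a 'next' link

    def iter_push_event_commits(full_name: str):
        """Yield (event creation date, commit author, SHA) per commit."""
        url = f"{API}/repositories/{full_name}/commits"
        while url:
            page = requests.get(url).json()
            for c in page.get("values", []):
                yield c["date"], c["author"]["raw"], c["hash"]
            url = page.get("next")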

If you have additional questions, feel free to contact our team.

Improve identification of committers' organizations

The current OSCI implementation uses the email domain of the committer to identify their organization. Many developers do not use their company email address on GitHub, or do not make their email address public. However, many of these people do include their organizational information in their GitHub user profiles.

We would like to improve the identification of committers' organizations using the data in their user profiles.

We already ran an experiment to do this, but with minimal success; it is described below.
The basic matching algorithm works like this:

  1. The domain is extracted from the committer's email;
  2. Each domain is compared, case-insensitively, with the list of company domains (google.com, microsoft.com, etc.);
  3. If no match is found, a regular expression analysis is performed to handle domains of level 3 and higher.

If the basic algorithm produces no match, an extended algorithm was proposed:

  1. The profile information from the user's GitHub account is downloaded;
  2. The website field is taken from the profile; if it is empty, go to step 3. Otherwise, the basic algorithm is applied to the specified domain. If no match occurs, go to step 3.
  3. The company field is taken and compared with the list of companies that we are processing (fuzzy matching algorithms were used: Levenshtein distance, Sorensen-Dice coefficient, etc.).

Result of the experiment:
For only 38% of all the users examined did we manage to match a company from their profile; the remaining profiles had no clear match. Relaxing the match rules adds only another 5%.

It is also worth noting that for the users whose company we did manage to match from their profile, the company was in every case the same as the one derived from their email.

Finally, this method of identifying the company carries a large overhead. Implementing this approach would require downloading information for all users who made push events in 2020, of whom there are 5M (as of June 2020); at the authenticated GitHub API limit of 5,000 requests per hour, loading their profiles would take about 1,000 hours, i.e. roughly 42 days. We would also have to load new profiles every day, and that download, in turn, may not fit into the daily usage limits.
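
For reference, the fuzzy comparison in step 3 can be approximated with the standard library alone; the experiment itself used Levenshtein distance and the Sorensen-Dice coefficient, so difflib's ratio is only a stand-in here, and the 0.85 cutoff is an arbitrary assumption:

    from difflib import SequenceMatcher

    KNOWN_COMPANIES = ["Google", "Microsoft", "Societe Generale"]  # illustrative

    def match_profile_company(profile_company: str, cutoff: float = 0.85):
        """Compare a GitHub profile's free-text company field against the
        tracked companies and return the best match above the cutoff."""
        cleaned = profile_company.strip().lstrip("@").lower()
        scored = [(SequenceMatcher(None, cleaned, known.lower()).ratio(), known)
                  for known in KNOWN_COMPANIES]
        score, best = max(scored)
        return best if score >= cutoff else None

    print(match_profile_company("@google"))        # Google
    print(match_profile_company("Self-employed"))  # None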

TypeError: an integer is required (got type bytes)

When I run the command below

python3 osci.py daily-osci-rankings -td 2020-01-02

I get the following error

Traceback (most recent call last):
  File "osci.py", line 44, in <module>
    from cli import company_rankers
  File "/home/mirai/OSCI/cli/company_rankers.py", line 23, in <module>
    from __app__.jobs.contributors_ranking import ContributorsRankingJob
  File "/home/mirai/OSCI/__app__/jobs/contributors_ranking.py", line 17, in <module>
    from pyspark.sql import DataFrame
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/context.py", line 31, in <module>
    from pyspark import accumulators
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/serializers.py", line 72, in <module>
    from pyspark import cloudpickle
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

os: Ubuntu 20.04.1 LTS
python : Python 3.8.5
pip: 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)

Any ideas what's causing this?

Calculating ranking of an organization

Hello,
My understanding is that OSCI considers commits made by an organization AFTER it has been added, and that any commits made BEFORE will NOT be considered for ranking.

Example: I add Societe Generale in September. OSCI does NOT consider any commits made by Soc Gen before September, and the ranking is based on commits made from September to date.

Is that right?

Exclude company's own projects filter

I think it would be pertinent to include a filter that excludes contributions to a company's own open source projects.
As much as I enjoy seeing the numbers, I feel it would be amazing to see which companies contribute the most outside their own circle of influence; this could shift the rankings somewhat and showcase a bit more of the open source community on the top lists.

Throwing this out there as an idea; I absolutely understand if it is not relevant to this project, but maybe it is something worth thinking about!

Unusual spike in data? (starting Nov 21)

While visualizing some of the data OSCI provides, I noticed an unusual spike in the data starting at the end of October/beginning of November 2021 and potentially still ongoing. I was wondering if there was a change to how active contributors or community are measured?

This graph shows the development for Google over the last 3 years. Besides the usual spikes at the beginning of each year, there is an additional spike in November, and after that the daily increase seems to be higher than in previous periods as well.
[image: graph of Google's contributor numbers over the last 3 years]

The spike is there for different companies and not specific for the above example.

Gitlab OSCI

The goal is to create and automate analysis of repos hosted on GitLab (https://gitlab.com). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

  1. Solution that crawls data about push events commits (PEC) that should contain the following required fields:
    • event creation date;
    • commit author (email address, name);
    • SHA.
  2. Adapt existing pipeline to process Gitlab data.

We did a high-level technical analysis of the feasibility of making an OSCI for repos hosted on GitLab. This is a summary of our findings:

| Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc) |
| --- | --- | --- |
| Is this site free to use for open source projects? | yes | Seems to be free with basic features (they sell more advanced CI/CD features) |
| Does it look like this site hosts many open source projects? | yes | |
| Is there a public API we can query? | yes | |
| API type | REST | |
| API URL | https://gitlab.com/api/v4/ | |
| Query Limits (if any) | 600 queries per 60 second period | |
| Is there a paid access with more information? | no | |
| Is it possible to query the project license? | no | |
| Is it possible to query commit events/commit counts by a user in a time period? | no | |
| Is it possible to query email address or else some organization information for the person making a commit? | yes | email address |
| Is there a public archive we can use instead of the public API? | no | |
| Any additional Information worth knowing? | no | |

Indexing Subsidiaries and their email domains

Hello,

What is the policy for adding subsidiaries to the OSCI index?
Is it the decision of the parent company, or of EPAM/OSCI, how the email domains are listed?

Should all subsidiaries of a company be under the same umbrella (and index calculation) as the parent company,
OR
should each subsidiary with a different email domain be added as a new company (and index calculation) under the subsidiary's name?

For example, a company "X" has 2 subsidiaries "Y" and "Z"

Option A:

- company: X
  domains:
    - X.com
    - Y.com
    - Z.com

Option B:

- company: X
  domains:
    - X.com

- company: Y
  domains:
    - Y.com

- company: Z
  domains:
    - Z.com

Which would be acceptable ?

SourceForge OSCI

The goal is to create and automate analysis of repos hosted on SourceForge (https://sourceforge.net/). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

  1. Solution that crawls data about push events commits (PEC) that should contain the following required fields:
    • event creation date;
    • commit author (email address, name);
    • SHA.
  2. Adapt existing pipeline to process SourceForge data.

We did a high-level technical analysis of the feasibility of making an OSCI for repos hosted on SourceForge. This is a summary of our findings:

| Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc) |
| --- | --- | --- |
| Is this site free to use for open source projects? | yes | |
| Does it look like this site hosts many open source projects? | yes | "over 430,000 projects". Popular in the open source community. BUT it hosts a lot of binaries and mirrors of repos which are primarily hosted on GitHub or elsewhere. |
| Size of user base | - | "we host over 3.7 million registered users" |
| Is there a public API we can query? | yes | |
| API type | not studied yet | |
| API URL | not studied yet | |
| Query Limits (if any) | not studied yet | |
| Is there a paid access with more information? | not studied yet | |
| Is it possible to query the project license? | not studied yet | |
| Is it possible to query commit events/commit counts by a user in a time period? | not studied yet | |
| Is it possible to query email address or else some organization information for the person making a commit? | not studied yet | |
| Is there a public archive we can use instead of the public API? | not studied yet | |
| Any additional Information worth knowing? | not studied yet | |

Filtering bots from OSCI ranking

The goal is to improve our existing OSCI code, which ranks companies on the basis of the number of commits. The current situation is that there appear to be a large number of commits made by automated processes associated with GitHub accounts that have a company (commercial organization) email domain. These skew the ranking of companies based on commits, which is precisely why our OSCI ranking is based on the number of contributors rather than the number of commits.

For example, when we look at the OSCI commit-based company counts to the end of June 2020, we see:

| OrgName | Commits |
| --- | --- |
| Microsoft | 640009 |
| GitHub | 519108 |
| Renovateapp | 472705 |
| Google | 379847 |
| Red Hat | 331087 |
| Travis CI | 195377 |
| Intel | 150613 |
| IBM | 131510 |
| Exoplatform | 125844 |
| Odoo | 113452 |
| Pyup | 82118 |

However, Renovateapp, Travis CI, Exoplatform and Pyup do not feature highly in our OSCI contributor-based company ranking. In fact, Renovateapp has only 4 active contributors, Travis CI has 67, Exoplatform has 41, and Pyup has 4.

When we dig deeper into this, we see:

These are the top commit authors for Pyup:

| Company | AuthorName | Commits |
| --- | --- | --- |
| Pyup | pyup-bot | 349717 |
| Pyup | pyup.io bot | 10146 |
| Pyup | pyup.io vuln bot | 22 |
| Pyup | pyup.io bot (via Travis CI) | 1 |

As you can see, all of them are bots.
The same picture holds for Renovateapp:

| Company | AuthorName | Commits |
| --- | --- | --- |
| Renovateapp | Renovate Bot | 2348935 |
| Renovateapp | WhiteSource Renovate | 65148 |
| Renovateapp | Renovate Bot (via Travis CI) | 358 |
| Renovateapp | renovate-bot | 63 |
| Renovateapp | Rhys Arkins | 3 |

Travis CI (top 10 by commits):

| Company | AuthorName | Commits |
| --- | --- | --- |
| Travis CI | Deployment Bot (from Travis CI) | 426727 |
| Travis CI | Travis CI | 92799 |
| Travis CI | travis-ci | 11824 |
| Travis CI | TravisCI | 9511 |
| Travis CI | Travis | 8128 |
| Travis CI | Deployment Bot (Travis) | 7723 |
| Travis CI | Deployment Bot | 1917 |
| Travis CI | raveit65 | 1322 |
| Travis CI | Piotr Milcarz | 1317 |
| Travis CI | Travis Build Bot (from Travis CI) | 1015 |

The biggest part of the commits is coming from bots.

We would like a way to filter out these automated processes/bot commits, so that we could more accurately generate a ranking of companies based on commits.

One obvious way is to simply have a 'blacklist' of GitHub accounts / email addresses, but perhaps something more sophisticated could be devised, based on 'unhuman' levels of activity.

At the moment, we use the domain <-> company match list to select the companies that appear in the rankings we produce. Perhaps the problem of bots can be solved by creating a similar list that filters out bot accounts (a sketch follows below).
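
A minimal sketch of such a bot filter, assuming a hand-maintained pattern list; the patterns below are illustrative, derived from the author names in the tables above:

import re

# Illustrative patterns only; a real list would live in a YAML file similar to
# company_domain_match_list.yaml so it can be maintained via pull requests.
BOT_NAME_PATTERNS = [
    re.compile(r'\bbot\b', re.IGNORECASE),           # "pyup-bot", "Renovate Bot", ...
    re.compile(r'^travis[\s-]?ci$', re.IGNORECASE),  # "Travis CI", "travis-ci", "TravisCI"
    re.compile(r'^renovate\b', re.IGNORECASE),       # "Renovate Bot (via Travis CI)"
]

def is_probably_bot(author_name: str) -> bool:
    """Return True if a commit author name matches a known bot pattern."""
    return any(pattern.search(author_name) for pattern in BOT_NAME_PATTERNS)

commits = [('pyup-bot', 349717), ('Rhys Arkins', 3)]
human_commits = [(name, n) for name, n in commits if not is_probably_bot(name)]
# -> [('Rhys Arkins', 3)]

A threshold on commits per author per day could complement this, catching 'unhuman' levels of activity that no pattern list anticipates.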

LaunchPad OSCI

The goal is to create and automate analysis of repos hosted on LaunchPad (https://launchpad.net/). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

  1. A solution that crawls push event commit (PEC) data containing the following required fields:
    • event creation date;
    • commit author (email address, name);
    • SHA.
  2. Adapt the existing pipeline to process LaunchPad data.

We did a high-level technical analysis of the feasibility of making an OSCI for repos hosted on LaunchPad. This is a summary of our findings:

| Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc.) |
| --- | --- | --- |
| Is this site free to use for open source projects? | yes | |
| Does it look like this site hosts many open source projects? | yes | In total (all projects): "43,314 projects, 1,808,413 bugs, 1,004,760 branches, 17,009 Git repositories, 2,977,004 translations, 685,925 answers, 77,280 blueprints, and counting..." (https://launchpad.net/projects/+all?batch=75). Seems to be popular mainly with the Ubuntu community; many repos appear to be mirrors of projects hosted elsewhere (more data needed to prove this); a lot of Linux-focused projects. |
| Size of user base | - | c. 4,000,000 |
| Is there a public API we can query? | yes | |
| API type | HTTP | |
| API URL | http://api.launchpad.net/1.0/ | |
| Query Limits (if any) | - (to be investigated) | |
| Is there a paid access with more information? | - (to be investigated) | |
| Is it possible to query the project license? | yes | |
| Is it possible to query commit events/commit counts by a user in a time period? | yes | |
| Is it possible to query email address or else some organization information for the person making a commit? | yes | email address (requires authentication) |
| Is there a public archive we can use instead of the public API? | no | |
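
A quick probe of the API is possible with launchpadlib, Launchpad's official Python client; this is a hedged sketch, and 'example-project' is a placeholder name:

from launchpadlib.launchpad import Launchpad

# anonymous access suffices for public project metadata; querying commit
# author email addresses would require an authenticated login_with() session
launchpad = Launchpad.login_anonymously('osci-feasibility-check', 'production')
project = launchpad.projects['example-project']  # placeholder project name
print(project.display_name, project.licenses)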

Location-based OSCI rating

Proposed by @abitrolly.

Add the ability to filter companies by regions (created after discussion in #5 and #6)

  • Country of registration
  • Country of presence
  • Country of origin (native perception)
This will require some external datasets.

Identify and classify non-standard open source licenses in GitHub repos

The goal is to identify the license used by GitHub repos where it is not classified automatically by GitHub. As stated in the GitHub API documentation (https://developer.github.com/v3/licenses/), the open source Ruby gem Licensee (https://github.com/licensee/licensee) is used to identify the license type.

Licensee automates the process of reading LICENSE files and compares their contents to known licenses using several strategies (which we call "Matchers"). It attempts to determine a project's license in the following order:

  1. If the license file has an explicit copyright notice, and nothing more (e.g., Copyright (c) 2015 Ben Balter), we'll assume the author intends to retain all rights, and thus the project isn't licensed.
  2. If the license is an exact match to a known license. If we strip away whitespace and copyright notice, we might get lucky, and direct string comparison in Ruby is cheap.
  3. If we still can't match the license, we use a fancy math thing called the Sørensen–Dice coefficient, which is really good at calculating the similarity between two strings. By calculating the percent changed from the known license to the license file, you can tell, e.g., that a given license is 95% similar to the MIT license, that 5% likely representing legally insignificant changes to the license text (a sketch of this coefficient follows below).
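
A minimal sketch of the bigram-based Sørensen–Dice similarity described in step 3 (Licensee's actual matcher is more elaborate than this):

def dice_coefficient(a: str, b: str) -> float:
    """Sørensen–Dice similarity over sets of character bigrams, in [0.0, 1.0]."""
    def bigrams(s: str) -> set:
        return {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

# e.g. flag a repo's LICENSE as "effectively MIT" when
# dice_coefficient(license_text, canonical_mit_text) >= 0.90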

We published a blog with an analysis of open source licenses used on GitHub (https://solutionshub.epam.com/blog/post/examining-open-source-license-usage), but it includes a lot of repos which have a "custom" license. Our manual analysis of a selection of these shows that many organizations use licenses which are minor modifications of standard licenses, but GitHub does not automatically identify their license. If this could be improved, we would be able to make a better analysis of the popularity of open source repos among commercial organizations as well as across all of GitHub.

The goal is to identify such licenses with a high level of probability, e.g. the content of the LICENSE file is 90+% the same as the standard Apache license text.

Our suggestion would be to do this for the top license types, which based on our analysis are Apache 2.0, BSD 3-clause, MIT, GPL 2.0, GPL 3.0, LGPL 2.1 and EPL 1.0.

Some examples:

  • https://github.com/MicrosoftDocs/microsoft-365-docs/blob/public/LICENSE - Creative Commons license
  • https://github.com/dotnet/runtime/blob/master/LICENSE.TXT - MIT license
  • https://github.com/IBM-Cloud/webapp-with-cos-and-cdn/blob/master/License.txt, https://github.com/IBM-Cloud/serverless-followupapp-ios/blob/master/License.txt - Apache 2.0
  • https://github.com/strongloop/loopback.io/blob/gh-pages/LICENSE, https://github.com/strongloop/loopback-next/blob/master/LICENSE - MIT license
  • https://github.com/mono/mono/blob/master/LICENSE - a mix of licenses, so it won't be possible to identify a single license type

Unable to get data

I am trying to run the example commands, and there doesn't seem to be a /data folder, or the permissions for it are wrong. Am I missing something? Thank you.

Error:

➜  OSCI git:(master) python3 osci.py get-github-daily-push-events -d 2020-01-01
[2021-02-11 17:55:21,671] [INFO] ENV: None
[2021-02-11 17:55:21,671] [DEBUG] Check config file for env local exists
[2021-02-11 17:55:21,671] [DEBUG] Read config from /Users/richard/src/OSCI/__app__/config/files/local.yml
[2021-02-11 17:55:21,674] [INFO] Configuration loaded for env: local
[2021-02-11 17:55:21,674] [DEBUG] Create new <class '__app__.config.base.LocalFileSystemConfig'>
[2021-02-11 17:55:21,674] [DEBUG] Create new <class '__app__.config.base.Config'>
[2021-02-11 17:55:21,676] [DEBUG] Create new <class '__app__.datalake.datalake.DataLake'>
[2021-02-11 17:55:21,681] [INFO] Crawl events for 2020-01-01 00:00:00
[2021-02-11 17:55:21,681] [INFO] Load events for date: 2020-01-01 00:00:00
[2021-02-11 17:55:21,691] [DEBUG] Starting new HTTPS connection (1): data.gharchive.org:443
[2021-02-11 17:55:21,968] [DEBUG] https://data.gharchive.org:443 "GET /2020-01-01-0.json.gz HTTP/1.1" 200 15670114
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020/01/01'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020/01'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "osci.py", line 75, in <module>
    cli(standalone_mode=False)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/richard/src/OSCI/cli/gharchive.py", line 36, in get_github_daily_push_events
    gharchive.get_github_daily_push_events(day=day)
  File "/Users/richard/src/OSCI/__app__/crawlers/github/gharchive.py", line 34, in get_github_daily_push_events
    DataLake().landing.save_push_events_commits(push_event_commits=push_events_commits, date=day)
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 39, in save_push_events_commits
    file_path = self._get_hourly_push_events_commits_path(date)
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 73, in _get_hourly_push_events_commits_path
    return self.get_push_events_commits_parent_dir(date=date, create_if_not_exists=True) / \
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 69, in get_push_events_commits_parent_dir
    path.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  [Previous line repeated 4 more times]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [Errno 30] Read-only file system: '/data'

Add CI

@patrickstephens1, adding Travis CI or Cirrus CI to this repo will help to check that my PR for #2 doesn't break anything.

Report issues - data does not add up

I found 2 issues in the locally generated reports:

People appear in the contributor ranking report but are missing from the repository commits report.
For example:

cat OSCI_Contributors_ranking_YTD_2022-01-31.csv | grep enderborg
Sony,peter enderborg,xxxxxxxxxxxxx@xxxxxxxxxx,57

cat Company-contributors-repository-commits_YTD_2022-01-31.csv | grep enderborg

returns nothing.

Do you have any idea why some people are missing?

In the same report, my contributions are counted separately for the same email address:

cat OSCI_Contributors_ranking_YTD_2022-01-31.csv | grep Alin
Sony,Alin Jerpelea,xxxxxxxxxxxxx@xxxxxxxxxx,90
Sony,Alin,xxxxxxxxxxxxx@xxxxxxxxxx,56

(the email address is the same)

Thanks

Savannah OSCI

The goal is to create and automate analysis of repos hosted on Savannah (https://savannah.gnu.org/). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

  1. A solution that crawls push event commit (PEC) data containing the following required fields:
    • event creation date;
    • commit author (email address, name);
    • SHA.
  2. Adapt the existing pipeline to process Savannah data.

We did a high-level technical analysis of the feasibility of making an OSCI for repos hosted on Savannah. This is a summary of our findings:

| Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc.) |
| --- | --- | --- |
| Is this site free to use for open source projects? | yes | |
| Does it look like this site hosts many open source projects? | yes | In total (all projects): "23990 registered users, 3829 hosted projects" |
| Is there a public API we can query? | no | however, we can parse HTML pages |
| API type | - | |
| API URL | - | |
| Query Limits (if any) | - | |
| Is there a paid access with more information? | - | |
| Is it possible to query the project license? | yes | by parsing the HTML page |
| Is it possible to query commit events/commit counts by a user in a time period? | no | |
| Is it possible to query email address or else some organization information for the person making a commit? | yes | by parsing the HTML page (email address) |
| Is there a public archive we can use instead of the public API? | no | |
| Any additional Information worth knowing? | yes | it is possible to get information about commits by parsing web pages if the project is based on Git (a sketch follows below) |
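
A hedged scraping sketch, since Savannah exposes no API; the URL is a placeholder and the selectors would have to be adapted to the real page structure:

import requests
from bs4 import BeautifulSoup

def fetch_page(url: str) -> BeautifulSoup:
    """Download a Savannah page and parse it for further extraction."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

# placeholder project URL; a real crawler would walk the project index first
page = fetch_page('https://savannah.gnu.org/projects/example-project')
for link in page.find_all('a'):
    print(link.get('href'), link.get_text(strip=True))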

OS Distribution vs OS Development Index

Releasing code on GitHub doesn't mean that the project will be properly maintained once the company loses interest in it. Project survivability is better if, in addition to "Open Source as a Distribution Model", the project also exhibits an open governance and participation model and helps people to socialize. In contrast, many companies who release code in the open are not interested in supporting external contributions and communicating with the general public. Depending on such projects in the long term is risky.

A metric that shows the commitment of companies towards the OS Distribution Model rather than open governance and development could help not only with evaluating whether a solution is sustainable, but also with drafting more effective Open Source Policies in companies.

For researchers and initiatives such as SustainOSS, it will also be beneficial to get a deeper analysis of the survivability and inclusiveness of projects with and without company support. To draft best practices it is necessary to know how many companies commit only to their own repos, and how many of them collaborate with other companies and individual maintainers. This data can then be used for further analysis of whether corporate sponsorship of projects through foundations such as the Django Software Foundation, the PSF, etc. provides more value than forking and maintaining one's own toolsets, and can also be used as an argument for businesses to support such foundations.

We are all interested in using maintained and secure solutions. It may turn out that using open development models not only helps projects survive in the long term, but also provides secondary benefits, such as spreading good engineering practices, socializing, and onboarding newcomers.

Find your own company

Currently both the list of companies and their email domains are hardcoded. At the very least, an instruction on how to build the stats for your own company or employer could be added.

There is a concern that at other companies the requirement to use a corporate email for contributions is not so strict, especially if contributors or maintainers are hired or sponsored by corporations to do some jobs.

Decouple MS SQL queries from local filesystem

The OSCI software uses MS SQL-specific instructions to load data from a file on the local filesystem. The error is easily reproducible when running MS SQL on a different host or in a CI/CD container - https://travis-ci.com/epam/OSCI/builds/148387944#L369

[SQL Server]Cannot bulk load. The file "/home/travis/build/epam/OSCI/resources/2019-01-01-9-formatted.json" does not exist or you don\'t have file access rights. (4860)

This is a blocker for porting the code to an open source database (#2) and for integration tests.

script: echo "Skipped til fix https://github.com/epam/OSCI/issues/9" # - python -m pytest test/integration

Extension of analytics scope (Add licenses and programming languages)

Plans

We plan to expand the scope of research.

We want to add two new reports:

  1. OSCI_Languages_YTD: a report on the number of company commits per programming language since the beginning of the year.
  2. OSCI_Licenses_YTD: a report on the number of company commits per repository license since the beginning of the year.

TODO

OSCI Languages YTD

  1. create a transformation function which takes push event commits as input and returns a commit-count report grouped by company and language (a sketch follows the example output);
  2. create a Spark job;
  3. create a CLI command for this job;
  4. add the job to the daily-osci-rankings CLI command.

Example output:

| company | language | commits |
| --- | --- | --- |
| Google | python | 50 |
| Google | go | 30 |
| Microsoft | typescript | 40 |
| Microsoft | powershell | 20 |
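
A minimal PySpark sketch of step 1, assuming the push event commits DataFrame already carries 'company', 'language' and 'sha' columns (the real column names may differ):

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def get_company_language_commits(push_events_commits: DataFrame) -> DataFrame:
    """Count distinct commits per company and repository language."""
    return (push_events_commits
            .groupBy('company', 'language')
            .agg(F.countDistinct('sha').alias('commits'))
            .orderBy(F.col('commits').desc()))

The OSCI_Licenses_YTD transformation below would be the same aggregation with 'license' in place of 'language'.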

OSCI Licenses YTD

  1. create a transformation function which takes push event commits as input and returns a commit-count report grouped by company and license;
  2. create a Spark job;
  3. create a CLI command for this job;
  4. add the job to the daily-osci-rankings CLI command.

Example output:

| company | license | commits |
| --- | --- | --- |
| Google | apache-2.0 | 50 |
| Google | mit | 30 |
| Microsoft | gpl-3.0 | 40 |
| Microsoft | lgpl-2.1 | 20 |

License clarity

Looks like some files in this repo were copied from the Mypy project (471ccc3). It would be nice to see that EPAM actually fulfilled all license obligations towards the open source projects it is using, giving proper credit where due.

Something was missed in Industry drop-down list

Hello,
I examined https://opensourceindex.io/

What I did:

  • Clicked the Industry menu filter
  • Unchecked "select all"
  • Scrolled down
  • Noticed an empty line between the entries "Healthcare & Pharma" and "Public Sector"

What I got:

  • A line for the FARFETCH organization.

What I expected:

  • No empty lines inside the Industry drop-down menu filter.

I checked osci/preprocess/match_company/company_domain_match_list.yaml and I assume that the industry field was missed for the FARFETCH company.

Possible solutions:

  1. Add the industry field for FARFETCH with the value "Retail & Hospitality" or "Other".
  2. Add validation rules for the YAML file, e.g. "there is a set of mandatory fields; these fields cannot be empty" (a sketch follows below).
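
A minimal validation sketch for solution 2; the mandatory-field set is an assumption for illustration, not an existing OSCI rule:

import yaml

MANDATORY_FIELDS = ('company', 'domains', 'industry')  # assumed mandatory set

def validate_match_list(path: str) -> list:
    """Return error messages for entries that are missing mandatory fields."""
    with open(path) as f:
        entries = yaml.safe_load(f)
    errors = []
    for entry in entries:
        missing = [field for field in MANDATORY_FIELDS if not entry.get(field)]
        if missing:
            errors.append(f"{entry.get('company', '<unnamed>')}: missing {', '.join(missing)}")
    return errors

print(validate_match_list('osci/preprocess/match_company/company_domain_match_list.yaml'))

Run as part of CI, this would catch entries like FARFETCH's before they reach the site.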

Thanks.

Extension of analytics scope (Add companies contributors reports)

Plans

We plan to expand the scope of research.

We want to add two new reports:

  1. Company-contributors-repository-commits: a report on the number of company commits per repository (with information about the license and programming language) per day (this data needs to be recorded in Google BigQuery).
  2. OSCI_Contributors_YTD: a report on the number of commits by company employees since the beginning of the year.

TODO

Company contributors repository commits

  1. create a transformation function which takes push event commits as input and returns the number of commits grouped by company, repository (language, license) and contributor for a day;
  2. create a Spark job;
  3. add the job to the daily-osci-rankings CLI command.

Example output:

| author_email | author_name | repo_name | language | license | company | commits |
| --- | --- | --- | --- | --- | --- | --- |
| [email protected] | Lorem | datalayer-contrib/hadoop | Java | apache-2.0 | Cloudera | 3 |
| [email protected] | Ipsum | docker-library/docs | Shell | mit | Infosiftr | 200 |
| [email protected] | Dolor | golang/go | Go | other | Google | 12 |
| [email protected] | Sit | konnectors/darty | JavaScript | agpl-3.0 | Renovateapp | 4 |

OSCI Contributors YTD

  1. create a transformation function which takes push event commits as input and returns the top 5 contributors for each company based on the number of commits (a sketch follows the example output);
  2. create a Spark job;
  3. create a CLI command for this job;
  4. add the job to the daily-osci-rankings CLI command.

Example output:

| company | author | author_email | commits |
| --- | --- | --- | --- |
| Google | Lorem | [email protected] | 50 |
| Google | Ipsum | [email protected] | 30 |
| Google | Dolor | [email protected] | 20 |
| Microsoft | Sit | [email protected] | 40 |
| Microsoft | Amet | [email protected] | 20 |
| Microsoft | Consectetur | [email protected] | 10 |
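
A minimal PySpark sketch of step 1, assuming 'company', 'author_name', 'author_email' and 'sha' columns (the real column names may differ):

import pyspark.sql.functions as F
from pyspark.sql import DataFrame, Window

def get_top_contributors(push_events_commits: DataFrame, top_n: int = 5) -> DataFrame:
    """Top-N contributors per company by distinct commit count."""
    counts = (push_events_commits
              .groupBy('company', 'author_name', 'author_email')
              .agg(F.countDistinct('sha').alias('commits')))
    window = Window.partitionBy('company').orderBy(F.col('commits').desc())
    return (counts
            .withColumn('rank', F.row_number().over(window))
            .where(F.col('rank') <= top_n)
            .drop('rank'))

Note that grouping by both name and email keeps name variants of the same email address separate, which is exactly the duplication reported in "Report issues - data does not add up" above; grouping by email alone would merge them.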

OSCI shows wrong count of ACTIVE CONTRIBUTORS in yearly data

When the YEARLY data is chosen, the list shows the same numbers of ACTIVE CONTRIBUTORS and TOTAL COMMUNITY per year and company as for January of the same year. The data therefore seems to be wrong, and at least the ranking by year could be inaccurate.
Could this problem be fixed?
