epam / osci Goto Github PK

View Code? Open in Web Editor NEW

151.0 14.0 91.0 1.12 MB

Open Source Contributor Index

Home Page: https://opensourceindex.io/

License: GNU General Public License v3.0

Python 98.91% Dockerfile 0.15% HTML 0.95%

open-source analytics pyspark python azure-functions

osci's Issues

Certificate expired

The https certificate for opensourceindex.io expired yesterday

Add number of repositories contributed to.

It would be neat to see the total number of repositories every organization from the index was contributing to.

Is a not-merged PR considered an Active Contribution?

If a user makes a contribution to a branch or a fork of a repository but these contributions are not merged Into master or original repository, is this still considered to be an Active Contribution?

TypeError: an integer is required (got type bytes)

when i run the command below

python3 osci.py daily-osci-rankings -td 2020-01-02

i get the following error

Traceback (most recent call last):
  File "osci.py", line 44, in <module>
    from cli import company_rankers
  File "/home/mirai/OSCI/cli/company_rankers.py", line 23, in <module>
    from __app__.jobs.contributors_ranking import ContributorsRankingJob
  File "/home/mirai/OSCI/__app__/jobs/contributors_ranking.py", line 17, in <module>
    from pyspark.sql import DataFrame
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/context.py", line 31, in <module>
    from pyspark import accumulators
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/serializers.py", line 72, in <module>
    from pyspark import cloudpickle
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

os: Ubuntu 20.04.1 LTS
python : Python 3.8.5
pip: 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)

any ideas whats causing this ?

Find your own company

Currently both the list of companies and emails is hardcoded. At the very least as instruction how to build the stats for your company or employee can be added.

There is a concern that for other companies the requirements to use corporate email for contributions is not so strict, especially if contributors or maintainers are hired or sponsored by corporations to do some jobs.

Adobe to acquire Magento

Adobe announced that it was acquiring Magento for $1.68 billion.

Adobe to Acquire Magento Commerce | news.adobe.com
Magento is Now Part of Adobe | magento.com

Magento.com site is now a blog under Adobe.
They announce proudly that they are an Adobe Company.

We suggest to merge them

Indexing Subsidiaries and their email domains

Hello,

What is the policy for addition of subsidiaries to the OSCI index ?
Is it a decision of parent company or of EPAM/OSCI on how to list the email domains ?

Should all subsidiaries of a company be under same umbrella (and index calculation) of parent company
OR
For each subsidiary with a different email domain, a new company addition (and index calculation) with a subsidiary name should be made ?

For example, a company "X" has 2 subsidiaries "Y" and "Z"

Option A:

company: X
  domains:
    - X.com
    - Y.com
    - Z.com

Option B:

company: X
  domains:
    - X.com
    
company: Y
  domains:
    - Y.com
    
company: Z
  domains:
    - Z.com

Which would be acceptable ?

Kibana is misclassified as Open Source

https://opensourceindex.io/?company=Elastic reports https://github.com/elastic/kibana/ as a "top repo"

But Kibana is no longer Open Source since 2021-02
elastic/kibana#90099

The forks are also misclassified:

pgayvallet/kibana
davismcphee/kibana
rshen91/kibana
etc

Unable to get data

I am trying to run the example commands, and there doesn't seem to be a /data folder, or the permissions for it are wrong. Am I missing something? Thank you.

Error:

➜  OSCI git:(master) python3 osci.py get-github-daily-push-events -d 2020-01-01
[2021-02-11 17:55:21,671] [INFO] ENV: None
[2021-02-11 17:55:21,671] [DEBUG] Check config file for env local exists
[2021-02-11 17:55:21,671] [DEBUG] Read config from /Users/richard/src/OSCI/__app__/config/files/local.yml
[2021-02-11 17:55:21,674] [INFO] Configuration loaded for env: local
[2021-02-11 17:55:21,674] [DEBUG] Create new <class '__app__.config.base.LocalFileSystemConfig'>
[2021-02-11 17:55:21,674] [DEBUG] Create new <class '__app__.config.base.Config'>
[2021-02-11 17:55:21,676] [DEBUG] Create new <class '__app__.datalake.datalake.DataLake'>
[2021-02-11 17:55:21,681] [INFO] Crawl events for 2020-01-01 00:00:00
[2021-02-11 17:55:21,681] [INFO] Load events for date: 2020-01-01 00:00:00
[2021-02-11 17:55:21,691] [DEBUG] Starting new HTTPS connection (1): data.gharchive.org:443
[2021-02-11 17:55:21,968] [DEBUG] https://data.gharchive.org:443 "GET /2020-01-01-0.json.gz HTTP/1.1" 200 15670114
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020/01/01'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020/01'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "osci.py", line 75, in <module>
    cli(standalone_mode=False)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/richard/src/OSCI/cli/gharchive.py", line 36, in get_github_daily_push_events
    gharchive.get_github_daily_push_events(day=day)
  File "/Users/richard/src/OSCI/__app__/crawlers/github/gharchive.py", line 34, in get_github_daily_push_events
    DataLake().landing.save_push_events_commits(push_event_commits=push_events_commits, date=day)
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 39, in save_push_events_commits
    file_path = self._get_hourly_push_events_commits_path(date)
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 73, in _get_hourly_push_events_commits_path
    return self.get_push_events_commits_parent_dir(date=date, create_if_not_exists=True) / \
  File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 69, in get_push_events_commits_parent_dir
    path.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
    self.parent.mkdir(parents=True, exist_ok=True)
  [Previous line repeated 4 more times]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [Errno 30] Read-only file system: '/data'

Extension of analytics scope (Add companies contributors reports)

Plans

We plan to expand the scope of research.

We want to add two new reports:

Company-contributors-repository-commits: a report on the number of the company commits in the repository (with information about the license and programming language) per day (this data is required to record in Google BQ).
OSCI_Contributors_YTD: report on the number of commits by company employees since the beginning of the year.

TODO

Company contributors repository commits

create transformation function, which gets push events commits as input and returns the number of commits report grouped by company, repository (language, license), contributor for a day;
create spark job;
add job to daily-osci-rankings cli command.

Example output:

author_email	author_name	repo_name	language	license	company	commits
[email protected]	Lorem	datalayer-contrib/hadoop	Java	apache-2.0	Cloudera	3
[email protected]	Ipsum	docker-library/docs	Shell	mit	Infosiftr	200
[email protected]	Dolor	golang/go	Go	other	Google	12
[email protected]	Sit	konnectors/darty	JavaScript	agpl-3.0	Renovateapp	4

OSCI Contributors YTD

create transformation function, which gets push events commits as input and returns top-5 contributors for all companies based on the amount of commits;
create spark job;
create cli command for this job;
add job to daily-osci-rankings cli command

Example output:

company	author	author_email	commits
Google	Lorem	[email protected]	50
Google	Ipsum	[email protected]	30
Google	Dolor	[email protected]	20
Microsoft	Sit	[email protected]	40
Microsoft	Amet	[email protected]	20
Microsoft	Consectetur	[email protected]	10

License clarity

Looks like some files in this repo were copied from Mypy project (471ccc3). It would be nice to see that Epam actually fulfilled all License obligations towards open source projects it is using, giving proper credits where due.

LaunchPad OSCI

The goal is to create and automate analysis of repos hosted on LaunchPad (https://launchpad.net/). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

Solution that crawls data about push events commits (PEC) that should contain the following required fields:
- event creation date;
- commit author (email address, name);
- SHA.
Adapt existing pipeline to process LaunchPad data.

We did a high-level technical analysis on the feasability of making an OSCI for repos hosted on LaunchPad. This is a summary of our findings:

Criteria	Status (Yes/No)	Notes (e.g. about how it is possible, or limitations, etc)
Is this site free to use for open source projects?	yes
Does it look like this site hosts many open source projects?	yes	In total (all projects): “43,314 projects, 1,808,413 bugs, 1,004,760 branches, 17,009 Git repositories, 2,977,004 translations, 685,925 answers, 77,280 blueprints, and counting...” https://launchpad.net/projects/+all?batch=75Seems to be popular for Ubuntu community mainly.many repos appear to be mirrors of projects which are hosted elsewhere - need more data to provea lot of linux-focused projects
Size of user base	-	c. 4,000,000
Is there a public API we can query?	yes
API type	HTTP
API URL	http://api.launchpad.net/1.0/
Query Limits (if any)	- (to be investigated)
Is there a paid access with more information?	- (to be investigated)
Is it possible to query the project license?	Yes
Is it possible to query commit events/commit counts by a user in a time period?	Yes
Is it possible to query email address or else some organization information for the person making a commit?	Yes	email address (required an authentication)
Is there a public archive we can use instead of the public API?	no

Clarification counting method

Am I right in assuming that these are the steps you take to count the open source contributions?

get commits from push event data (GH Archive/BigQuery)
only keep commits to repositories, which do have a license (GitHub API for license info)
match author email domains for selected organizations
use the author email to identify unique contributors and count commits
count total community / active contributors

I just wanted to clarify so I better understand how to interpret the results. Great project.

OSCI shows wrong count of ACTIVE CONTRIBUTORS in yearly date

When the YEARLY data is chosen, the list shows the same number of ACTIVE CONTRIBUTORS and TOTAL COMMUNITY per year and company then for January of the same year. Therefor the data seems to be wrong and at least the ranking by year could be inaccurate.
Could this problem be fixed?

Unable to get basic example to run

Hey, folks --

I'm having trouble getting the basic example provided to run. Specifically the failure I'm encountering is at the daily-osci-rankings stage. I have confirmed that I have a functioning local version of Hadoop installed. Running on Ubuntu 20.04 LTS VPS with a fresh install.

I pulled the two most visible errors from the log out below (full log expandable at bottom of issue). It's unclear to me if they are related though.

Any help pointing me in the right direction would be appreciated!

$ python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py daily-osci-rankings -td 2020-01-02
# failure (see full log below)

# ...

[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;\n	at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)\n	at scala.Option.getOrElse(Option.scala:189)\n	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)\n	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)\n	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)\n	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)\n	at scala.Option.getOrElse(Option.scala:189)\n	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)\n	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n	at java.lang.reflect.Method.invoke(Method.java:498)\n	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n	at py4j.Gateway.invoke(Gateway.java:282)\n	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n	at py4j.commands.CallCommand.execute(CallCommand.java:79)\n	at py4j.GatewayConnection.run(GatewayConnection.java:238)\n	at java.lang.Thread.run(Thread.java:748)\n
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data

# ...

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
  File "osci-cli.py", line 93, in <module>
    cli(standalone_mode=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
    return self._execute(**self._process_params(kwargs))
  File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
    commits = osci_ranking_job.extract(to_date=to_day).cache()
  File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
    commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
  File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
    return self.spark_session.read.load(paths, **options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
    return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from

Full Error Log:

[2022-03-22 18:11:05,996] [INFO] ENV: None
[2022-03-22 18:11:05,997] [DEBUG] Check config file for env local exists
[2022-03-22 18:11:05,997] [DEBUG] Read config from /home/ubuntu/OSCI/osci/config/files/local.yml
[2022-03-22 18:11:06,000] [DEBUG] Prod yml load: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [DEBUG] Prod yml res: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Full config: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Configuration loaded for env: local
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.LocalFileSystemConfig'>
[2022-03-22 18:11:06,000] [DEBUG] {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.Config'>
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.datalake.datalake.DataLake'>
[2022-03-22 18:11:06,113] [INFO] Execute action `daily-osci-rankings`
[2022-03-22 18:11:06,113] [INFO] Action params `{'to_day': '2020-01-02'}`
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.osci_ranking.OSCIRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.commits_ranking.OSCICommitsRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.jobs.session.Session'>
[2022-03-22 18:11:06,115] [DEBUG] Loaded paths for (None 2020-01-02 00:00:00) []
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[2022-03-22 18:11:08,127] [DEBUG] Command to send: A
fb324a0d50b599ec733f3b3b1bc1d7f4d1c894100f14d4ad6f4af9db025d37ea

[2022-03-22 18:11:08,142] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,142] [DEBUG] Command to send: j
i
rj
org.apache.spark.SparkConf
e

[2022-03-22 18:11:08,143] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,143] [DEBUG] Command to send: j
i
rj
org.apache.spark.api.java.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.api.python.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.ml.python.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.mllib.api.python.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.*
e

[2022-03-22 18:11:08,144] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,144] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.api.python.*
e

[2022-03-22 18:11:08,145] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,145] [DEBUG] Command to send: j
i
rj
org.apache.spark.sql.hive.*
e

[2022-03-22 18:11:08,146] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,146] [DEBUG] Command to send: j
i
rj
scala.Tuple2
e

[2022-03-22 18:11:08,146] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,146] [DEBUG] Command to send: r
u
SparkConf
rj
e

[2022-03-22 18:11:08,147] [DEBUG] Answer received: !ycorg.apache.spark.SparkConf
[2022-03-22 18:11:08,148] [DEBUG] Command to send: i
org.apache.spark.SparkConf
bTrue
e

[2022-03-22 18:11:08,154] [DEBUG] Answer received: !yro0
[2022-03-22 18:11:08,154] [DEBUG] Command to send: c
o0
contains
sspark.serializer.objectStreamReset
e

[2022-03-22 18:11:08,158] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,158] [DEBUG] Command to send: c
o0
set
sspark.serializer.objectStreamReset
s100
e

[2022-03-22 18:11:08,158] [DEBUG] Answer received: !yro1
[2022-03-22 18:11:08,158] [DEBUG] Command to send: m
d
o1
e

[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,159] [DEBUG] Command to send: c
o0
contains
sspark.rdd.compress
e

[2022-03-22 18:11:08,159] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,159] [DEBUG] Command to send: c
o0
set
sspark.rdd.compress
sTrue
e

[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yro2
[2022-03-22 18:11:08,159] [DEBUG] Command to send: m
d
o2
e

[2022-03-22 18:11:08,159] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.master
e

[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.app.name
e

[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
contains
sspark.master
e

[2022-03-22 18:11:08,160] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,160] [DEBUG] Command to send: c
o0
get
sspark.master
e

[2022-03-22 18:11:08,161] [DEBUG] Answer received: !yslocal[*]
[2022-03-22 18:11:08,161] [DEBUG] Command to send: c
o0
contains
sspark.app.name
e

[2022-03-22 18:11:08,162] [DEBUG] Answer received: !ybtrue
[2022-03-22 18:11:08,162] [DEBUG] Command to send: c
o0
get
sspark.app.name
e

[2022-03-22 18:11:08,162] [DEBUG] Answer received: !yspyspark-shell
[2022-03-22 18:11:08,162] [DEBUG] Command to send: c
o0
contains
sspark.home
e

[2022-03-22 18:11:08,163] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:08,163] [DEBUG] Command to send: c
o0
getAll
e

[2022-03-22 18:11:08,163] [DEBUG] Answer received: !yto3
[2022-03-22 18:11:08,163] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,164] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,164] [DEBUG] Command to send: a
g
o3
i0
e

[2022-03-22 18:11:08,164] [DEBUG] Answer received: !yro4
[2022-03-22 18:11:08,164] [DEBUG] Command to send: c
o4
_1
e

[2022-03-22 18:11:08,165] [DEBUG] Answer received: !ysspark.rdd.compress
[2022-03-22 18:11:08,165] [DEBUG] Command to send: c
o4
_2
e

[2022-03-22 18:11:08,165] [DEBUG] Answer received: !ysTrue
[2022-03-22 18:11:08,166] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,166] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,166] [DEBUG] Command to send: a
g
o3
i1
e

[2022-03-22 18:11:08,166] [DEBUG] Answer received: !yro5
[2022-03-22 18:11:08,166] [DEBUG] Command to send: c
o5
_1
e

[2022-03-22 18:11:08,166] [DEBUG] Answer received: !ysspark.serializer.objectStreamReset
[2022-03-22 18:11:08,167] [DEBUG] Command to send: c
o5
_2
e

[2022-03-22 18:11:08,167] [DEBUG] Answer received: !ys100
[2022-03-22 18:11:08,167] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,167] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,167] [DEBUG] Command to send: a
g
o3
i2
e

[2022-03-22 18:11:08,167] [DEBUG] Answer received: !yro6
[2022-03-22 18:11:08,167] [DEBUG] Command to send: c
o6
_1
e

[2022-03-22 18:11:08,170] [DEBUG] Answer received: !ysspark.master
[2022-03-22 18:11:08,170] [DEBUG] Command to send: c
o6
_2
e

[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yslocal[*]
[2022-03-22 18:11:08,171] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,171] [DEBUG] Command to send: a
g
o3
i3
e

[2022-03-22 18:11:08,171] [DEBUG] Answer received: !yro7
[2022-03-22 18:11:08,171] [DEBUG] Command to send: c
o7
_1
e

[2022-03-22 18:11:08,172] [DEBUG] Answer received: !ysspark.submit.pyFiles
[2022-03-22 18:11:08,172] [DEBUG] Command to send: c
o7
_2
e

[2022-03-22 18:11:08,172] [DEBUG] Answer received: !ys
[2022-03-22 18:11:08,172] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,172] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,172] [DEBUG] Command to send: a
g
o3
i4
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yro8
[2022-03-22 18:11:08,173] [DEBUG] Command to send: c
o8
_1
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !ysspark.submit.deployMode
[2022-03-22 18:11:08,173] [DEBUG] Command to send: c
o8
_2
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !ysclient
[2022-03-22 18:11:08,173] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,173] [DEBUG] Command to send: a
g
o3
i5
e

[2022-03-22 18:11:08,173] [DEBUG] Answer received: !yro9
[2022-03-22 18:11:08,174] [DEBUG] Command to send: c
o9
_1
e

[2022-03-22 18:11:08,174] [DEBUG] Answer received: !ysspark.ui.showConsoleProgress
[2022-03-22 18:11:08,174] [DEBUG] Command to send: c
o9
_2
e

[2022-03-22 18:11:08,174] [DEBUG] Answer received: !ystrue
[2022-03-22 18:11:08,174] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,174] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,174] [DEBUG] Command to send: a
g
o3
i6
e

[2022-03-22 18:11:08,174] [DEBUG] Answer received: !yro10
[2022-03-22 18:11:08,175] [DEBUG] Command to send: c
o10
_1
e

[2022-03-22 18:11:08,175] [DEBUG] Answer received: !ysspark.app.name
[2022-03-22 18:11:08,175] [DEBUG] Command to send: c
o10
_2
e

[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yspyspark-shell
[2022-03-22 18:11:08,175] [DEBUG] Command to send: a
e
o3
e

[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yi7
[2022-03-22 18:11:08,175] [DEBUG] Command to send: m
d
o3
e

[2022-03-22 18:11:08,175] [DEBUG] Answer received: !yv
[2022-03-22 18:11:08,175] [DEBUG] Command to send: r
u
JavaSparkContext
rj
e

[2022-03-22 18:11:08,186] [DEBUG] Answer received: !ycorg.apache.spark.api.java.JavaSparkContext
[2022-03-22 18:11:08,186] [DEBUG] Command to send: i
org.apache.spark.api.java.JavaSparkContext
ro0
e

[2022-03-22 18:11:09,483] [DEBUG] Answer received: !yro11
[2022-03-22 18:11:09,483] [DEBUG] Command to send: c
o11
sc
e

[2022-03-22 18:11:09,489] [DEBUG] Answer received: !yro12
[2022-03-22 18:11:09,490] [DEBUG] Command to send: c
o12
conf
e

[2022-03-22 18:11:09,499] [DEBUG] Answer received: !yro13
[2022-03-22 18:11:09,500] [DEBUG] Command to send: r
u
PythonAccumulatorV2
rj
e

[2022-03-22 18:11:09,501] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonAccumulatorV2
[2022-03-22 18:11:09,502] [DEBUG] Command to send: i
org.apache.spark.api.python.PythonAccumulatorV2
s127.0.0.1
i45879
sfb324a0d50b599ec733f3b3b1bc1d7f4d1c894100f14d4ad6f4af9db025d37ea
e

[2022-03-22 18:11:09,502] [DEBUG] Answer received: !yro14
[2022-03-22 18:11:09,502] [DEBUG] Command to send: c
o11
sc
e

[2022-03-22 18:11:09,502] [DEBUG] Answer received: !yro15
[2022-03-22 18:11:09,503] [DEBUG] Command to send: c
o15
register
ro14
e

[2022-03-22 18:11:09,505] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,505] [DEBUG] Command to send: r
u
PythonUtils
rj
e

[2022-03-22 18:11:09,506] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonUtils
[2022-03-22 18:11:09,506] [DEBUG] Command to send: r
m
org.apache.spark.api.python.PythonUtils
isEncryptionEnabled
e

[2022-03-22 18:11:09,506] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,506] [DEBUG] Command to send: c
z:org.apache.spark.api.python.PythonUtils
isEncryptionEnabled
ro11
e

[2022-03-22 18:11:09,507] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:09,508] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:09,509] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,510] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:09,510] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,510] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:09,510] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,511] [DEBUG] Command to send: r
u
org.apache.spark.SparkFiles
rj
e

[2022-03-22 18:11:09,511] [DEBUG] Answer received: !ycorg.apache.spark.SparkFiles
[2022-03-22 18:11:09,511] [DEBUG] Command to send: r
m
org.apache.spark.SparkFiles
getRootDirectory
e

[2022-03-22 18:11:09,511] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,511] [DEBUG] Command to send: c
z:org.apache.spark.SparkFiles
getRootDirectory
e

[2022-03-22 18:11:09,512] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda/userFiles-58b63090-eb7f-4872-8939-2710678287d1
[2022-03-22 18:11:09,512] [DEBUG] Command to send: c
o13
get
sspark.submit.pyFiles
s
e

[2022-03-22 18:11:09,512] [DEBUG] Answer received: !ys
[2022-03-22 18:11:09,513] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:09,514] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,514] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,515] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,515] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e

[2022-03-22 18:11:09,515] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,516] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e

[2022-03-22 18:11:09,517] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:09,517] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
getLocalDir
e

[2022-03-22 18:11:09,519] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,519] [DEBUG] Command to send: c
o11
sc
e

[2022-03-22 18:11:09,519] [DEBUG] Answer received: !yro16
[2022-03-22 18:11:09,519] [DEBUG] Command to send: c
o16
conf
e

[2022-03-22 18:11:09,520] [DEBUG] Answer received: !yro17
[2022-03-22 18:11:09,520] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
getLocalDir
ro17
e

[2022-03-22 18:11:09,520] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda
[2022-03-22 18:11:09,520] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:09,521] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,521] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:09,522] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,522] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:09,522] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,522] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e

[2022-03-22 18:11:09,523] [DEBUG] Answer received: !yp
[2022-03-22 18:11:09,523] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e

[2022-03-22 18:11:09,523] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:09,523] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
createTempDir
e

[2022-03-22 18:11:09,523] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,524] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
createTempDir
s/tmp/spark-133764be-4844-4a91-a340-210c1b419fda
spyspark
e

[2022-03-22 18:11:09,524] [DEBUG] Answer received: !yro18
[2022-03-22 18:11:09,524] [DEBUG] Command to send: c
o18
getAbsolutePath
e

[2022-03-22 18:11:09,525] [DEBUG] Answer received: !ys/tmp/spark-133764be-4844-4a91-a340-210c1b419fda/pyspark-bc66966b-69a0-4a5b-b7ab-b0b7c8e45101
[2022-03-22 18:11:09,525] [DEBUG] Command to send: c
o13
get
sspark.python.profile
sfalse
e

[2022-03-22 18:11:09,525] [DEBUG] Answer received: !ysfalse
[2022-03-22 18:11:09,525] [DEBUG] Command to send: r
u
SparkSession
rj
e

[2022-03-22 18:11:09,544] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,545] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
getDefaultSession
e

[2022-03-22 18:11:09,567] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,567] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
getDefaultSession
e

[2022-03-22 18:11:09,568] [DEBUG] Answer received: !yro19
[2022-03-22 18:11:09,568] [DEBUG] Command to send: c
o19
isDefined
e

[2022-03-22 18:11:09,569] [DEBUG] Answer received: !ybfalse
[2022-03-22 18:11:09,569] [DEBUG] Command to send: r
u
SparkSession
rj
e

[2022-03-22 18:11:09,570] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,570] [DEBUG] Command to send: c
o11
sc
e

[2022-03-22 18:11:09,571] [DEBUG] Answer received: !yro20
[2022-03-22 18:11:09,571] [DEBUG] Command to send: i
org.apache.spark.sql.SparkSession
ro20
e

[2022-03-22 18:11:09,620] [DEBUG] Answer received: !yro21
[2022-03-22 18:11:09,620] [DEBUG] Command to send: c
o21
sqlContext
e

[2022-03-22 18:11:09,621] [DEBUG] Answer received: !yro22
[2022-03-22 18:11:09,621] [DEBUG] Command to send: r
u
SparkSession
rj
e

[2022-03-22 18:11:09,622] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,622] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
setDefaultSession
e

[2022-03-22 18:11:09,623] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,623] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
setDefaultSession
ro21
e

[2022-03-22 18:11:09,623] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,623] [DEBUG] Command to send: r
u
SparkSession
rj
e

[2022-03-22 18:11:09,624] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,624] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
setActiveSession
e

[2022-03-22 18:11:09,624] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,624] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
setActiveSession
ro21
e

[2022-03-22 18:11:09,625] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,625] [DEBUG] Command to send: c
o22
read
e

[2022-03-22 18:11:10,432] [DEBUG] Answer received: !yro23
[2022-03-22 18:11:10,432] [DEBUG] Command to send: r
u
PythonUtils
rj
e

[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonUtils
[2022-03-22 18:11:10,433] [DEBUG] Command to send: r
m
org.apache.spark.api.python.PythonUtils
toSeq
e

[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ym
[2022-03-22 18:11:10,433] [DEBUG] Command to send: i
java.util.ArrayList
e

[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ylo24
[2022-03-22 18:11:10,434] [DEBUG] Command to send: c
z:org.apache.spark.api.python.PythonUtils
toSeq
ro24
e

[2022-03-22 18:11:10,434] [DEBUG] Answer received: !yro25
[2022-03-22 18:11:10,434] [DEBUG] Command to send: m
d
o24
e

[2022-03-22 18:11:10,435] [DEBUG] Answer received: !yv
[2022-03-22 18:11:10,435] [DEBUG] Command to send: c
o23
load
ro25
e

22/03/22 18:11:10 WARN DataSource: All paths were ignored:
  

[Stage 0:>                                                          (0 + 1) / 1]

                                                                                
[2022-03-22 18:11:11,839] [DEBUG] Answer received: !xro26
[2022-03-22 18:11:11,839] [DEBUG] Command to send: c
o26
toString
e

[2022-03-22 18:11:11,840] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
[2022-03-22 18:11:11,840] [DEBUG] Command to send: c
o26
getCause
e

[2022-03-22 18:11:11,840] [DEBUG] Answer received: !yn
[2022-03-22 18:11:11,840] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:11,842] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,842] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:11,844] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,844] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:11,848] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,848] [DEBUG] Command to send: r
u
org.apache.spark.util
rj
e

[2022-03-22 18:11:11,849] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,849] [DEBUG] Command to send: r
u
org.apache.spark.util.Utils
rj
e

[2022-03-22 18:11:11,849] [DEBUG] Answer received: !ycorg.apache.spark.util.Utils
[2022-03-22 18:11:11,849] [DEBUG] Command to send: r
m
org.apache.spark.util.Utils
exceptionString
e

[2022-03-22 18:11:11,849] [DEBUG] Answer received: !ym
[2022-03-22 18:11:11,849] [DEBUG] Command to send: c
z:org.apache.spark.util.Utils
exceptionString
ro26
e

[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;\n	at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)\n	at scala.Option.getOrElse(Option.scala:189)\n	at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)\n	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)\n	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)\n	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)\n	at scala.Option.getOrElse(Option.scala:189)\n	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)\n	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n	at java.lang.reflect.Method.invoke(Method.java:498)\n	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n	at py4j.Gateway.invoke(Gateway.java:282)\n	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n	at py4j.commands.CallCommand.execute(CallCommand.java:79)\n	at py4j.GatewayConnection.run(GatewayConnection.java:238)\n	at java.lang.Thread.run(Thread.java:748)\n
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data
[2022-03-22 18:11:11,852] [DEBUG] Command to send: m
d
o0
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o4
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o5
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o6
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o7
e

[2022-03-22 18:11:11,853] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,853] [DEBUG] Command to send: m
d
o8
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o9
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o10
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o12
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o15
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,854] [DEBUG] Command to send: m
d
o16
e

[2022-03-22 18:11:11,854] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o17
e

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o18
e

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o19
e

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,855] [DEBUG] Command to send: m
d
o20
e

[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
  File "osci-cli.py", line 93, in <module>
    cli(standalone_mode=False)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
    return self._execute(**self._process_params(kwargs))
  File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
    commits = osci_ranking_job.extract(to_date=to_day).cache()
  File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
    commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
  File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
    return self.spark_session.read.load(paths, **options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
    return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException[2022-03-22 18:11:11,879] [DEBUG] Command to send: r
u
org
rj
e

[2022-03-22 18:11:11,881] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,881] [DEBUG] Command to send: r
u
org.apache
rj
e

[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark
rj
e

[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark.sql
rj
e

[2022-03-22 18:11:11,882] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,882] [DEBUG] Command to send: r
u
org.apache.spark.sql.internal
rj
e

[2022-03-22 18:11:11,883] [DEBUG] Answer received: !yp
[2022-03-22 18:11:11,883] [DEBUG] Command to send: r
u
org.apache.spark.sql.internal.SQLConf
rj
e

[2022-03-22 18:11:11,883] [DEBUG] Answer received: !ycorg.apache.spark.sql.internal.SQLConf
[2022-03-22 18:11:11,883] [DEBUG] Command to send: r
m
org.apache.spark.sql.internal.SQLConf
get
e

[2022-03-22 18:11:11,885] [DEBUG] Answer received: !ym
[2022-03-22 18:11:11,885] [DEBUG] Command to send: c
z:org.apache.spark.sql.internal.SQLConf
get
e

[2022-03-22 18:11:11,885] [DEBUG] Answer received: !yro27
[2022-03-22 18:11:11,885] [DEBUG] Command to send: c
o27
pysparkJVMStacktraceEnabled
e

[2022-03-22 18:11:11,886] [DEBUG] Answer received: !ybfalse
: Unable to infer schema for Parquet. It must be specified manually.;
[2022-03-22 18:11:11,924] [DEBUG] Command to send: m
d
o27
e

[2022-03-22 18:11:11,927] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,965] [DEBUG] Command to send: m
d
o26
e

[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,966] [DEBUG] Command to send: m
d
o25
e

[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv
[2022-03-22 18:11:11,966] [DEBUG] Command to send: m
d
o23
e

[2022-03-22 18:11:11,966] [DEBUG] Answer received: !yv

HashiCorp projects are misclassified as Open Source

https://opensourceindex.io/?company=HashiCorp reports some projects in "top repo"

"HashiCorp, the vendor of Vagrant, Terraform, and a number of other deployment-automation tools, is changing its software license to the Business Source License." Source: https://www.theregister.com/2023/08/11/hashicorp_bsl_licence/

Identify and classify non-standard open source licenses in GitHub repos

The goal is to identify the license used by GitHub repos where these are not classified automatically by GitHub. As stated in the GitHub API documentation https://developer.github.com/v3/licenses/, the open source Ruby Gem Licensee (https://github.com/licensee/licensee) is used to identify the license type.

Licensee automates the process of reading LICENSE files and compares their contents to known licenses using a several strategies (which we call "Matchers"). It attempts to determine a project's license in the following order:

If the license file has an explicit copyright notice, and nothing more (e.g., Copyright (c) 2015 Ben Balter), we'll assume the author intends to retain all rights, and thus the project isn't licensed.
If the license is an exact match to a known license. If we strip away whitespace and copyright notice, we might get lucky, and direct string comparison in Ruby is cheap.
If we still can't match the license, we use a fancy math thing called the Sørensen – Dice coefficient, which is really good at calculating the similarity between two strings. By calculating the percent changed from the known license to the license file, you can tell, e.g., that a given license is 95% similar to the MIT license, that 5% likely representing legally insignificant changes to the license text.

We published a blog with an analysis of open source licenses used on GitHub (https://solutionshub.epam.com/blog/post/examining-open-source-license-usage) but it includes a lot of repos which have a "custom" license. Our manual analysis of a selection of these shows that many organizations use licenses which are minor modifications of standard licenses, but GitHub does not automatically identify their license. If this could be improved, we would be able to make a better analysis of popularity of open source repos among commercial organizations as well as across all of GitHub.

The goal is to identify such licenses with a high level of probability, e.g. content of the LICENSE file is 90+% the same as the standard Apache license text.

Our suggestion would be do this for the top 3-6 license types, which based on our analysis are Apache 2.0, BSD 3-clause, MIT, GPL 2.0, GPL 3.0, LGPKL 2.1, EPM 1.0.

Some examples:
https://github.com/MicrosoftDocs/microsoft-365-docs/blob/public/LICENSE - Creative Commons license
https://github.com/dotnet/runtime/blob/master/LICENSE.TXT - MIT license
https://github.com/IBM-Cloud/webapp-with-cos-and-cdn/blob/master/License.txt, https://github.com/IBM-Cloud/serverless-followupapp-ios/blob/master/License.txt- Apache 2.0
https://github.com/strongloop/loopback.io/blob/gh-pages/LICENSE, https://github.com/strongloop/loopback-next/blob/master/LICENSE - MIT license
https://github.com/mono/mono/blob/master/LICENSE - mix of licenses, so won't be possible to identify a single license type

Add CI

@patrickstephens1 adding Travis CI or Cirrus CI to this repo, will help to see if my PR for #2 doesn't break anything.

Savannah OSCI

The goal is to create and automate analysis of repos hosted on Savannah (https://savannah.gnu.org/). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

Solution that crawls data about push events commits (PEC) that should contain the following required fields:
- event creation date;
- commit author (email address, name);
- SHA.
Adapt existing pipeline to process Savannah data.

We did a high-level technical analysis on the feasability of making an OSCI for repos hosted on Savannah. This is a summary of our findings:

Criteria	Status (Yes/No)	Notes (e.g. about how it is possible, or limitations, etc)
Is this site free to use for open source projects?	yes
Does it look like this site hosts many open source projects?	yes	In total (all projects): "23990 registered users, 3829 hosted projects"
Is there a public API we can query?	no	however, we can parse HTML pages
API type	-
API URL	-
Query Limits (if any)	-
Is there a paid access with more information?	-
Is it possible to query the project license?	Yes	by parsing HTML page
Is it possible to query commit events/commit counts by a user in a time period?	no
Is it possible to query email address or else some organization information for the person making a commit?	Yes	by parsing HTML pageemail address
Is there a public archive we can use instead of the public API?	no
Any additional Information worth knowing?	yes	it is possible to get information about commits with using parsing web pages if project is based on GIT

Decouple MS SQL queries from local filesystem

OSCI software uses MS SQL specific instructions to load data from a file on a local filesystem. The error is easily repeatable when using MS SQL on different hosts or in CI/CD container - https://travis-ci.com/epam/OSCI/builds/148387944#L369

[SQL Server]Cannot bulk load. The file "/home/travis/build/epam/OSCI/resources/2019-01-01-9-formatted.json" does not exist or you don\'t have file access rights. (4860)

Blocker for porting code to an open source database (#2) and for integration tests.

OSCI/.travis.yml

Line 24 in b42ee51

    
           script: echo "Skipped til fix https://github.com/epam/OSCI/issues/9" # - python -m pytest test/integration

Location-based OSCI rating

Proposed by @abitrolly.

Add the ability to filter companies by regions (created after discussion in #5 and #6)

Country of registration
Country of presence
Country of origin (native perception)
This will require some external datasets.

Timezones affecting monthly filter drop-down around end of month?

Hi,

Similar to #118, I'm eagerly looking for my company's stats which I think should appear for March (we were added in v2022.03.0) but I don't see March in the drop-down...

Gitlab OSCI

The goal is to create and automate analysis of repos hosted on GitLab (https://gitlab.com). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

Solution that crawls data about push events commits (PEC) that should contain the following required fields:
- event creation date;
- commit author (email address, name);
- SHA.
Adapt existing pipeline to process Gitlab data.

We did a high-level technical analysis on the feasability of making an OSCI for repos hosted on Gitlab. This is a summary of our findings:

Criteria	Status (Yes/No)	Notes (e.g. about how it is possible, or limitations, etc)
Is this site free to use for open source projects?	yes	Seems to be free with basic features (they sell more advanced CI/CD features)
Does it look like this site hosts many open source projects?	yes
Is there a public API we can query?	yes
API type	REST
API URL	https://gitlab.com/api/v4/
Query Limits (if any)	600 queries per 60 second period
Is there a paid access with more information?	no
Is it possible to query the project license?	no
Is it possible to query commit events/commit counts by a user in a time period?	no
Is it possible to query email address or else some organization information for the person making a commit?	yes	email address
Is there a public archive we can use instead of the public API?	no
Any additional Information worth knowing?	no

OS Distribution vs OS Development Index

Releasing code on GitHub doesn't mean that the project will be properly maintained once company looses interest in it. Project survivability is better if in addition to "Open Source as a Distribution Model" the project also exhibits open governance and participation model, and helps people to socialize. In contrast, many companies who release code in open, are not interested to support external contributions and communicate with general public. Depending on such projects in a long term is risky.

The metric that shows the commitment of companies towards OS Distribution Model rather than open governance and development, could help not only with evaluation a solution as to be sustainable, but also with help with drafting more effective Open Source Policies in companies.

For researchers and initiatives, such as SustainOSS, it will also be beneficial to get deeper analysis into survivability and inclusiveness of projects with and without companies support. To draft the best practices is it necessary to know how many companies are committing only into their own repos, and how many of them collaborate with other companies and individual maintainers. This data then be used for further analysis if corporate sponsorships for the projects such as Django Foundation, PFS etc. provides more value than forking and maintaining own toolsets, and also can be used as an argument for business to support such foundations.

We are all interested in using maintained and secure solutions. It may happen that using open development models not only helps projects survive in a long term, but also provides secondary benefits, such as spreading good engineering practices, socializing and onboarding newcomers.

Filtering bots from OSCI ranking

The goal is to improve our existing OSCI code which ranks companies on the basis of the number of commits, because the current situation is that there appear to be large number of of commits done by automated processes associated with GitHub accounts that have a company (commercial organization) email domain. These skew the ranking of companies based on commits, which is precisely why our OSCI ranking is based on number of contributors rather than number of commits.

For example, when we look at the OSCI commit-based company counts to end June 2020, we see

OrgName	Commits
Microsoft	640009
GitHub	519108
Renovateapp	472705
Google	379847
Red Hat	331087
Travis CI	195377
Intel	150613
IBM	131510
Exoplatform	125844
Odoo	113452
Pyup	82118

However, Renovateapp, Travis CI, Exoplatform and Pyup do not feature highly in our OSCI countributor-based company ranking. In fact, Renovateapp has only 4 active contributors, Travis CI has 67, Exoplatform has 41, Pyup has 4.

When we dig deeper into this, we see:

This is top of commits authors for Pyup:

Company	AuthorName	Commits
Pyup	pyup-bot	349717
Pyup	pyup.io bot	10146
Pyup	pyup.io vuln bot	22
Pyup	pyup.io bot (via Travis CI)	1

As you can see all of them are bots.
The same picture for Renovateapp:

Company	AuthorName	Commits
Renovateapp	Renovate Bot	2348935
Renovateapp	WhiteSource Renovate	65148
Renovateapp	Renovate Bot (via Travis CI)	358
Renovateapp	renovate-bot	63
Renovateapp	Rhys Arkins	3

TravisCI (Top 10 by commits):

Company	AuthorName	Commits
Travis CI	Deployment Bot (from Travis CI)	426727
Travis CI	Travis CI	92799
Travis CI	travis-ci	11824
Travis CI	TravisCI	9511
Travis CI	Travis	8128
Travis CI	Deployment Bot (Travis)	7723
Travis CI	Deployment Bot	1917
Travis CI	raveit65	1322
Travis CI	Piotr Milcarz	1317
Travis CI	Travis Build Bot (from Travis CI)	1015

The biggest part of commits comming from bots

We would like a way to filter out these automated processes/bot commits, so that we could more accurately generate a ranking of companies based on commits.

One obvious way is to simply have a 'blacklist' of GitHub accounts / email addresses, but perhaps something more sophisticated could be devised, based on 'unhuman' levels of activity.

At the moment, we are using the domain <-> company match list, which filters companies from the top that we form. Perhaps the problem of bots can be solved by creating a similar list that will filter out bots.

Report issues - data does not add up

I found 2 issues on the local generated reports:

people appear in the contributor ranking report but it are missing from the repository commits
EX:
cat OSCI_Contributors_ranking_YTD_2022-01-31.csv | grep enderborg
Sony,peter enderborg,xxxxxxxxxxxxx@xxxxxxxxxx,57

cat Company-contributors-repository-commits_YTD_2022-01-31.csv | grep enderborg
returns nothing

Do you have any idea why some persons are missing ?

In the same report my contributions are counted separate for the same email address
cat OSCI_Contributors_ranking_YTD_2022-01-31.csv | grep Alin
Sony,Alin Jerpelea,xxxxxxxxxxxxx@xxxxxxxxxx,90
Sony,Alin,xxxxxxxxxxxxx@xxxxxxxxxx,56
(the email address is the same)

Thanks

Is https://opensourceindex.io/ not updated anymore?

Is the data on https://opensourceindex.io/ automatically updated? It still shows data from 2023, no new data from 2024.

Something was missed in Industry drop-down list

Hello,
I examine https://opensourceindex.io/

What I did:
-Click Industry menu filter
-Uncheck "select all"
-Scroll down
-Check empty line between lines "Healthcare & Pharma" and "Public Sector"

What I got:
-Line for FARFETCH organization.

What I expect:
-I don't have empty lines inside Industry drop-down menu filter

I check - osci/preprocess/match_company/company_domain_match_list.yaml
I assume that industry field was missed for FARFETCH company

Possible solutions:

Add industry field for FARFETCH with value: "Retail & Hospitality" or "Other"
Add rules fields in yaml file. Example: "we have set of mandatory fields, this fields cannot be empty"

Thanks.

Looking to obtain data in csv format

I am collaborating with Duane O'Brien on a Digital Infrastructure Project called "Fostering Open Collaboration" (FOCUSED) which is researching open source program offices within companies.

I would like to access the data on the Open Source Contributor Index (OSCI) website.
Source GitHub repo: https://github.com/epam/OSCI

I am looking for guidance on how to obtain it, or a link to the data that appears there, as a .CSV file.

Thank you.

cc: @DuaneOBrien

How to run OSCI in 2023?

Dear team,

i am trying to run a local osci instllation to get some stats. After failing installing tools in ubuntu 22.04 lts, i switched to 20.04 lts and got at least dependencies installed using pip and pyhton 3.8.

Now the first step of the pipeline is failing, i.e. running python3 osci-cli.py get-github-daily-push-events -d 2020-01-02

[2023-07-11 12:09:27,286] [ERROR] Failed to parse json: . Error: Expecting value: line 1 column 1 (char 0) [2023-07-11 12:09:27,329] [INFO] Save push events commits for 2020-01-02 00:00:00 into file /data/landing/github/events/push/2020/01/02/2020-01-02-0.parquet Traceback (most recent call last): File "osci-cli.py", line 93, in <module> cli(standalone_mode=False) File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__ return self.main(*args, **kwargs) File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 782, in main rv = self.invoke(ctx) File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke return ctx.invoke(self.callback, **ctx.params) File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke return callback(*args, **kwargs) File "/home/sten/Desktop/OSCI/osci/actions/base.py", line 59, in execute return self._execute(**self._process_params(kwargs)) File "/home/sten/Desktop/OSCI/osci/actions/load/load.py", line 34, in _execute return get_github_daily_push_events(day=day) File "/home/sten/Desktop/OSCI/osci/crawlers/github/gharchive.py", line 34, in get_github_daily_push_events DataLake().landing.save_push_events_commits(push_event_commits=push_events_commits, date=day) File "/home/sten/Desktop/OSCI/osci/datalake/local/landing.py", line 42, in save_push_events_commits log.info(f'Push events commits df info {get_pandas_data_frame_info(df)}') File "/home/sten/Desktop/OSCI/osci/utils.py", line 46, in get_pandas_data_frame_info df.info(buf=buf) File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 2497, in info mem_usage = self.memory_usage(index=True, deep=deep).sum() File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 2590, in memory_usage result = Series(self.index.memory_usage(deep=deep), index=["Index"]).append( File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/series.py", line 305, in __init__ data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True) File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/construction.py", line 465, in sanitize_array subarr = construct_1d_arraylike_from_scalar(value, len(index), dtype) File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1452, in construct_1d_arraylike_from_scalar subarr = np.empty(length, dtype=dtype) TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type

Would docker be a stable environment to run? My aim is to count github conributions based on some email-regexps.

Thanks!

Data inconsistency or update issues?

I wanted to check if there are some problems with the data? Did you change anything or reduce the frequency of the updates? Or maybe is it just a bug?

Here are my current findings:

Monthly report shows only January (Maybe too early for February?)
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/monthly/2023/01.json (01 available)
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/monthly/2023/02.json (02 not available)
Daily reports
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/daily/2023-01-11.json (01/11 available but no update)
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/daily/2023-01-12.json (01/12 available but no update)
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/daily/2023-01-13.json (01/13 available but no update)
- ...
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/daily/2023-01-31.json (01/31 available but no update)
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/daily/2023-02-01.json (02/01 available but empty)
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/daily/2023-02-02.json (02/02 not available)
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/daily/2022-02-20.json (02/20 available)
- https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/osci-ranking/daily/2022-02-28.json (02/28 available but not displayed)

How to obtain country name for the records in the dataset

Hello,
I was looking for the country that the OSCI company would be headquartered in.
Would you know what's the best way to obtain this information?

Thank you.

For anyone else looking in on this issue we collated the data for OSCI into a series of files that are available for all to be downloaded here: https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/share/OSCI_change_ranking.zip

cc: @DuaneOBrien

MongoDB is misclassified as Open Source

https://opensourceindex.io/?company=mongoDB reports https://github.com/mongodb/mongo as a "top repo"

"Versions released prior to October 16, 2018 are published under the AGPL. All versions released after October 16, 2018, including patch fixes for prior versions, are published under the Server Side Public License (SSPL) v1"
Source: https://github.com/mongodb/mongo#license

Unable to filter by Industry for Feb 2022

Clicking on Industry to filter does not display any options.
Note: It works for Jan 2022.

SourceForge OSCI

The goal is to create and automate analysis of repos hosted on SourceForge (https://sourceforge.net/). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

Solution that crawls data about push events commits (PEC) that should contain the following required fields:
- event creation date;
- commit author (email address, name);
- SHA.
Adapt existing pipeline to process SourceForge data.

We did a high-level technical analysis on the feasability of making an OSCI for repos hosted on SourceForge. This is a summary of our findings:

Criteria	Status (Yes/No)	Notes (e.g. about how it is possible, or limitations, etc)
Is this site free to use for open source projects?	yes
Does it look like this site hosts many open source projects?	yes	“over 430,000 projects”Popular in open source community.BUT, it hosts a lot of binaries and mirrors of repos which are primarily hosted on github or elsewhere.
Size of user base	-	"we host over 3.7 million registered users”
Is there a public API we can query?	yes
API type	not studied yet
API URL	not studied yet
Query Limits (if any)	not studied yet
Is there a paid access with more information?	not studied yet
Is it possible to query the project license?	not studied yet
Is it possible to query commit events/commit counts by a user in a time period?	not studied yet
Is it possible to query email address or else some organization information for the person making a commit?	not studied yet
Is there a public archive we can use instead of the public API?	not studied yet
Any additional Information worth knowing?	not studied yet

Calculating ranking of an organization

Hello,
My understanding is that OSCI considers commits made by an organization AFTER it has been added and any commits BEFORE will NOT be considered for ranking.

Example: I add Societe Generale in September. OSCI does NOT consider any commits made by Soc Gen before September and the ranking is given based on commits made from September to date.

Is that right?

Improve identification of committers' organizations

The current OSCI implementation uses the email domain of the committer to identify their organization. Many developers do not use their company email address on GitHub, or do not make their email address public. However many of these people do include their organizational information in their GitHub user profiles.

We would like to improve the identification of committers organization using the data in their user profiles.

<<<>>>
We already made an experiment to do this, but with minimal success. This is described below.
The basic matching algorithm works like this:

The domain is selected from the commiter's email;
Each domain is compared with the list of company domains (google.com, microsoft.com, etc) regardless of case;
If no match is found, a regular expression analysis is performed for situations with domains of 3 and higher levels.

If after applying the basic algorithm the matching did not occur, an extended algorithm was proposed:

The profile information on the user's Github is uploaded;
The website field is taken from the profile, if it is empty, then go to step 3. Otherwise, the basic algorithm is applied on the specified domain. If no matches occurred after applying the basic algorithm, go to step 3.
The company field is taken and compared with the list of companies that we are processing. (Fuzzy band algorithms were used: Levenshtein distance, Sorenson-Dice coefficient, etc)

Result of experiment:
For only 38% of all the users examined, we managed to match a company from their profile. The remaining profiles did not have a clear match. For milder match rules, only 5% is added.

It is also worth noting that for users where we managed to match their company from their profile, the company is the same as that received from the email in all cases.

Finally, this method of identifying company carries a large overhead. When implementing this approach, it will be necessary to download information for all users who made push events in 2020, their number (as of June 2020) is 5M - loading their profiles will take about 42 days calculating with GitHub API usage limits. We would also have to additionally load new profiles every day, and their download, in turn, may not fit into the daily usage limits.

Filter only on projects with open-source licenses

(Housekeeping - I move the original issue written here by @abitrolly into #8)

Enhance the OSCI algorithm to filter only projects with open-source licenses.
This will require some external datasets.

Use Postgres instead of MS SQL

Please, guys, don't be cynical.

OSCI/dbconnector.py

Line 30 in 67443b6

self.conn = pyodbc.connect('Driver={SQL Server};'

Measuring company support for known OSS public projects

While OSCI is primarily a reputation tool, it could actually do some good things if it could provide a ground for companies to support and compete in this user story.

What especially companies can do, if they like to support poetry in a way, is to give there employees a fixed time they can spend to support poetry. Having more people, who can contribute on a regular bases, would help us a lot. They don't necessarily need to start coding. Looking into the issue tracker and finding duplicates or outdated tickets or answer question there is important as well and would give us more time in actually fixing bugs or implement new features.

python-poetry/poetry#4160 (comment)

Bitbucket OSCI

The goal is to create and automate analysis of repos hosted on BitBucket. This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.

Solution that crawls data about push events commits (PEC) that should contain the following required fields:
- event creation date;
- commit author (email address, name);
- SHA.
Adapt existing pipeline to process Bitbucket data.

We did a high-level technical analysis on the feasability of making an OSCI for repos hosted on Bitbucket. This is a summary of our findings:

Criteria	Status (Yes/No)	Notes (e.g. about how it is possible, or limitations, etc)
Is this site free to use for open source projects?	yes	Seems to be free only for teams under 5 people unless you request a community license.
Does it look like this site hosts many open source projects?	unclear	It's not clear that there are large numbers of open source projects hosted.Most public projects seem to be non-commercial (not outsourced by companies). It appears most users do not use company domains - need to investigate further.The pages/repos of many companies appear to be not active.
Size of user base	-	In the order of 5,000,000 users
Is there a public API we can query?	yes
API type	REST
API URL	https://api.bitbucket.org/2.0
Query Limits (if any)	1,000 per hour 60,000 per hour
Is there a paid access with more information?	- (to be investigated)
Is it possible to query the project license?	Yes
Is it possible to query commit events/commit counts by a user in a time period?	Yes	/repositories?before=timestamp&after=timestamp e.g. https://api.bitbucket.org/2.0/repositories?after=2020-03-01T09%3A37%3A06.254721%2B00%3A00
Is it possible to query email address or else some organization information for the person making a commit?	Yes	email address
Is there a public archive we can use instead of the public API?	no

If you have additional questions, feel free to contact our team.

Sponsorship clarity

In this paragraph - https://github.com/epam/OSCI#where-can-i-see-the-latest-rankings

This project is sponsored by EPAM Systems and the latest results are visible on the EPAM SolutionsHub OSCI page.

It is better to clarify that no sponsorship is available to outside contributors.

Unusual spike in data? (starting Nov 21)

While visualizing some of the data OSCI provides I noticed an unusual spike in the data starting end of October/beginning of November 2021 and potentially still ongoing. I was wondering, if there was a change to how active contributors or community are measured?

This graph shows the development for Google over the last 3 years. Besides the usual spikes at the beginning of each year, there is an additional spike in November and after that the daily increase seems to be higher than in previous periodes as well.

The spike is there for different companies and not specific for the above example.

Company missing from the latest report

Hi and thanks for this project :)

We recently raised a PR to add Expedia Group to the list.
Unfortunately we couldn't find the company in the latest report published today. Have we missed anything in the PR or we can expect to make it in the next report?

Exclude company's own projects filter

I think it would be pertinent to include a filter that excludes contributions to the company's own open source projects.
As much as I enjoy seeing the numbers I feel like it would be amazing to see which companies contribute outside of their own circle of influence the most, this could shift the rankings somewhat and showcase a bit more of the open source community on the top lists.

Throwing this out there as an idea, absolutely understand if this is not relevant to this project but maybe something worth thinking about!

SmartBear not appearing in the list

As mentioned in #122, I'm puzzled why SmartBear does not appear in the list, now that we're into the month of April.

We were added in we were added in v2022.03.0.

I'm not sure if this is user error on my part, so please let me know if there's something I'm misunderstanding, or we need to make some additional configuration.

Here's an example of a commit that I think should have been counted: https://github.com/cucumber/cucumber-js/commits/d98c6deabd39e1adb8e52a1a65324662143108e8

Query on ranking

Hello,
My organization (Societe Generale) as on Feb 2022 seems to have total community number as 7 (+5) and 0 active contributors. Also it is indicated that we are down by 3 places.
I'd like to know why we are placed at 284 and not 276?

When the no of active contributors are equal, what are considered while ranking?
Please can you help with this query?

Extension of analytics scope (Add licenses and programming languages)

Plans

We plan to expand the scope of research.

We want to add two new reports:

OSCI_Languages_YTD: report on the number of the company commits in the programming language since the beginning of the year.
OSCI_Licenses_YTD: report on the number of the company commits in the repository with a license since the beginning of the year.

TODO

OSCI Languages YTD

create transformation function, which gets push events commits as input and returns the amount of commits report grouped by company and language;
create spark job;
create cli command for this job;
add job to daily-osci-rankings cli command.

Example output:

company	language	commits
Google	python	50
Google	go	30
Microsoft	typescript	40
Microsoft	powershell	20

OSCI Licenses YTD

create transformation function, which gets push events commits as input and returns the amount of commits report grouped by company and license
create spark job
create cli command for this job
add job to daily-osci-rankings cli command

Example output:

company	license	commits
Google	apache-2.0	50
Google	mit	30
Microsoft	gpl-3.0	40
Microsoft	lgpl-2.1	20

Active contributor : query

Hi,
Say if developer X makes 10 commits in Jan 2021 and does another 10 in the next month i.e. Feb 2021 and there is developer Y who makes 10 commits in the month of Feb 2021 only.

Does the active contributor number for the organization show 2 or 3?

I assume it is 2.
Please can someone help confirm.

epam / osci Goto Github PK

osci's Issues

Plans

TODO

Company contributors repository commits

OSCI Contributors YTD

Plans

TODO

OSCI Languages YTD

OSCI Licenses YTD

Recommend Projects

Recommend Topics

Recommend Org