epam / osci
Open Source Contributor Index
Home Page: https://opensourceindex.io/
License: GNU General Public License v3.0
The HTTPS certificate for opensourceindex.io expired yesterday.
It would be neat to see the total number of repositories each organization from the index is contributing to.
If a user contributes to a branch or a fork of a repository, but these contributions are not merged into master or the original repository, is this still considered an Active Contribution?
When I run the command below
python3 osci.py daily-osci-rankings -td 2020-01-02
I get the following error:
Traceback (most recent call last):
File "osci.py", line 44, in <module>
from cli import company_rankers
File "/home/mirai/OSCI/cli/company_rankers.py", line 23, in <module>
from __app__.jobs.contributors_ranking import ContributorsRankingJob
File "/home/mirai/OSCI/__app__/jobs/contributors_ranking.py", line 17, in <module>
from pyspark.sql import DataFrame
File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/__init__.py", line 51, in <module>
from pyspark.context import SparkContext
File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/context.py", line 31, in <module>
from pyspark import accumulators
File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/accumulators.py", line 97, in <module>
from pyspark.serializers import read_int, PickleSerializer
File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/serializers.py", line 72, in <module>
from pyspark import cloudpickle
File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 145, in <module>
_cell_set_template_code = _make_cell_set_template_code()
File "/home/mirai/.local/lib/python3.8/site-packages/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
return types.CodeType(
TypeError: an integer is required (got type bytes)
os: Ubuntu 20.04.1 LTS
python : Python 3.8.5
pip: 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)
Any ideas what's causing this?
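This traceback looks like the well-known incompatibility between the cloudpickle bundled with older pyspark releases and Python 3.8: `types.CodeType` gained a `posonlyargcount` parameter in 3.8, so cloudpickle's positional call shifts every later argument and an integer slot receives bytes. The usual fix is upgrading to a pyspark release that supports Python 3.8 (3.0+), or running under Python 3.7. A minimal sketch of the check (the docstring probe is an assumption about CPython's `code` docstring, which lists the constructor parameters):

```python
import sys
import types

def cloudpickle_codetype_mismatch() -> bool:
    """Return True when this interpreter's types.CodeType takes
    `posonlyargcount` (added in Python 3.8).  Old cloudpickle builds
    CodeType positionally without it, so later arguments are shifted and
    an int slot receives bytes -- exactly the
    "an integer is required (got type bytes)" TypeError above."""
    return 'posonlyargcount' in (types.CodeType.__doc__ or '')

print(sys.version_info[:2], cloudpickle_codetype_mismatch())
```

If this prints `True`, the bundled cloudpickle in the installed pyspark cannot construct code objects on this interpreter.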
Currently both the list of companies and the email domains are hardcoded. At the very least, instructions on how to build the stats for your own company or employees could be added.
There is a concern that other companies are not so strict about requiring a corporate email for contributions, especially when contributors or maintainers are hired or sponsored by corporations to do the work.
Adobe announced that it was acquiring Magento for $1.68 billion.
Magento.com site is now a blog under Adobe.
It proudly announces that Magento is an Adobe company.
We suggest merging them.
Hello,
What is the policy for adding subsidiaries to the OSCI index?
Is it the parent company's decision, or EPAM/OSCI's, how the email domains are listed?
Should all subsidiaries of a company fall under the same umbrella (and index calculation) as the parent company,
OR
should each subsidiary with a different email domain be added as a new company (with its own index calculation)?
For example, a company "X" has two subsidiaries, "Y" and "Z".

Option A:

company: X
domains:
  - X.com
  - Y.com
  - Z.com

Option B:

company: X
domains:
  - X.com

company: Y
domains:
  - Y.com

company: Z
domains:
  - Z.com

Which would be acceptable?
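In practice the two options differ only in the email-domain-to-company mapping used when commits are attributed. A sketch of the difference, using the placeholder names from the example above:

```python
# Option A: subsidiary domains folded into the parent company.
option_a = {'X.com': 'X', 'Y.com': 'X', 'Z.com': 'X'}

# Option B: each subsidiary listed (and ranked) separately.
option_b = {'X.com': 'X', 'Y.com': 'Y', 'Z.com': 'Z'}

def company_for(email: str, domains: dict) -> str:
    """Attribute a commit author to a company by email domain."""
    return domains.get(email.split('@', 1)[1], 'Unknown')

print(company_for('dev@Y.com', option_a))  # -> X
print(company_for('dev@Y.com', option_b))  # -> Y
```

Under Option A the subsidiary's commits raise the parent's ranking; under Option B each subsidiary competes on its own.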
https://opensourceindex.io/?company=Elastic reports https://github.com/elastic/kibana/ as a "top repo".
But Kibana has not been open source since 2021-02:
elastic/kibana#90099
The forks are also misclassified.
I am trying to run the example commands, and there doesn't seem to be a /data folder, or its permissions are wrong. Am I missing something? Thank you.
Error:
➜ OSCI git:(master) python3 osci.py get-github-daily-push-events -d 2020-01-01
[2021-02-11 17:55:21,671] [INFO] ENV: None
[2021-02-11 17:55:21,671] [DEBUG] Check config file for env local exists
[2021-02-11 17:55:21,671] [DEBUG] Read config from /Users/richard/src/OSCI/__app__/config/files/local.yml
[2021-02-11 17:55:21,674] [INFO] Configuration loaded for env: local
[2021-02-11 17:55:21,674] [DEBUG] Create new <class '__app__.config.base.LocalFileSystemConfig'>
[2021-02-11 17:55:21,674] [DEBUG] Create new <class '__app__.config.base.Config'>
[2021-02-11 17:55:21,676] [DEBUG] Create new <class '__app__.datalake.datalake.DataLake'>
[2021-02-11 17:55:21,681] [INFO] Crawl events for 2020-01-01 00:00:00
[2021-02-11 17:55:21,681] [INFO] Load events for date: 2020-01-01 00:00:00
[2021-02-11 17:55:21,691] [DEBUG] Starting new HTTPS connection (1): data.gharchive.org:443
[2021-02-11 17:55:21,968] [DEBUG] https://data.gharchive.org:443 "GET /2020-01-01-0.json.gz HTTP/1.1" 200 15670114
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020/01/01'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020/01'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push/2020'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events/push'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github/events'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing/github'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/landing'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "osci.py", line 75, in <module>
cli(standalone_mode=False)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/Users/richard/src/OSCI/cli/gharchive.py", line 36, in get_github_daily_push_events
gharchive.get_github_daily_push_events(day=day)
File "/Users/richard/src/OSCI/__app__/crawlers/github/gharchive.py", line 34, in get_github_daily_push_events
DataLake().landing.save_push_events_commits(push_event_commits=push_events_commits, date=day)
File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 39, in save_push_events_commits
file_path = self._get_hourly_push_events_commits_path(date)
File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 73, in _get_hourly_push_events_commits_path
return self.get_push_events_commits_parent_dir(date=date, create_if_not_exists=True) / \
File "/Users/richard/src/OSCI/__app__/datalake/local/landing.py", line 69, in get_push_events_commits_parent_dir
path.mkdir(parents=True, exist_ok=True)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
self.parent.mkdir(parents=True, exist_ok=True)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
self.parent.mkdir(parents=True, exist_ok=True)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1277, in mkdir
self.parent.mkdir(parents=True, exist_ok=True)
[Previous line repeated 4 more times]
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/pathlib.py", line 1273, in mkdir
self._accessor.mkdir(self, mode)
OSError: [Errno 30] Read-only file system: '/data'
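The crawler writes under the `base_path` configured in local.yml (`/data` by default), and on recent macOS the filesystem root is read-only, so `Path.mkdir` fails before any data lands. A sketch of a guard that falls back to a writable location (the directory layout is taken from the traceback above; the fallback location is an assumption for illustration):

```python
from pathlib import Path
import tempfile

def ensure_writable_base(base: str) -> Path:
    """Create `<base>/landing/github/events/push` the way the crawler
    does, falling back to a per-user temp directory when `base` is not
    writable (e.g. '/data' on a read-only root filesystem)."""
    target = Path(base) / 'landing' / 'github' / 'events' / 'push'
    try:
        target.mkdir(parents=True, exist_ok=True)
        return target
    except OSError:
        # Read-only or permission-denied root: use a writable temp dir.
        fallback = (Path(tempfile.gettempdir()) / 'osci-data'
                    / 'landing' / 'github' / 'events' / 'push')
        fallback.mkdir(parents=True, exist_ok=True)
        return fallback
```

The simpler remedy is probably just pointing `file_system.base_path` in `__app__/config/files/local.yml` at a writable directory (e.g. a `data` folder inside the checkout) so no fallback is needed.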
We plan to expand the scope of the research. We want to add two new reports:

Company-contributors-repository-commits: a report on the number of company commits in each repository (with information about the license and programming language) per day (this data needs to be recorded in Google BQ).

OSCI_Contributors_YTD: a report on the number of commits by company employees since the beginning of the year.

Both will be generated by the daily-osci-rankings cli command.

Company-contributors-repository-commits example output:
author_email | author_name | repo_name | language | license | company | commits |
---|---|---|---|---|---|---|
[email protected] | Lorem | datalayer-contrib/hadoop | Java | apache-2.0 | Cloudera | 3 |
[email protected] | Ipsum | docker-library/docs | Shell | mit | Infosiftr | 200 |
[email protected] | Dolor | golang/go | Go | other | | 12 |
[email protected] | Sit | konnectors/darty | JavaScript | agpl-3.0 | Renovateapp | 4 |
OSCI_Contributors_YTD example output:
company | author | author_email | commits |
---|---|---|---|
 | Lorem | [email protected] | 50 |
 | Ipsum | [email protected] | 30 |
 | Dolor | [email protected] | 20 |
Microsoft | Sit | [email protected] | 40 |
Microsoft | Amet | [email protected] | 20 |
Microsoft | Consectetur | [email protected] | 10 |
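The YTD report above is essentially a group-by over daily commit rows. A sketch of the aggregation with placeholder rows (company names and emails are illustrative, not from real data):

```python
from collections import Counter

# Placeholder daily rows: (company, author_email, commits).
daily_rows = [
    ('Microsoft', 'sit@example.com', 40),
    ('Microsoft', 'amet@example.com', 20),
    ('Microsoft', 'sit@example.com', 5),
]

def ytd_contributors(rows):
    """Roll daily commit counts up to (company, author_email) totals,
    as the OSCI_Contributors_YTD report would."""
    totals = Counter()
    for company, email, commits in rows:
        totals[(company, email)] += commits
    return totals

print(ytd_contributors(daily_rows)[('Microsoft', 'sit@example.com')])  # -> 45
```

Sorting the resulting totals per company in descending order gives the table layout shown above.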
Looks like some files in this repo were copied from the Mypy project (471ccc3). It would be nice to see that EPAM actually fulfilled all license obligations towards the open source projects it is using, giving proper credit where due.
The goal is to create and automate analysis of repos hosted on Launchpad (https://launchpad.net/). This would be similar to our existing OSCI ranking, which analyses repos hosted on GitHub, with a focus on activity by commercial organizations.
We did a high-level technical analysis of the feasibility of making an OSCI for repos hosted on Launchpad. This is a summary of our findings:
Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc) |
---|---|---|
Is this site free to use for open source projects? | yes | |
Does it look like this site hosts many open source projects? | yes | In total (all projects): "43,314 projects, 1,808,413 bugs, 1,004,760 branches, 17,009 Git repositories, 2,977,004 translations, 685,925 answers, 77,280 blueprints, and counting..." (https://launchpad.net/projects/+all?batch=75). Seems to be popular mainly with the Ubuntu community; many repos appear to be mirrors of projects hosted elsewhere (need more data to prove); a lot of Linux-focused projects. |
Size of user base | - | c. 4,000,000 |
Is there a public API we can query? | yes | |
API type | HTTP | |
API URL | http://api.launchpad.net/1.0/ | |
Query Limits (if any) | - (to be investigated) | |
Is there a paid access with more information? | - (to be investigated) | |
Is it possible to query the project license? | Yes | |
Is it possible to query commit events/commit counts by a user in a time period? | Yes | |
Is it possible to query the email address or other organization information for the person making a commit? | Yes | email address (requires authentication) |
Is there a public archive we can use instead of the public API? | no | |
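Queries against the HTTP API root from the table above are plain GETs that return JSON. A sketch of the URL construction (the `ws.size` batching parameter follows the launchpadlib web-service conventions and should be verified against the Launchpad API docs; `ubuntu` is just an example project name):

```python
from urllib.parse import urlencode

API_ROOT = 'http://api.launchpad.net/1.0/'  # from the feasibility table

def project_url(name: str) -> str:
    """Entry URL for a single project, e.g. .../1.0/ubuntu."""
    return API_ROOT + name

def collection_query(collection: str, **params) -> str:
    """Build a query URL against a top-level collection such as
    'projects'; parameter names are assumptions to be checked against
    the API reference."""
    query = urlencode(params)
    return API_ROOT + collection + (('?' + query) if query else '')

print(collection_query('projects', **{'ws.size': 75}))
```

Fetching these URLs with any HTTP client returns JSON collections that can be paged through for the crawl.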
Am I right in assuming that these are the steps you take to count the open source contributions?
I just wanted to clarify so I better understand how to interpret the results. Great project.
When the YEARLY data is chosen, the list shows the same number of ACTIVE CONTRIBUTORS and TOTAL COMMUNITY per year and company as for January of the same year. Therefore the data seems to be wrong, and at least the ranking by year could be inaccurate.
Could this problem be fixed?
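A hedged illustration of why identical yearly and January numbers suggest an aggregation bug: a correct yearly active-contributor count should be the size of the union of the monthly contributor sets, which is normally larger than any single month (the monthly sets below are hypothetical):

```python
# Hypothetical monthly contributor sets for one company.
monthly = {
    'Jan': {'a@x.com', 'b@x.com'},
    'Feb': {'b@x.com', 'c@x.com'},
}

# Yearly active contributors = union of the monthly sets, not a
# single month's snapshot.
yearly = set().union(*monthly.values())
print(len(monthly['Jan']), len(yearly))  # -> 2 3
```

If the yearly view instead reuses the January snapshot, the two counts coincide, which matches the behaviour reported above.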
Hey, folks --
I'm having trouble getting the basic provided example to run. Specifically, the failure I'm encountering is at the daily-osci-rankings
stage. I have confirmed that I have a functioning local install of Hadoop. Running on an Ubuntu 20.04 LTS VPS with a fresh install.
I pulled the two most visible errors out of the log below (full log expandable at the bottom of the issue). It's unclear to me whether they are related, though.
Any help pointing me in the right direction would be appreciated!
$ python3 osci-cli.py get-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py process-github-daily-push-events -d 2020-01-01
# success
$ python3 osci-cli.py daily-osci-rankings -td 2020-01-02
# failure (see full log below)
# ...
[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;\n at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)\n at scala.Option.getOrElse(Option.scala:189)\n at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)\n at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)\n at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)\n at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)\n at scala.Option.getOrElse(Option.scala:189)\n at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)\n at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n at java.lang.reflect.Method.invoke(Method.java:498)\n at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n at py4j.Gateway.invoke(Gateway.java:282)\n at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n at py4j.commands.CallCommand.execute(CallCommand.java:79)\n at py4j.GatewayConnection.run(GatewayConnection.java:238)\n at java.lang.Thread.run(Thread.java:748)\n
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data
# ...
[2022-03-22 18:11:11,855] [DEBUG] Answer received: !yv
Traceback (most recent call last):
File "osci-cli.py", line 93, in <module>
cli(standalone_mode=False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
return self._execute(**self._process_params(kwargs))
File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
commits = osci_ranking_job.extract(to_date=to_day).cache()
File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
return self.spark_session.read.load(paths, **options)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
[2022-03-22 18:11:05,996] [INFO] ENV: None
[2022-03-22 18:11:05,997] [DEBUG] Check config file for env local exists
[2022-03-22 18:11:05,997] [DEBUG] Read config from /home/ubuntu/OSCI/osci/config/files/local.yml
[2022-03-22 18:11:06,000] [DEBUG] Prod yml load: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [DEBUG] Prod yml res: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Full config: {'meta': {'config_source': 'yaml'}, 'file_system': {'type': 'local', 'base_path': '/data'}, 'areas': {'landing': {'container': 'landing'}, 'staging': {'container': 'staging'}, 'public': {'container': 'public'}}, 'bq': {'project': '', 'secret': '{}'}, 'web': {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}, 'github': {'token': ''}, 'company': {'default': 'EPAM'}}
[2022-03-22 18:11:06,000] [INFO] Configuration loaded for env: local
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.LocalFileSystemConfig'>
[2022-03-22 18:11:06,000] [DEBUG] {'fs': 'local', 'base_path': '/web', 'account_name': '', 'account_key': '', 'container': 'data'}
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.config.base.Config'>
[2022-03-22 18:11:06,000] [DEBUG] Create new <class 'osci.datalake.datalake.DataLake'>
[2022-03-22 18:11:06,113] [INFO] Execute action `daily-osci-rankings`
[2022-03-22 18:11:06,113] [INFO] Action params `{'to_day': '2020-01-02'}`
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.osci_ranking.OSCIRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.datalake.reports.general.commits_ranking.OSCICommitsRankingFactory'>
[2022-03-22 18:11:06,114] [DEBUG] Create new <class 'osci.jobs.session.Session'>
[2022-03-22 18:11:06,115] [DEBUG] Loaded paths for (None 2020-01-02 00:00:00) []
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[2022-03-22 18:11:08,127] [DEBUG] Command to send: A
fb324a0d50b599ec733f3b3b1bc1d7f4d1c894100f14d4ad6f4af9db025d37ea
[... ~550 lines of py4j gateway protocol DEBUG chatter trimmed; the only diagnostic content is the effective Spark configuration it reveals: spark.master=local[*], spark.app.name=pyspark-shell, spark.submit.deployMode=client, spark.rdd.compress=True, spark.serializer.objectStreamReset=100, spark.ui.showConsoleProgress=true, spark.submit.pyFiles=(empty), temp dir /tmp/spark-133764be-4844-4a91-a340-210c1b419fda ...]
e
[2022-03-22 18:11:09,570] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,570] [DEBUG] Command to send: c
o11
sc
e
[2022-03-22 18:11:09,571] [DEBUG] Answer received: !yro20
[2022-03-22 18:11:09,571] [DEBUG] Command to send: i
org.apache.spark.sql.SparkSession
ro20
e
[2022-03-22 18:11:09,620] [DEBUG] Answer received: !yro21
[2022-03-22 18:11:09,620] [DEBUG] Command to send: c
o21
sqlContext
e
[2022-03-22 18:11:09,621] [DEBUG] Answer received: !yro22
[2022-03-22 18:11:09,621] [DEBUG] Command to send: r
u
SparkSession
rj
e
[2022-03-22 18:11:09,622] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,622] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
setDefaultSession
e
[2022-03-22 18:11:09,623] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,623] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
setDefaultSession
ro21
e
[2022-03-22 18:11:09,623] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,623] [DEBUG] Command to send: r
u
SparkSession
rj
e
[2022-03-22 18:11:09,624] [DEBUG] Answer received: !ycorg.apache.spark.sql.SparkSession
[2022-03-22 18:11:09,624] [DEBUG] Command to send: r
m
org.apache.spark.sql.SparkSession
setActiveSession
e
[2022-03-22 18:11:09,624] [DEBUG] Answer received: !ym
[2022-03-22 18:11:09,624] [DEBUG] Command to send: c
z:org.apache.spark.sql.SparkSession
setActiveSession
ro21
e
[2022-03-22 18:11:09,625] [DEBUG] Answer received: !yv
[2022-03-22 18:11:09,625] [DEBUG] Command to send: c
o22
read
e
[2022-03-22 18:11:10,432] [DEBUG] Answer received: !yro23
[2022-03-22 18:11:10,432] [DEBUG] Command to send: r
u
PythonUtils
rj
e
[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ycorg.apache.spark.api.python.PythonUtils
[2022-03-22 18:11:10,433] [DEBUG] Command to send: r
m
org.apache.spark.api.python.PythonUtils
toSeq
e
[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ym
[2022-03-22 18:11:10,433] [DEBUG] Command to send: i
java.util.ArrayList
e
[2022-03-22 18:11:10,433] [DEBUG] Answer received: !ylo24
[2022-03-22 18:11:10,434] [DEBUG] Command to send: c
z:org.apache.spark.api.python.PythonUtils
toSeq
ro24
e
[2022-03-22 18:11:10,434] [DEBUG] Answer received: !yro25
[2022-03-22 18:11:10,434] [DEBUG] Command to send: m
d
o24
e
[2022-03-22 18:11:10,435] [DEBUG] Answer received: !yv
[2022-03-22 18:11:10,435] [DEBUG] Command to send: c
o23
load
ro25
e
22/03/22 18:11:10 WARN DataSource: All paths were ignored:
[Stage 0:> (0 + 1) / 1]
[2022-03-22 18:11:11,840] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
[py4j DEBUG trace elided: gateway calls fetching the exception's cause and full stack trace]
[2022-03-22 18:11:11,850] [DEBUG] Answer received: !ysorg.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
<osci.datalake.local.landing.LocalLandingArea object at 0x7fa5e8753f40> /data landing
<osci.datalake.local.staging.LocalStagingArea object at 0x7fa5e87609a0> /data staging
<osci.datalake.local.public.LocalPublicArea object at 0x7fa5e8760940> /data public
<osci.datalake.local.web.LocalWebArea object at 0x7fa5e8760a90> /web data
Traceback (most recent call last):
File "osci-cli.py", line 93, in <module>
cli(standalone_mode=False)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/OSCI/osci/actions/base.py", line 59, in execute
return self._execute(**self._process_params(kwargs))
File "/home/ubuntu/OSCI/osci/actions/process/generate_daily_osci_rankings.py", line 49, in _execute
commits = osci_ranking_job.extract(to_date=to_day).cache()
File "/home/ubuntu/OSCI/osci/jobs/base.py", line 44, in extract
commits=Session().load_dataframe(paths=self._get_dataset_paths(to_date, from_date))
File "/home/ubuntu/OSCI/osci/jobs/session.py", line 39, in load_dataframe
return self.spark_session.read.load(paths, **options)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 182, in load
return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/home/ubuntu/.local/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/home/ubuntu/.local/lib/python3.8/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
[py4j DEBUG trace elided: gateway calls checking SQLConf.pysparkJVMStacktraceEnabled (false) while the exception message is printed]
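The `AnalysisException: Unable to infer schema for Parquet` above is raised when `spark.read.load` is pointed only at paths that are missing or contain no Parquet files — here, most likely, the push-event data for the requested dates was never downloaded into the local data lake before running the ranking. A minimal fail-fast check could make the missing-data case explicit (the helper name is ours, not part of OSCI):

```python
from pathlib import Path

def nonempty_parquet_paths(paths):
    """Keep only dataset paths that actually contain Parquet files.

    Spark's DataFrameReader raises "Unable to infer schema for Parquet"
    when every supplied path is missing or empty, so filtering first
    turns an opaque AnalysisException into an obvious "no data" error.
    """
    kept = []
    for raw in paths:
        p = Path(raw)
        if p.is_file() and p.suffix == ".parquet":
            kept.append(raw)
        elif p.is_dir() and any(p.rglob("*.parquet")):
            kept.append(raw)
    return kept
```

Calling something like this before `Session().load_dataframe(...)` and raising when the result is empty would point directly at the date whose data was never fetched.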
https://opensourceindex.io/?company=HashiCorp reports some projects in "top repo"
"HashiCorp, the vendor of Vagrant, Terraform, and a number of other deployment-automation tools, is changing its software license to the Business Source License." Source: https://www.theregister.com/2023/08/11/hashicorp_bsl_licence/
The goal is to identify the license used by GitHub repos where these are not classified automatically by GitHub. As stated in the GitHub API documentation https://developer.github.com/v3/licenses/, the open source Ruby Gem Licensee (https://github.com/licensee/licensee) is used to identify the license type.
Licensee automates the process of reading LICENSE files and compares their contents to known licenses using several strategies (which we call "Matchers"). It attempts to determine a project's license in the following order:
We published a blog with an analysis of open source licenses used on GitHub (https://solutionshub.epam.com/blog/post/examining-open-source-license-usage) but it includes a lot of repos which have a "custom" license. Our manual analysis of a selection of these shows that many organizations use licenses which are minor modifications of standard licenses, but GitHub does not automatically identify their license. If this could be improved, we would be able to make a better analysis of popularity of open source repos among commercial organizations as well as across all of GitHub.
The goal is to identify such licenses with a high level of probability, e.g. content of the LICENSE file is 90+% the same as the standard Apache license text.
Our suggestion would be to do this for the most common license types, which based on our analysis are Apache 2.0, BSD 3-clause, MIT, GPL 2.0, GPL 3.0, LGPL 2.1 and EPL 1.0.
Some examples:
https://github.com/MicrosoftDocs/microsoft-365-docs/blob/public/LICENSE - Creative Commons license
https://github.com/dotnet/runtime/blob/master/LICENSE.TXT - MIT license
https://github.com/IBM-Cloud/webapp-with-cos-and-cdn/blob/master/License.txt, https://github.com/IBM-Cloud/serverless-followupapp-ios/blob/master/License.txt - Apache 2.0
https://github.com/strongloop/loopback.io/blob/gh-pages/LICENSE, https://github.com/strongloop/loopback-next/blob/master/LICENSE - MIT license
https://github.com/mono/mono/blob/master/LICENSE - mix of licenses, so won't be possible to identify a single license type
@patrickstephens1 adding Travis CI or Cirrus CI to this repo will help to check that my PR for #2 doesn't break anything.
The goal is to create and automate analysis of repos hosted on Savannah (https://savannah.gnu.org/). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.
We did a high-level technical analysis on the feasibility of making an OSCI for repos hosted on Savannah. This is a summary of our findings:
Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc) |
---|---|---|
Is this site free to use for open source projects? | yes | |
Does it look like this site hosts many open source projects? | yes | In total (all projects): "23990 registered users, 3829 hosted projects" |
Is there a public API we can query? | no | however, we can parse HTML pages |
API type | - | |
API URL | - | |
Query Limits (if any) | - | |
Is there a paid access with more information? | - | |
Is it possible to query the project license? | Yes | by parsing HTML page |
Is it possible to query commit events/commit counts by a user in a time period? | no | |
Is it possible to query email address or else some organization information for the person making a commit? | Yes | by parsing HTML page (email address) |
Is there a public archive we can use instead of the public API? | no | |
Any additional Information worth knowing? | yes | it is possible to get information about commits by parsing web pages if the project is based on Git |
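The "parse HTML pages" approach from the table could be realised with the standard-library HTML parser. The sketch below collects committer emails from `mailto:` links; the markup in the usage line is illustrative, not Savannah's actual HTML:

```python
from html.parser import HTMLParser

class MailtoCollector(HTMLParser):
    """Collect email addresses from mailto: links on a page —
    one way to extract committer identity when there is no API."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith("mailto:"):
                    self.emails.append(value[len("mailto:"):])

p = MailtoCollector()
p.feed('<a href="mailto:dev@example.org">dev</a><a href="/tree">tree</a>')
```

After `feed()`, `p.emails` holds the addresses found on the page.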
OSCI software uses MS SQL-specific instructions to load data from a file on a local filesystem. The error is easily reproducible when using MS SQL on different hosts or in a CI/CD container - https://travis-ci.com/epam/OSCI/builds/148387944#L369
[SQL Server]Cannot bulk load. The file "/home/travis/build/epam/OSCI/resources/2019-01-01-9-formatted.json" does not exist or you don't have file access rights. (4860)
Blocker for porting code to an open source database (#2) and for integration tests.
Line 24 in b42ee51
Proposed by @abitrolly.
Add the ability to filter companies by regions (created after discussion in #5 and #6)
Country of registration
Country of presence
Country of origin (native perception)
This will require some external datasets.
Hi,
Similar to #118, I'm eagerly looking for my company's stats which I think should appear for March (we were added in v2022.03.0) but I don't see March in the drop-down...
The goal is to create and automate analysis of repos hosted on GitLab (https://gitlab.com). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.
We did a high-level technical analysis on the feasibility of making an OSCI for repos hosted on GitLab. This is a summary of our findings:
Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc) |
---|---|---|
Is this site free to use for open source projects? | yes | Seems to be free with basic features (they sell more advanced CI/CD features) |
Does it look like this site hosts many open source projects? | yes | |
Is there a public API we can query? | yes | |
API type | REST | |
API URL | https://gitlab.com/api/v4/ | |
Query Limits (if any) | 600 queries per 60 second period | |
Is there a paid access with more information? | no | |
Is it possible to query the project license? | no | |
Is it possible to query commit events/commit counts by a user in a time period? | no | |
Is it possible to query email address or else some organization information for the person making a commit? | yes | email address |
Is there a public archive we can use instead of the public API? | no | |
Any additional Information worth knowing? | no |
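Any crawler built on this API would need to respect the 600-requests-per-60-seconds limit from the table above. A minimal sliding-window limiter might look like this (the class is our sketch, not part of OSCI):

```python
import collections
import time

class RateLimiter:
    """Sliding-window limiter for a quota such as GitLab's documented
    600 requests per 60-second period."""

    def __init__(self, max_calls: int = 600, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls = collections.deque()  # timestamps of recent calls

    def wait(self):
        """Block until one more call is allowed, then record it."""
        now = time.monotonic()
        # Drop timestamps that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            time.sleep(self.period - (now - self.calls[0]))
        self.calls.append(time.monotonic())
```

Each request to https://gitlab.com/api/v4/ would simply call `limiter.wait()` first.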
Releasing code on GitHub doesn't mean that the project will be properly maintained once the company loses interest in it. Project survivability is better if, in addition to "Open Source as a Distribution Model", the project also exhibits an open governance and participation model and helps people to socialize. In contrast, many companies who release code in the open are not interested in supporting external contributions or communicating with the general public. Depending on such projects in the long term is risky.
A metric that shows the commitment of companies towards the open source distribution model rather than open governance and development could help not only with evaluating whether a solution is sustainable, but also with drafting more effective open source policies in companies.
For researchers and initiatives such as SustainOSS, it will also be beneficial to get deeper analysis into the survivability and inclusiveness of projects with and without company support. To draft best practices it is necessary to know how many companies are committing only into their own repos, and how many of them collaborate with other companies and individual maintainers. This data could then be used for further analysis of whether corporate sponsorship of foundations such as the Django Foundation, the PSF etc. provides more value than forking and maintaining one's own toolsets, and it can also be used as an argument for business to support such foundations.
We are all interested in using maintained and secure solutions. It may turn out that using open development models not only helps projects survive in the long term, but also provides secondary benefits, such as spreading good engineering practices, socializing, and onboarding newcomers.
The goal is to improve our existing OSCI code which ranks companies by number of commits. Currently, a large number of commits appear to be made by automated processes associated with GitHub accounts that have a company (commercial organization) email domain. These skew the commit-based ranking of companies, which is precisely why our OSCI ranking is based on the number of contributors rather than the number of commits.
For example, when we look at the OSCI commit-based company counts to end June 2020, we see
OrgName | Commits |
---|---|
Microsoft | 640009 |
GitHub | 519108 |
Renovateapp | 472705 |
379847 | |
Red Hat | 331087 |
Travis CI | 195377 |
Intel | 150613 |
IBM | 131510 |
Exoplatform | 125844 |
Odoo | 113452 |
Pyup | 82118 |
However, Renovateapp, Travis CI, Exoplatform and Pyup do not feature highly in our OSCI contributor-based company ranking. In fact, Renovateapp has only 4 active contributors, Travis CI has 67, Exoplatform has 41, and Pyup has 4.
When we dig deeper into this, we see:
These are the top commit authors for Pyup:
Company | AuthorName | Commits |
---|---|---|
Pyup | pyup-bot | 349717 |
Pyup | pyup.io bot | 10146 |
Pyup | pyup.io vuln bot | 22 |
Pyup | pyup.io bot (via Travis CI) | 1 |
As you can see, all of them are bots.
The same picture for Renovateapp:
Company | AuthorName | Commits |
---|---|---|
Renovateapp | Renovate Bot | 2348935 |
Renovateapp | WhiteSource Renovate | 65148 |
Renovateapp | Renovate Bot (via Travis CI) | 358 |
Renovateapp | renovate-bot | 63 |
Renovateapp | Rhys Arkins | 3 |
TravisCI (Top 10 by commits):
Company | AuthorName | Commits |
---|---|---|
Travis CI | Deployment Bot (from Travis CI) | 426727 |
Travis CI | Travis CI | 92799 |
Travis CI | travis-ci | 11824 |
Travis CI | TravisCI | 9511 |
Travis CI | Travis | 8128 |
Travis CI | Deployment Bot (Travis) | 7723 |
Travis CI | Deployment Bot | 1917 |
Travis CI | raveit65 | 1322 |
Travis CI | Piotr Milcarz | 1317 |
Travis CI | Travis Build Bot (from Travis CI) | 1015 |
The biggest part of the commits comes from bots.
We would like a way to filter out these automated processes/bot commits, so that we could more accurately generate a ranking of companies based on commits.
One obvious way is to simply have a 'blacklist' of GitHub accounts / email addresses, but perhaps something more sophisticated could be devised, based on 'unhuman' levels of activity.
At the moment, we are using the domain <-> company match list, which identifies the companies that appear in our rankings. Perhaps the problem of bots can be solved by creating a similar list that filters out bots.
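A first cut of such a bot list could be a simple pattern match on author names like those in the tables above. The pattern set below is our illustration, not an OSCI feature, and a real list would need curation just like the domain <-> company match list:

```python
import re

# Heuristic patterns drawn from the commit-author tables above
# (pyup-bot, Renovate Bot, Deployment Bot (from Travis CI), ...).
BOT_PATTERNS = re.compile(r"\b(bot|renovate|pyup|travis)\b", re.IGNORECASE)

def is_probable_bot(author_name: str) -> bool:
    """Return True when the author name looks like an automated account."""
    return bool(BOT_PATTERNS.search(author_name))
```

Filtering commits where `is_probable_bot(author)` is True before aggregating would already remove the largest outliers, though human contributors at CI vendors (e.g. "Rhys Arkins") would correctly be kept.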
I found 2 issues in the locally generated reports:
people appear in the contributor ranking report but are missing from the repository commits report
EX:
cat OSCI_Contributors_ranking_YTD_2022-01-31.csv | grep enderborg
Sony,peter enderborg,xxxxxxxxxxxxx@xxxxxxxxxx,57
cat Company-contributors-repository-commits_YTD_2022-01-31.csv | grep enderborg
returns nothing
Do you have any idea why some people are missing?
In the same report, my contributions are counted separately for the same email address:
cat OSCI_Contributors_ranking_YTD_2022-01-31.csv | grep Alin
Sony,Alin Jerpelea,xxxxxxxxxxxxx@xxxxxxxxxx,90
Sony,Alin,xxxxxxxxxxxxx@xxxxxxxxxx,56
(the email address is the same)
Thanks
Is the data on https://opensourceindex.io/ automatically updated? It still shows data from 2023, no new data from 2024.
Hello,
I examined https://opensourceindex.io/
What I did:
-Click Industry menu filter
-Uncheck "select all"
-Scroll down
-Check empty line between lines "Healthcare & Pharma" and "Public Sector"
What I got:
-Line for FARFETCH organization.
What I expect:
-I don't have empty lines inside Industry drop-down menu filter
I checked osci/preprocess/match_company/company_domain_match_list.yaml
I assume that the industry field was missed for the FARFETCH company
Possible solutions:
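If the diagnosis is right, one fix would be to add the missing industry field to the FARFETCH entry. The entry shape and industry value below are guesses and would need to be checked against the real company_domain_match_list.yaml schema:

```yaml
# Hypothetical entry shape - verify against the actual
# company_domain_match_list.yaml before applying.
- company: FARFETCH
  industry: Retail & Consumer Goods  # the field presumed missing
  domains:
    - farfetch.com
```

A validation step in CI that rejects entries without an industry field would prevent the same blank filter line from reappearing.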
I am collaborating with Duane O'Brien on a Digital Infrastructure Project called "Fostering Open Collaboration" (FOCUSED) which is researching open source program offices within companies.
I would like to access the data on the Open Source Contributor Index (OSCI) website.
Source GitHub repo: https://github.com/epam/OSCI
I am looking for guidance on how to obtain it, or a link to the data that appears there, as a .CSV file.
Thank you.
cc: @DuaneOBrien
Dear team,
I am trying to run a local OSCI installation to get some stats. After failing to install the tools on Ubuntu 22.04 LTS, I switched to 20.04 LTS and at least got the dependencies installed using pip and Python 3.8.
Now the first step of the pipeline is failing, i.e. running python3 osci-cli.py get-github-daily-push-events -d 2020-01-02
[2023-07-11 12:09:27,286] [ERROR] Failed to parse json: . Error: Expecting value: line 1 column 1 (char 0)
[2023-07-11 12:09:27,329] [INFO] Save push events commits for 2020-01-02 00:00:00 into file /data/landing/github/events/push/2020/01/02/2020-01-02-0.parquet
Traceback (most recent call last):
File "osci-cli.py", line 93, in <module>
cli(standalone_mode=False)
File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/sten/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/sten/Desktop/OSCI/osci/actions/base.py", line 59, in execute
return self._execute(**self._process_params(kwargs))
File "/home/sten/Desktop/OSCI/osci/actions/load/load.py", line 34, in _execute
return get_github_daily_push_events(day=day)
File "/home/sten/Desktop/OSCI/osci/crawlers/github/gharchive.py", line 34, in get_github_daily_push_events
DataLake().landing.save_push_events_commits(push_event_commits=push_events_commits, date=day)
File "/home/sten/Desktop/OSCI/osci/datalake/local/landing.py", line 42, in save_push_events_commits
log.info(f'Push events commits df info {get_pandas_data_frame_info(df)}')
File "/home/sten/Desktop/OSCI/osci/utils.py", line 46, in get_pandas_data_frame_info
df.info(buf=buf)
File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 2497, in info
mem_usage = self.memory_usage(index=True, deep=deep).sum()
File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 2590, in memory_usage
result = Series(self.index.memory_usage(deep=deep), index=["Index"]).append(
File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/series.py", line 305, in __init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/construction.py", line 465, in sanitize_array
subarr = construct_1d_arraylike_from_scalar(value, len(index), dtype)
File "/home/sten/.local/lib/python3.8/site-packages/pandas/core/dtypes/cast.py", line 1452, in construct_1d_arraylike_from_scalar
subarr = np.empty(length, dtype=dtype)
TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type
Would Docker be a stable environment to run this? My aim is to count GitHub contributions based on some email regexps.
Thanks!
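The final `TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type` is a known symptom of an older pandas release running against NumPy >= 1.20, where that pandas version touches since-removed NumPy internals. A plausible workaround (the exact pins are suggestions; the known-good pair should come from OSCI's requirements file) is to pin NumPy down or move pandas up:

```shell
# Either pin NumPy to a version the installed pandas was built against...
pip install 'numpy<1.20'
# ...or upgrade pandas to a release that supports newer NumPy:
pip install --upgrade 'pandas>=1.2'
```

Docker would indeed help here: an image built from the repository's pinned requirements gives a reproducible environment instead of whatever pip resolves on the host.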
I wanted to check if there are some problems with the data? Did you change anything or reduce the frequency of the updates? Or maybe is it just a bug?
Here are my current findings:
Hello,
I was looking for the country that the OSCI company would be headquartered in.
Would you know what's the best way to obtain this information?
Thank you.
For anyone else looking in on this issue: we collated the data for OSCI into a series of files that are available for anyone to download here: https://ststaticprodosciwebz2vmu.blob.core.windows.net/data/share/OSCI_change_ranking.zip
cc: @DuaneOBrien
https://opensourceindex.io/?company=mongoDB reports https://github.com/mongodb/mongo as a "top repo"
"Versions released prior to October 16, 2018 are published under the AGPL. All versions released after October 16, 2018, including patch fixes for prior versions, are published under the Server Side Public License (SSPL) v1"
Source: https://github.com/mongodb/mongo#license
Clicking on Industry to filter does not display any options.
Note: It works for Jan 2022.
The goal is to create and automate analysis of repos hosted on SourceForge (https://sourceforge.net/). This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.
We did a high-level technical analysis on the feasibility of making an OSCI for repos hosted on SourceForge. This is a summary of our findings:
Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc) |
---|---|---|
Is this site free to use for open source projects? | yes | |
Does it look like this site hosts many open source projects? | yes | "over 430,000 projects". Popular in the open source community. BUT, it hosts a lot of binaries and mirrors of repos which are primarily hosted on GitHub or elsewhere. |
Size of user base | - | "we host over 3.7 million registered users” |
Is there a public API we can query? | yes | |
API type | not studied yet | |
API URL | not studied yet | |
Query Limits (if any) | not studied yet | |
Is there a paid access with more information? | not studied yet | |
Is it possible to query the project license? | not studied yet | |
Is it possible to query commit events/commit counts by a user in a time period? | not studied yet | |
Is it possible to query email address or else some organization information for the person making a commit? | not studied yet | |
Is there a public archive we can use instead of the public API? | not studied yet | |
Any additional Information worth knowing? | not studied yet |
Hello,
My understanding is that OSCI considers commits made by an organization AFTER it has been added and any commits BEFORE will NOT be considered for ranking.
Example: I add Societe Generale in September. OSCI does NOT consider any commits made by Soc Gen before September and the ranking is given based on commits made from September to date.
Is that right?
The current OSCI implementation uses the email domain of the committer to identify their organization. Many developers do not use their company email address on GitHub, or do not make their email address public. However, many of these people do include their organizational information in their GitHub user profiles.
We would like to improve the identification of committers' organizations using the data in their user profiles.
We already made an experiment to do this, but with minimal success. This is described below.
The basic matching algorithm works like this:
If the basic algorithm does not produce a match, an extended algorithm was proposed:
Result of experiment:
For only 38% of all the users examined did we manage to match a company from their profile; the remaining profiles had no clear match. Milder matching rules add only a further 5%.
It is also worth noting that for the users whose company we did match from their profile, it was in every case the same company as the one derived from their email.
Finally, this method of identifying the company carries a large overhead. Implementing it would require downloading profile information for all users who made push events in 2020 (about 5M users as of June 2020), which would take roughly 42 days given GitHub API usage limits. We would also have to load new profiles every day, and those daily downloads may not fit within the usage limits either.
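As an illustration only, here is a minimal sketch of the kind of profile-based matching discussed above. The company list, normalization rules, and field handling below are assumptions for the sake of the example, not OSCI's actual rules:

```python
import re
from typing import Optional

# Hypothetical canonical company list; OSCI's real list lives in its
# company/domain configuration, not here.
KNOWN_COMPANIES = {"microsoft": "Microsoft", "epam systems": "EPAM", "google": "Google"}

# Assumed normalization: strip a leading '@' (GitHub org mention),
# lower-case, and drop common legal suffixes.
SUFFIX_RE = re.compile(r"\b(inc|ltd|llc|corp|corporation|gmbh)\.?\s*$")

def normalize(company_field: str) -> str:
    s = company_field.strip().lstrip("@").lower()
    return SUFFIX_RE.sub("", s).strip(" ,.")

def match_company(profile_company: Optional[str]) -> Optional[str]:
    """Return a canonical company name if the profile's 'company' field matches."""
    if not profile_company:
        return None
    return KNOWN_COMPANIES.get(normalize(profile_company))
```

Tightening or relaxing the normalization rules changes the match rate, which is what the 38% / +5% figures above measure.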
(Housekeeping - I move the original issue written here by @abitrolly into #8)
Enhance the OSCI algorithm to filter only projects with open-source licenses.
This will require some external datasets.
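As a sketch of what such a filter might look like. The allow-list and record shape below are assumptions; a real implementation would load license data from an external dataset such as the SPDX license list:

```python
# Hypothetical allow-list of SPDX-style license keys; a real filter would
# load these from an external dataset (e.g. the SPDX license list).
OPEN_SOURCE_LICENSES = {"mit", "apache-2.0", "gpl-3.0", "lgpl-2.1", "bsd-3-clause", "mpl-2.0"}

def filter_open_source(repos):
    """Keep only repos whose license key is in the open-source allow-list.

    `repos` is an iterable of dicts with a 'license' key, in the style of
    GitHub's repository.license.key; None means no detected license.
    """
    return [r for r in repos if (r.get("license") or "").lower() in OPEN_SOURCE_LICENSES]
```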
Please, guys, don't be cynical.
While OSCI is primarily a reputation tool, it could actually do some good things if it could provide a ground for companies to support and compete in this user story.
What companies in particular can do, if they would like to support poetry in some way, is give their employees a fixed amount of time they can spend supporting poetry. Having more people who can contribute on a regular basis would help us a lot. They don't necessarily need to start coding: looking through the issue tracker to find duplicates or outdated tickets, or answering questions there, is important as well and would give us more time to actually fix bugs or implement new features.
The goal is to create and automate analysis of repos hosted on BitBucket. This would be similar to our existing OSCI ranking which analyses repos hosted on GitHub, with a focus on the activity by commercial organizations.
We did a high-level technical analysis of the feasibility of making an OSCI for repos hosted on Bitbucket. This is a summary of our findings:
Criteria | Status (Yes/No) | Notes (e.g. about how it is possible, or limitations, etc) |
---|---|---|
Is this site free to use for open source projects? | yes | Seems to be free only for teams under 5 people unless you request a community license. |
Does it look like this site hosts many open source projects? | unclear | It's not clear that large numbers of open source projects are hosted. Most public projects seem to be non-commercial (not backed by companies). It appears most users do not use company domains (needs further investigation). The pages/repos of many companies appear to be inactive. |
Size of user base | - | In the order of 5,000,000 users |
Is there a public API we can query? | yes | |
API type | REST | |
API URL | https://api.bitbucket.org/2.0 | |
Query Limits (if any) | 1,000 per hour / 60,000 per hour | |
Is there a paid access with more information? | - (to be investigated) | |
Is it possible to query the project license? | Yes | |
Is it possible to query commit events/commit counts by a user in a time period? | Yes | /repositories?before=timestamp&after=timestamp e.g. https://api.bitbucket.org/2.0/repositories?after=2020-03-01T09%3A37%3A06.254721%2B00%3A00 |
Is it possible to query email address or else some organization information for the person making a commit? | Yes | email address |
Is there a public archive we can use instead of the public API? | no | |
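To make the table concrete, here is a small sketch that builds a paginated query URL for the endpoint above. Only URL construction is shown; the actual request, authentication, and rate-limit handling are out of scope:

```python
from urllib.parse import urlencode

BITBUCKET_API = "https://api.bitbucket.org/2.0"  # API URL from the table above

def repositories_url(after_iso: str, pagelen: int = 100) -> str:
    """Build a /repositories query URL using the `after` timestamp filter.

    `pagelen` sets the page size; Bitbucket paginates via a `next` link
    in each response, which a real crawler would follow.
    """
    return f"{BITBUCKET_API}/repositories?" + urlencode({"after": after_iso, "pagelen": pagelen})
```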
If you have additional questions, feel free to contact our team.
In this paragraph - https://github.com/epam/OSCI#where-can-i-see-the-latest-rankings
This project is sponsored by EPAM Systems and the latest results are visible on the EPAM SolutionsHub OSCI page.
It is better to clarify that no sponsorship is available to outside contributors.
While visualizing some of the data OSCI provides I noticed an unusual spike in the data starting end of October/beginning of November 2021 and potentially still ongoing. I was wondering, if there was a change to how active contributors or community are measured?
This graph shows the development for Google over the last 3 years. Besides the usual spikes at the beginning of each year, there is an additional spike in November, and after that the daily increase seems to be higher than in previous periods as well.
The spike is there for different companies and not specific for the above example.
Hi and thanks for this project :)
We recently raised a PR to add Expedia Group to the list.
Unfortunately we couldn't find the company in the latest report published today. Have we missed anything in the PR, or can we expect to make it into the next report?
I think it would be pertinent to include a filter that excludes contributions to the company's own open source projects.
As much as I enjoy seeing the numbers I feel like it would be amazing to see which companies contribute outside of their own circle of influence the most, this could shift the rankings somewhat and showcase a bit more of the open source community on the top lists.
Throwing this out there as an idea, absolutely understand if this is not relevant to this project but maybe something worth thinking about!
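A minimal sketch of the suggested filter. The company-to-org mapping and record fields here are hypothetical; OSCI would need real data on which GitHub orgs each company owns:

```python
# Hypothetical mapping from company name to the GitHub orgs it owns.
COMPANY_ORGS = {"Microsoft": {"microsoft", "azure", "dotnet"}}

def external_contributions(commits):
    """Drop commits a company made to repositories hosted by its own orgs.

    `commits` is an iterable of dicts with 'company' and 'repo' keys,
    where 'repo' is the usual 'org/name' slug.
    """
    result = []
    for c in commits:
        org = c["repo"].split("/", 1)[0].lower()
        if org not in COMPANY_ORGS.get(c["company"], set()):
            result.append(c)
    return result
```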
As mentioned in #122, I'm puzzled why SmartBear does not appear in the list, now that we're into the month of April.
We were added in v2022.03.0.
I'm not sure if this is user error on my part, so please let me know if there's something I'm misunderstanding, or we need to make some additional configuration.
Here's an example of a commit that I think should have been counted: https://github.com/cucumber/cucumber-js/commits/d98c6deabd39e1adb8e52a1a65324662143108e8
Hello,
My organization (Societe Generale), as of Feb 2022, seems to have a total community number of 7 (+5) and 0 active contributors. It is also indicated that we are down by 3 places.
I'd like to know why we are placed at 284 and not 276?
When the number of active contributors is equal, what criteria are considered in the ranking?
Please can you help with this query?
We plan to expand the scope of research.
We want to add two new reports:
- `OSCI_Languages_YTD`: a report on the number of the company's commits in each programming language since the beginning of the year.
- `OSCI_Licenses_YTD`: a report on the number of the company's commits in repositories under each license since the beginning of the year.

The reports will be generated by the `daily-osci-rankings` cli command. Example output:
company | language | commits |
---|---|---|
| | python | 50 |
| | go | 30 |
Microsoft | typescript | 40 |
Microsoft | powershell | 20 |
The report will be generated by the `daily-osci-rankings` cli command. Example output:
company | license | commits |
---|---|---|
| | apache-2.0 | 50 |
| | mit | 30 |
Microsoft | gpl-3.0 | 40 |
Microsoft | lgpl-2.1 | 20 |
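Both breakdowns could be produced from per-commit records with the same aggregation, parameterized by field. The field names here are assumptions about the enriched push-event data:

```python
from collections import Counter

def ytd_breakdown(commits, field):
    """Count commits per (company, field value), e.g. field='language' or 'license'."""
    counts = Counter((c["company"], c[field]) for c in commits)
    # Sort by company, then by descending commit count, mirroring the example tables.
    return sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1]))
```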
Hi,
Say developer X makes 10 commits in Jan 2021 and another 10 the next month (Feb 2021), and developer Y makes 10 commits in Feb 2021 only.
Does the active contributor number for the organization show 2 or 3?
I assume it is 2.
Please can someone help confirm.
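If "active contributors" counts distinct developers year to date (an assumption; the OSCI documentation defines the exact rule), then X is counted once even though they committed in two months, and the answer would be 2. A trivial sketch:

```python
def active_contributors(commits):
    """Count distinct commit authors for an organization over a period.

    Assumes 'author' identifies the developer; a contributor who commits
    in several months is still counted only once under this definition.
    """
    return len({c["author"] for c in commits})
```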