Comments (13)
... or at least you could start small and list the number of repositories that each organization's contributors contribute to.
If most of an organization's contributors contribute to a single repository or only a few, this is a good indication of where their efforts go. :)
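A minimal sketch of that count, assuming the commit data can be reduced to (organization, repository) pairs. That record shape and the sample names are assumptions for illustration, not OSCI's actual schema.

```python
from collections import defaultdict

def repos_per_org(commits):
    """Count distinct repositories touched by each organization's contributors.

    `commits` is an iterable of (org, repo) pairs (an assumed record shape).
    """
    repos = defaultdict(set)
    for org, repo in commits:
        repos[org].add(repo)  # a set ignores repeated commits to the same repo
    return {org: len(names) for org, names in repos.items()}

# Made-up sample data
sample = [
    ("ACME", "acme/x"),
    ("ACME", "acme/x"),
    ("ACME", "torvalds/linux"),
    ("Initech", "initech/tps"),
]
counts = repos_per_org(sample)
```

A low count for a large organization would suggest its contributors concentrate on a handful of repositories.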
from osci.
The idea is pretty interesting.
There are also a number of primary questions that arise before the implementation of this idea.
Main question:
How to identify the repositories in relation to the company (a company's own repository or not)?
One option is to use information about the GitHub organization (see OrgId). However, this requires a mapping between each company and the organizations that belong to it. Such a list would have to be created by hand for each company and constantly kept up to date. And even then, there is no certainty that this criterion is 100% valid.
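For illustration, the organization-based check could look roughly like this. The mapping, its entries and the function name are hypothetical; the only thing taken from GitHub is the "owner/name" form of a repository's full name.

```python
# Hand-maintained mapping from company to its GitHub organizations.
# Entries are made-up examples and are stored in lower case.
COMPANY_ORGS = {
    "ACME": {"acme", "acme-labs"},
}

def is_company_repo(company, repo_full_name, company_orgs=COMPANY_ORGS):
    """Return True if the repo's owner is one of the orgs listed for the company."""
    # "owner/name" -> "owner"; lower-cased because the mapping is lower case
    owner = repo_full_name.split("/", 1)[0].lower()
    return owner in company_orgs.get(company, set())
```

The check itself is trivial; the cost is entirely in keeping the mapping complete and current.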
Do you have any ideas on this?
from osci.
> How to identify the repositories in relation to the company (a company's own repository or not)?
- If the source repo belongs to the company. Maintaining official repo status is no different than maintaining an official list of domains.
- If all commits and merge requests are from the company
from osci.
> 1. If source repo belongs to company. Maintaining official repo status is no different than maintaining official list of domains.
I agree that at first glance, maintaining a list of repositories does not differ much from maintaining a list of companies. But the question arises about a significantly larger volume of repositories than companies and about greater dynamics of the list of repositories than domains.
> 2. If all commits and merge requests are from the company
I didn't quite understand what this means. Could you explain in a bit more detail?
Are you suggesting that a company's own repositories are those in which all commits come from the company? Is this a necessary and/or sufficient condition?
from osci.
> But the question arises about a significantly larger volume of repositories than companies and about greater dynamics of the list of repositories than domains.
It could turn out that the number of non-owned repositories that companies commit to is insignificant.
> I didn't quite understand what it meant. Could you explain a little more broadly?
A repo where all commits are from corporate emails is definitely owned by the company. That's a sufficient condition for a filter. )
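A rough sketch of that filter, assuming the author emails of a repo's commits and the company's list of domains are already available. The function and parameter names are made up for illustration.

```python
def all_commits_from_company(commit_emails, company_domains):
    """Return True if every commit author email uses one of the company's domains."""
    emails = list(commit_emails)
    if not emails:
        return False  # no commits, so nothing can be concluded
    domains = {d.lower() for d in company_domains}
    # Compare the part after the last "@" with the known corporate domains
    return all(e.rsplit("@", 1)[-1].lower() in domains for e in emails)
```

A single commit from a personal or hidden email makes the check fail, so this can only serve as a sufficient condition, not a necessary one.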
from osci.
Sorry for taking a while to respond, I simply don't have enough information on the workflow that OSCI uses (my bad) to elaborate further than what @abitrolly said. I would only ever consider a contribution to be in the company's full self-interest if the contribution landed on a repository that was owned by the company itself.
Is this a trivial task? Very unlikely, I think a "repo where all commits are from corporate emails" is too specific of a scenario and wouldn't affect the dataset very much (especially for the top dogs which is where my interest lies the most)...
We'd need a way to filter out contributions made from the organisation's own authors into the organisation's own repositories.
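Something along these lines, as a sketch only. It assumes each commit record carries the author's company and the repository's full name, and that the company's own GitHub orgs are known; the field names are hypothetical.

```python
def external_contributions(commits, company, company_orgs):
    """Keep commits by `company` authors that landed outside the company's own orgs.

    `commits` is a list of dicts with "company" and "repo" ("owner/name") keys,
    which is an assumed record shape.
    """
    own = {o.lower() for o in company_orgs}
    return [
        c for c in commits
        if c["company"] == company
        and c["repo"].split("/", 1)[0].lower() not in own
    ]

# Made-up sample data
commits = [
    {"company": "ACME", "repo": "acme/x"},
    {"company": "ACME", "repo": "kubernetes/kubernetes"},
    {"company": "Initech", "repo": "kubernetes/kubernetes"},
]
external = external_contributions(commits, "ACME", ["acme"])
```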
from osci.
> We'd need a way to filter out contributions made from the organisation's own authors into the organisation's own repositories.
I agree. That would be sufficient.
from osci.
I suggest the way to move forward on this issue is:
- pick a company at random
- look at the list of repos which OSCI is showing their employees contribute to
- try to define some logic (algorithm) defining which of these repos are "company repos" vs "non-company repos". As part of this task you will have to define what is a "company repo", that in itself will be challenging.
- Now pick another company at random and test the logic you came up with, refine it.
- And so on with additional companies until you have logic which appears to manage the general case.
It's important to understand that a perfect algorithm for this does not exist, just different directions to go, each with pros and cons. An empirical approach (if that's the right term) like I suggest above is necessary rather than defining a theoretical approach. Your goal has to be to iterate until you reach a logic which is "good enough" to show a general picture of activity across organizations. This was our experience defining the logic for OSCI itself. What looks easy at a high level gets very challenging when one tries to define the detail and algorithmize it.
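One way to make that iteration loop concrete is to label a few repos by hand for each company and check a candidate rule against them. The sketch below assumes a rule is just a function from repo info to True/False; the rule, the field names and the sample data are all made up.

```python
def evaluate(rule, labelled_repos):
    """Return (accuracy, misclassified) for `rule` on hand-labelled repos.

    `labelled_repos` is a list of (repo_info, is_company_repo) pairs.
    """
    wrong = [info for info, expected in labelled_repos if rule(info) != expected]
    total = len(labelled_repos)
    accuracy = (total - len(wrong)) / total if total else 0.0
    return accuracy, wrong

def owner_rule(info):
    """Candidate rule: a repo is a company repo if its owner is a known ACME org."""
    return info["owner"] in {"acme"}

# Hand-labelled sample for a hypothetical company
labelled = [
    ({"name": "acme/x", "owner": "acme"}, True),
    ({"name": "project-x/x", "owner": "project-x"}, True),
    ({"name": "kubernetes/kubernetes", "owner": "kubernetes"}, False),
]
accuracy, wrong = evaluate(owner_rule, labelled)
```

The misclassified list is the useful part: it shows which cases the next refinement of the rule has to handle.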
from osci.
> As part of this task you will have to define what is a "company repo", that in itself will be challenging.
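def outside_contributions(total_committed, employees_committed,
                          contractors_committed, robots_committed):
    """Return True if some commits come from outside the company."""
    # Commits left after subtracting employees, contractors and robots
    outside = (total_committed - employees_committed
               - contractors_committed - robots_committed)
    return outside > 0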
from osci.
Let's take company ACME. It creates and runs project X. This project is not under the ACME org on github, so programmatically not directly connectable to the company. The project has 100 contributors, 99 who work at ACME and 1 who is outside (perhaps it is an ex-employee who worked on this before leaving the company and continued after... I have seen such examples). Is this a company project?
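To make the ACME case concrete, here is a small sketch that compares a strict "everyone is from the company" rule with a share threshold. The 0.9 threshold and the data layout are arbitrary assumptions, chosen only to show how the two rules disagree.

```python
def company_share(contributor_companies, company):
    """Fraction of a project's contributors affiliated with `company`."""
    if not contributor_companies:
        return 0.0
    matching = sum(1 for c in contributor_companies if c == company)
    return matching / len(contributor_companies)

# Project X: 99 ACME contributors and 1 outsider (None = no known affiliation)
contributors = ["ACME"] * 99 + [None]
share = company_share(contributors, "ACME")

strict = share == 1.0   # "all contributors are from the company"
relaxed = share >= 0.9  # 0.9 is an arbitrary illustrative threshold
```

The strict rule rejects project X because of the single outside contributor, while the threshold rule accepts it.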
from osci.
What could be the simplest, and probably not the most accurate, insight? While getting perfect stats sounds sweet, most likely we will not get there right away. So... what could be done right now to make the index 1% better?
How about CLAs? Could those be considered an indication? If a repo requires contributors to submit a CLA, could it be considered org X's repository?
Could a manual PR process be implemented to meta-tag the repos? Like... the community could submit PRs to this repository to mark indexed repos as belonging to one category or the other, and even augment the metadata. While a fully automated process is neat... I think we are mostly interested in something like 2-5K public repositories, and those could definitely be meta-tagged manually over time.
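As a sketch of what such a community-maintained tag file could look like, here is a small JSON example with a lookup helper. The file layout, field names and categories are hypothetical; nothing like this exists in the repository as far as this thread shows.

```python
import json

# Hypothetical tag file that contributors would edit through PRs
TAGS_JSON = """
{
  "acme/x": {"owner_company": "ACME", "category": "company"},
  "kubernetes/kubernetes": {"owner_company": null, "category": "community"}
}
"""

def load_tags(text):
    """Parse the tag file and normalise repo names to lower case."""
    return {name.lower(): meta for name, meta in json.loads(text).items()}

def category_of(repo, tags, default="unknown"):
    """Return the manually assigned category, or `default` if the repo is untagged."""
    return tags.get(repo.lower(), {}).get("category", default)

tags = load_tags(TAGS_JSON)
```

Untagged repos fall back to "unknown", so automated heuristics could still fill the gaps.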
from osci.
Maybe the priority should be to publish the data that would make different kinds of filters possible. Right now the site https://opensourceindex.io/ just links to this repo, with no diagrams of the DB schema and no information on whether the BigQuery datasets are public.
from osci.
At our company, internally we gather public data on GitHub activity from employees who choose to opt-in regarding their GitHub activity and contributions, with the goal of identifying trends in contributions to projects outside of Microsoft's governance. Our data is skewed differently than this index, however, since we have an internal indicator of who our employees are on GitHub once they opt-in to tell us, vs having to determine it from profiles.
Our numbers for December 2021, for example, are significantly higher for 'total community' and other figures as a result of so many people being e-mail private on GitHub... but of that specific month's contributions, I tried pulling equivalent data, and around a third of our actively-open-contributing employees contributed to projects not governed by our company, yielding a number higher than the index but not majorly larger.
While the data is interesting, our key reason for differentiating "is it controlled by Microsoft or not" is to help encourage our employees' participation in communities to become eligible in our FOSS Fund and to evolve the culture.
I agree slicing off a company's controlled projects is an interesting pivot, but a murky gray area, especially given foundations and cross-industry collaborations and so on.
from osci.