Comments (16)
Reread the docs. It's pretty clear I need to download data first from the GH Archive. I think that's what I was missing. Thanks, Vlad.
from osci.
OMG. That's probably the most overengineered piece of Python code I've ever seen. :D I wonder if it is autogenerated?..
The path to data is actually hardcoded here.
OSCI/__app__/datalake/local/base.py
Line 26 in f73e484
In your case it should be /Users/richard/src/OSCI/__app__/data and not /data. The code that fails to calculate the BASE_PATH:
OSCI/__app__/datalake/local/base.py
Lines 46 to 47 in f73e484
Unless BASE_AREA_DIR is set to some long ../../../.. pattern, this code should not fail.
There could be another explanation for this magic config override, and it is directly related to the overengineered code: it looks like somebody tried to apply the singleton pattern to the DataLake class in a rather straightforward Java way. That means whoever called DataLake() first could probably configure it and set its path to /data elsewhere. Unfortunately, digging out where this happens requires a debugger.
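To illustrate the "first caller wins" behaviour suspected above, here is a minimal sketch. The Lake class is a hypothetical stand-in written for this comment, not the actual OSCI DataLake implementation, and the /data default is only assumed for the demonstration.

```python
class Lake:
    """Hypothetical stand-in for a singleton-style data lake (not OSCI code)."""

    _instance = None

    def __new__(cls, base_path="/data"):
        # Only the very first call creates and configures the instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.base_path = base_path
        # Every later call returns the same object and ignores its argument.
        return cls._instance


first = Lake()  # some early caller configures the default path
second = Lake("/Users/richard/src/OSCI/data")  # this path is silently ignored

print(first is second)
print(second.base_path)
```

If the real code behaves like this, a path passed later has no effect, which would explain why the failure still mentions /data.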
from osci.
@RichardLitt these paths are automatically generated based on the config file. So you need to change your local.yml to use the absolute path to the directory that should contain the data.
For example, in
OSCI/__app__/config/files/default.yml
Line 5 in f73e484
change base_path: '/data' to base_path: '/Users/richard/src/OSCI/data', or whatever other path you want.
@abitrolly this path does not come from
OSCI/__app__/datalake/local/base.py
Line 26 in f73e484
It gets path from config
from osci.
It gets path from config
@vlad-isayko so where is the code that does this?
from osci.
@vlad-isayko after the app is installed, it will not be able to look for local.yml in checkout anymore. What is the supposed location for that file in that case?
from osci.
@abitrolly ,
Setup config for local fs
Lines 70 to 85 in 37535ea
Initiate local data lakes
OSCI/__app__/datalake/datalake.py
Lines 43 to 47 in 37535ea
Get config passed to the constructor
OSCI/__app__/datalake/local/base.py
Lines 25 to 33 in 37535ea
from osci.
@vlad-isayko after the app is installed, it will not be able to look for local.yml in checkout anymore. What is the supposed location for that file in that case?
local.yml is really not included in the repository, but intentionally, since it is meant as a configuration for personal test runs. And for production launches, we suggest using the transfer of secrets and configurations from environment variables.
There is a file for this, prod.yml. It describes from which environment variables the values will be taken.
The source of values is described through the value 'env'
OSCI/__app__/config/files/prod.yml
Lines 1 to 2 in 37535ea
So, for example, the value of container will be requested from the environment variable osci_landing_container
OSCI/__app__/config/files/prod.yml
Lines 7 to 9 in 37535ea
Secrets from Databricks, which are passed through the dbutils module (a proprietary module for Spark clusters in the Databricks environment), can also act as a source of values. An example can be found in prod-cluster.yml
from osci.
Thanks for the clarifications. The configuration code raises many questions.
And for production launches, we suggest using the transfer of secrets and configurations from environment variables.
Then it would be worth documenting them at https://github.com/epam/OSCI#configuration
Why can't environment variables be used for testing as well?
from osci.
OMG. That's probably the most overengineered piece of Python code I've ever seen. :D I wonder if it is autogenerated?..
This is really not helpful. Please, be respectful. People have worked really hard on this code, and it does some really important work.
@vlad-isayko Thank you! Should I download data from somewhere, first? Is the data included in this repo?
from osci.
@RichardLitt so how would you say that the code is overengineered and ask if it is autogenerated?
from osci.
Depends on what date you want to get results for.
All our reports are YTD, that is, the data is counted from the beginning of the year to the required date. For example, for February 13, 2021, it is necessary to download and process data for all dates starting from January 1, 2021.
So for each day, you need to sequentially run several commands:
For example for January 1, 2021
# Load push events for 2021-01-01
python3 osci.py get-github-daily-push-events -d 2021-01-01
# Adds a company field for each commit and filters out those non-company commits
python3 osci.py process-github-daily-push-events -d 2021-01-01
# Highlights repositories that had company commits that day
python3 osci.py daily-active-repositories -d 2021-01-01
# Load info from Github API about repositories that had company commits that day
python3 osci.py load-repositories -d 2021-01-01
# Clears company commits from those commits that were sent to repositories without licenses
# We assumed that the availability of licenses is a factor of belonging to OpenSource (factor suggested by Red Hat https://www.redhat.com/en/topics/open-source/what-is-open-source-software#:~:text=Open%20source%20software%20is%20released,legally%20available%20to%20end%2Dusers.)
python3 osci.py filter-unlicensed -d 2021-01-01
# Builds OSCI Ranking and OSCI Commits Ranking reports for January 1, 2021
python3 osci.py daily-osci-rankings -td 2021-01-01
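Since the same six commands have to be repeated for every day from January 1 up to the report date, the sequence can be generated rather than typed by hand. The following is only a sketch: the helper name ytd_commands is made up for illustration, the command names and flags are copied from the list above, and it assumes that every step, including the rankings, is run for each day as described.

```python
from datetime import date, timedelta

# Sub-commands and their date flags, in the order listed above.
STEPS = [
    ("get-github-daily-push-events", "-d"),
    ("process-github-daily-push-events", "-d"),
    ("daily-active-repositories", "-d"),
    ("load-repositories", "-d"),
    ("filter-unlicensed", "-d"),
    ("daily-osci-rankings", "-td"),
]


def ytd_commands(target):
    """Build the osci.py command lines for every day from Jan 1 to target."""
    commands = []
    day = date(target.year, 1, 1)
    while day <= target:
        for name, flag in STEPS:
            commands.append(["python3", "osci.py", name, flag, day.isoformat()])
        day += timedelta(days=1)
    return commands


# 44 days (Jan 1 to Feb 13) times 6 steps per day.
print(len(ytd_commands(date(2021, 2, 13))))  # 264
```

Each entry is an argument list that could be passed to subprocess.run(cmd, check=True) from the repository root, so that the run stops at the first failing step.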
from osci.
Change base_path: '/data' to base_path: '/Users/richard/src/OSCI/data' or something else whatever path you want.
I don't currently have the data. How do I download it? Is that what you're referring to, above?
Is there any way to get data from before 2021?
from osci.
@abitrolly Asking if something is overengineered and autogenerated could be seen as a value judgement, by you, of the quality of the code. Someone has worked hard at that code. Asking "Hey, I'm having trouble finding the relevant areas in the code" is much kinder, because it makes the issue about you and not about their code. I always assume that if there's something I can't understand, it's because I am missing some information - which means that we can work together to solve that problem for others. Claiming that code is confusing is putting the blame on the other party, which isn't a good way to start a conversation for the maintainer. Anyone responding will often be doing so on their own time, so it's kind to make sure that they want to help you.
from osci.
@RichardLitt while I agree with you, I am biased in thinking that this repository is not an open source project in a community sense, and that all the work being done here is paid for by an outsourcing corporation that needs this project for marketing purposes. It doesn't make me a good person to treat paid developers differently than free-time maintainers, but at least they get compensated for their time. It is kind of a poor man's rant about those who are better off in a walled garden. Sorry about that.
from osci.
@vlad-isayko I'm sorry that the conversation has been derailed. I appreciate you and your work.
Back to the issue at hand: I don't have any data locally. Where do I get it? Am I missing something?
from osci.
To clarify - I believe you tried to answer this question above, but the first command, python3 osci.py get-github-daily-push-events -d 2021-01-01, also doesn't work without a /data folder.
from osci.