Unable to get data (CLOSED)

epam avatar epam commented on May 26, 2024
Unable to get data

from osci.

Comments (16)

RichardLitt avatar RichardLitt commented on May 26, 2024

Reread the docs. It's pretty clear I need to download data first from the GH Archive. I think that's what I was missing. Thanks, Vlad.

from osci.

abitrolly avatar abitrolly commented on May 26, 2024

OMG. That's probably the most overengineered piece of Python code I've ever seen. :D I wonder if it is autogenerated?..

The path to data is actually hardcoded here.

BASE_PATH = Path(__file__).parent.parent.parent.resolve() / 'data'

In your case it should be /Users/richard/src/OSCI/__app__/data and not /data. This is the code that fails, apparently because BASE_PATH is not calculated as expected:

def _github_events_commits_base(self) -> Union[str, Path]:
    return self.BASE_PATH / self.BASE_AREA_DIR / 'github' / 'events' / 'push'

Unless BASE_AREA_DIR is set to some long ../../../.. pattern, this code should not fail.
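
For illustration, this is how pathlib composes such a path. It is a minimal sketch, not OSCI code: the function name push_events_dir and the 'landing' area value are made up, and PurePosixPath is used only so the result looks the same on any OS. It also shows one pathlib behaviour that may be relevant here: an absolute component discards everything joined before it.

```python
from pathlib import PurePosixPath


def push_events_dir(base_path, base_area_dir):
    """Join the parts in the same order as the method quoted above (hypothetical helper)."""
    return PurePosixPath(base_path) / base_area_dir / 'github' / 'events' / 'push'


# A base path inside the checkout stays inside the checkout.
print(push_events_dir('/Users/richard/src/OSCI/data', 'landing'))

# A base path of '/data' points at the filesystem root.
print(push_events_dir('/data', 'landing'))

# An absolute area directory replaces the base path entirely.
print(push_events_dir('/Users/richard/src/OSCI/data', '/data'))
```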

There could be another explanation for this magic config override, and it is directly related to the overengineered code: it looks like somebody tried to apply the singleton pattern to the DataLake class in a rather straightforward Java way. That means whoever called DataLake() first could have configured it and set its path to /data elsewhere. Unfortunately, finding where this happens requires a debugger.
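
To make the concern concrete, here is a minimal sketch of a Java-style singleton in Python. This is a hypothetical class written for illustration, not the actual OSCI DataLake; it only shows how the first caller's configuration would win.

```python
class DataLake:
    """Hypothetical singleton, NOT the OSCI implementation."""
    _instance = None

    def __new__(cls, base_path=None):
        # Only the first call creates and configures the instance.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.base_path = base_path or 'data'
        return cls._instance


first = DataLake(base_path='/data')
second = DataLake(base_path='/Users/richard/src/OSCI/data')

print(first is second)    # both names refer to the same object
print(second.base_path)   # the later argument was silently ignored
```

With a pattern like this, the effective path depends on call order rather than on the arguments at the call site, which is why the place where it is first configured can be hard to find by reading the code.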

from osci.

vlad-isayko avatar vlad-isayko commented on May 26, 2024

@RichardLitt these paths are generated automatically from the config file. So you need to set base_path in your local.yml to the absolute path of the directory where you want the data to be stored.
For example:

base_path: '/data'

Change base_path: '/data' to base_path: '/Users/richard/src/OSCI/data' or whatever path you want.

@abitrolly this path does not come from

BASE_PATH = Path(__file__).parent.parent.parent.resolve() / 'data'

It gets the path from the config.

from osci.

abitrolly avatar abitrolly commented on May 26, 2024

It gets path from config

@vlad-isayko so where is the code that does this?

from osci.

abitrolly avatar abitrolly commented on May 26, 2024

@vlad-isayko after the app is installed, it will not be able to look for local.yml in the checkout anymore. What is the intended location for that file in that case?

from osci.

vlad-isayko avatar vlad-isayko commented on May 26, 2024

@abitrolly,
The config for the local file system is set up here:

class LocalFileSystemConfig(FileSystemConfig):
    @property
    def base_path(self) -> str:
        return self.file_system_cfg.get('base_path')

    @property
    def landing_props(self) -> Dict[str, Any]:
        return dict(base_path=self.base_path, base_area_dir=self.landing_container)

    @property
    def staging_props(self) -> Dict[str, Any]:
        return dict(base_path=self.base_path, base_area_dir=self.staging_container)

    @property
    def public_props(self) -> Dict[str, Any]:
        return dict(base_path=self.base_path, base_area_dir=self.public_container)

The local data lakes are initialized here:

@staticmethod
def __get_local_data_lakes() -> Tuple[LocalLandingArea, LocalStagingArea, LocalPublicArea]:
    return (LocalLandingArea(**Config().file_system.landing_props),
            LocalStagingArea(**Config().file_system.staging_props),
            LocalPublicArea(**Config().file_system.public_props))

The config values are passed to the constructor:

class LocalSystemArea(BaseDataLakeArea):
    BASE_PATH = Path(__file__).parent.parent.parent.resolve() / 'data'
    FS_PREFIX = 'file'
    BASE_AREA_DIR = None

    def __init__(self, base_path=BASE_PATH, base_area_dir=BASE_AREA_DIR):
        super().__init__()
        self.BASE_PATH = Path(base_path)
        self.BASE_AREA_DIR = base_area_dir
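
The effect of the three snippets above can be sketched in isolation. This is a reduced, hypothetical stand-in (the class name Area, the 'checkout' default, and the props values are assumptions for illustration): the class-level BASE_PATH is only a default, and whatever the config passes to the constructor replaces it on the instance.

```python
from pathlib import Path


class Area:
    """Hypothetical stand-in for LocalSystemArea, reduced to the path logic."""
    BASE_PATH = Path('checkout') / 'data'  # class-level default (made-up value)
    BASE_AREA_DIR = None

    def __init__(self, base_path=BASE_PATH, base_area_dir=BASE_AREA_DIR):
        # Instance attributes shadow the class-level defaults.
        self.BASE_PATH = Path(base_path)
        self.BASE_AREA_DIR = base_area_dir


# Assumed shape of what landing_props would return for base_path: '/data'.
props = dict(base_path='/data', base_area_dir='landing')

configured = Area(**props)
default = Area()

print(configured.BASE_PATH)  # comes from the config values
print(default.BASE_PATH)     # falls back to the class-level default
```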

from osci.

vlad-isayko avatar vlad-isayko commented on May 26, 2024

@vlad-isayko after the app is installed, it will not be able to look for local.yml in checkout anymore. What is the supposed location for that file in that case?

local.yml is indeed not included in the repository, and that is intentional: it is meant as a configuration for personal test runs. For production launches, we suggest passing secrets and configuration through environment variables.

There is a file for this: prod.yml.

It describes which environment variables the values are taken from.

The source of values is set through config_source, here 'env':

meta:
  config_source: 'env'

So, for example, the value of container will be read from the environment variable osci_landing_container:

areas:
  landing:
    container: 'osci_landing_container'
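
As a rough sketch of that idea (not the OSCI implementation): when config_source is 'env', each leaf value in the config is treated as the name of an environment variable and replaced by that variable's value. The function name resolve, the dict layout, and the value 'landing' are assumptions for illustration.

```python
import os


def resolve(config, environ=os.environ):
    """Replace leaf values with environment variable values when config_source is 'env'."""
    if config.get('meta', {}).get('config_source') != 'env':
        return config

    def walk(node):
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        # A leaf is the NAME of an environment variable; None if it is unset.
        return environ.get(node)

    # The 'meta' section describes the config itself, so it is left untouched.
    return {key: (value if key == 'meta' else walk(value))
            for key, value in config.items()}


config = {
    'meta': {'config_source': 'env'},
    'areas': {'landing': {'container': 'osci_landing_container'}},
}

# A plain dict stands in for the real environment here.
resolved = resolve(config, environ={'osci_landing_container': 'landing'})
print(resolved['areas']['landing']['container'])
```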

Secrets from Databricks, which are passed through the dbutils module (a proprietary module for Spark clusters in the Databricks environment), can also act as a source of values. An example can be found in prod-cluster.yml.

from osci.

abitrolly avatar abitrolly commented on May 26, 2024

Thanks for the clarifications. The configuration code raises many questions.

And for production launches, we suggest using the transfer of secrets and configurations from environment variables.

Then it would be worth documenting them at https://github.com/epam/OSCI#configuration
Why can't environment variables be used for testing as well?

from osci.

RichardLitt avatar RichardLitt commented on May 26, 2024

@abitrolly:

OMG. That's probably the most overengineered piece of Python code I've ever seen. :D I wonder if it is autogenerated?..

This is really not helpful. Please, be respectful. People have worked really hard on this code, and it does some really important work.

@vlad-isayko Thank you! Should I download data from somewhere, first? Is the data included in this repo?

from osci.

abitrolly avatar abitrolly commented on May 26, 2024

@RichardLitt so how would you say that the code is overengineered and ask if it is autogenerated?

from osci.

vlad-isayko avatar vlad-isayko commented on May 26, 2024

@RichardLitt

Depends on what date you want to get results for.

All our reports are YTD: the data is counted from the beginning of the year to the required date. For example, for February 13, 2021, you need to download and process data for all dates starting from January 1, 2021.

So for each day, you need to run several commands in sequence.

For example, for January 1, 2021:

# Load push events for 2021-01-01
python3 osci.py get-github-daily-push-events -d 2021-01-01

# Adds a company field to each commit and filters out non-company commits
python3 osci.py process-github-daily-push-events -d 2021-01-01

# Highlights repositories that had company commits that day
python3 osci.py daily-active-repositories -d 2021-01-01

# Load info from GitHub API about repositories that had company commits that day
python3 osci.py load-repositories -d 2021-01-01

# Removes company commits that were pushed to repositories without licenses
# We assume that having a license is a criterion for being open source (criterion suggested by Red Hat: https://www.redhat.com/en/topics/open-source/what-is-open-source-software#:~:text=Open%20source%20software%20is%20released,legally%20available%20to%20end%2Dusers.)
python3 osci.py filter-unlicensed -d 2021-01-01

# Builds OSCI Ranking and OSCI Commits Ranking reports for January 1, 2021
python3 osci.py daily-osci-rankings -td 2021-01-01
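
Since the same sequence has to be repeated for every day of the year up to the target date, the command lines can be generated rather than typed. The following is only a sketch: the helper name ytd_commands is made up, the subcommands and flags are copied from the list above, and it assumes all six commands are run for each day as described. It only builds the command lines; each one could then be passed to subprocess.run from the directory that contains osci.py.

```python
from datetime import date, timedelta

# (subcommand, date flag) pairs, copied from the commands listed above.
DAILY_COMMANDS = [
    ('get-github-daily-push-events', '-d'),
    ('process-github-daily-push-events', '-d'),
    ('daily-active-repositories', '-d'),
    ('load-repositories', '-d'),
    ('filter-unlicensed', '-d'),
    ('daily-osci-rankings', '-td'),
]


def ytd_commands(target):
    """Return the command lines for every day from January 1 to target, inclusive."""
    day = date(target.year, 1, 1)
    commands = []
    while day <= target:
        for name, flag in DAILY_COMMANDS:
            commands.append(['python3', 'osci.py', name, flag, day.isoformat()])
        day += timedelta(days=1)
    return commands


commands = ytd_commands(date(2021, 2, 13))
print(len(commands))
print(' '.join(commands[0]))
print(' '.join(commands[-1]))
```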

from osci.

RichardLitt avatar RichardLitt commented on May 26, 2024

Change base_path: '/data' to base_path:'/Users/richard/src/OSCI/data' or something else whatever path you want.

I don't currently have the data. How do I download it? Is that what you're referring to, above?

Is there any way to get data from before 2021?

from osci.

RichardLitt avatar RichardLitt commented on May 26, 2024

@abitrolly Asking if something is overengineered and autogenerated could be seen as a value judgement, by you, of the quality of the code. Someone has worked hard at that code. Asking "Hey, I'm having trouble finding the relevant areas in the code" is much kinder, because it makes the issue about you and not about their code. I always assume that if there's something I can't understand, it's because I am missing some information - which means that we can work together to solve that problem for others. Claiming that code is confusing is putting the blame on the other party, which isn't a good way to start a conversation for the maintainer. Anyone responding will often be doing so on their own time, so it's kind to make sure that they want to help you.

from osci.

abitrolly avatar abitrolly commented on May 26, 2024

@RichardLitt while I agree with you, I am biased in thinking that this repository is not an open source project in a community sense, and that all the work being done here is paid for by an outsourcing corporation that needs this project for marketing purposes. It doesn't make me a good person to treat paid developers differently than free-time maintainers, but at least they get compensated for their time. It is kind of a poor man's rant about those who are better off in a walled garden. Sorry about that.

from osci.

RichardLitt avatar RichardLitt commented on May 26, 2024

@vlad-isayko I'm sorry that the conversation has been derailed. I appreciate you and your work.


Back to the issue at hand: I don't have any data locally. Where do I get it? Am I missing something?

from osci.

RichardLitt avatar RichardLitt commented on May 26, 2024

To clarify - I believe you tried to answer this question above, but the first command, python3 osci.py get-github-daily-push-events -d 2021-01-01, also doesn't work without a /data folder.

from osci.
