ecosyste-ms / ost

A curated list of open technology projects to sustain a stable climate, energy supply, biodiversity and natural resources, based on data from https://opensustain.tech

Home Page: https://ost.ecosyste.ms

License: GNU Affero General Public License v3.0

Dockerfile 0.51% Ruby 67.66% Procfile 0.07% JavaScript 0.08% SCSS 0.06% HTML 31.62%

ost's Introduction

A curated list of open technology projects to sustain a stable climate, energy supply, biodiversity and natural resources, based on data from https://opensustain.tech/

This project is part of Ecosyste.ms: Tools and open datasets to support, sustain, and secure critical digital infrastructure.

API

Documentation for the REST API is available here: https://ost.ecosyste.ms/docs

The default rate limit for the API is 5,000 requests per hour, based on your IP address. Get in contact if you need to increase your rate limit.
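As a rough illustration of staying within that limit, the sketch below pages through a list endpoint and pauses between requests. The `/api/v1` base path, the `page` and `per_page` parameter names and the helper names are assumptions made for this example; check the API documentation for the actual routes and parameters.

```javascript
// Hypothetical base URL; confirm the real routes at https://ost.ecosyste.ms/docs.
const BASE_URL = "https://ost.ecosyste.ms/api/v1";

// Build a paginated URL. `page` and `per_page` are assumed parameter names.
function buildPageUrl(path, page, perPage = 100) {
  const url = new URL(`${BASE_URL}${path}`);
  url.searchParams.set("page", String(page));
  url.searchParams.set("per_page", String(perPage));
  return url.toString();
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch pages until an empty page comes back, pausing between requests.
// 5,000 requests per hour works out to one request every 720 ms on average.
async function fetchAllPages(path, { fetchImpl = fetch, delayMs = 750 } = {}) {
  const results = [];
  for (let page = 1; ; page += 1) {
    const response = await fetchImpl(buildPageUrl(path, page));
    if (!response.ok) throw new Error(`Request failed with status ${response.status}`);
    const items = await response.json();
    if (items.length === 0) break;
    results.push(...items);
    await sleep(delayMs);
  }
  return results;
}
```

Passing `fetchImpl` explicitly makes it possible to try the paging logic against a stub before pointing it at the live API.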

Development

For development and deployment documentation, check out DEVELOPMENT.md

Contribute

Please do! The source code is hosted at GitHub. If you want something, open an issue or a pull request.

If you want to contribute but don't know where to start, take a look at the issues tagged as "Help Wanted".

You can also help triage issues. This can include reproducing bug reports, or asking for vital information such as version numbers or reproduction instructions.

Finally, this is an open source project. If you would like to become a maintainer, we will consider adding you if you contribute frequently to the project. Feel free to ask.

For other updates, follow the project on Twitter: @ecosyste_ms.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Add tests for it. This is important so we don't break it in a future version unintentionally.
  • Send a pull request. Bonus points for topic branches.

Vulnerability disclosure

We support and encourage security research on Ecosyste.ms under the terms of our vulnerability disclosure policy.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Copyright

Code is licensed under the GNU Affero General Public License v3.0 © 2023 Andrew Nesbitt.

Data from the API is licensed under CC BY-SA 4.0.

ost's Issues

Add namespace support for curated lists

Some very modular projects are managed in an entire namespace created just for that project. It might be helpful to allow the listing of whole namespaces. Unfortunately, this might break many of the conventions we have so far.

  • How do we deal with projects in a namespace that are not open source?
  • Some projects in a namespace might just be bad.
  • Some projects might be very old.

So far we have tried to add the core repository, but in many cases this does not work when projects are very modular. Here is a well known example: https://github.com/IPCC-WG1

I do not know of a solution for this yet.

Missing projects in the download statistics

I compared the plots we did about Python dependence last year with the new dataset. Here are some projects that are missing from the statistics and should be visible:

  1. https://pypi.org/project/xarray/ --> is listed but the download numbers are not correct: https://pypistats.org/packages/xarray
  2. https://pypi.org/project/plantcv/
  3. https://pypi.org/project/codecarbon/
  4. https://pypistats.org/packages/tsam
  5. https://github.com/powsybl/pypowsybl
  6. https://pypi.org/project/climetlab/
  7. https://pypi.org/project/intake-esm/
  8. https://github.com/Unidata/siphon
  9. https://github.com/oemof/oemof

I will also go through the other programming languages at random and filter for the projects with the most stars.

Avoid returning null values for certain fields

As we discussed briefly earlier, certain fields are returning null (sometimes).

This goes for:

  • "name" (see nsidc/earthaccess)
  • "language" (see softwareunderground/awesome-open-geoscience, metno/emep-ctm)
  • "category" (see nsidc/earthaccess)

If they could return "N/A" or some other static value when null (if that's supposed to happen) that'd be great.
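Until the API changes, a client can apply the same default itself. A minimal sketch, assuming a flat project object; `withFallbacks` and the placeholder value are illustrative names, not part of the API:

```javascript
// Fields named in this issue that sometimes come back as null.
const FALLBACK_FIELDS = ["name", "language", "category"];

// Return a copy of `project` where missing values for those fields
// are replaced by a static placeholder.
function withFallbacks(project, placeholder = "N/A") {
  const result = { ...project };
  for (const field of FALLBACK_FIELDS) {
    // `== null` matches both null and undefined, but not "" or 0.
    if (result[field] == null) result[field] = placeholder;
  }
  return result;
}

const patched = withFallbacks({ name: null, language: "Python", url: "https://example.org" });
console.log(patched);
```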

Naming of timestamps is unclear

I took a deep dive into the metadata over the last few days and noticed that the naming of timestamps is often unclear and that the same name is used multiple times across the API. It might be helpful for users if every value had a unique name across the API. For example:

  • created_at -> entry_created_at
  • last_sync -> repository_last_sync
  • pushed_at -> latest_push_to_repo_at

I would also expose the pushed_at value at the top level, as this is likely to be a very important value for most users, showing the last time the project had any active development activity.
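The renaming could be prototyped on the client side first. The sketch below applies the proposed names and lifts `pushed_at` to the top level; the assumption that `pushed_at` sits on a nested `repository` object, and both function names, are illustrative only.

```javascript
// Proposed renames from this issue: current key -> clearer key.
const TIMESTAMP_RENAMES = new Map([
  ["created_at", "entry_created_at"],
  ["last_sync", "repository_last_sync"],
  ["pushed_at", "latest_push_to_repo_at"],
]);

// Return a shallow copy of `record` with the timestamp keys renamed.
function renameTimestamps(record) {
  const renamed = {};
  for (const [key, value] of Object.entries(record)) {
    renamed[TIMESTAMP_RENAMES.get(key) ?? key] = value;
  }
  return renamed;
}

// Assumption: `pushed_at` lives on a nested `repository` object.
function exposePushedAt(project) {
  return {
    ...renameTimestamps(project),
    latest_push_to_repo_at: project.repository?.pushed_at ?? null,
  };
}
```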

Add issues under Projects

Allow open issues, as returned by the /issues/ endpoint, to be listed under the individual projects when pulling the /projects/ endpoint.

Most popular dependencies of projects in each category

For each category, get a list of direct dependencies for each project and group them up to show the top 50 most used dependencies.

We may need to filter out some very popular dependencies that show up for every category, potentially making a top 20 overall that can include them instead.
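The grouping step might look roughly like the sketch below. The input shape (a `category` string and a `dependencies` array of package names per project) is an assumption for illustration, not the actual data model.

```javascript
// Count how many projects in each category depend directly on each package,
// then keep the `limit` most used packages per category.
// Assumed input shape: [{ category: string, dependencies: string[] }, ...]
function topDependenciesByCategory(projects, limit = 50) {
  const counts = new Map(); // category -> Map(dependency -> project count)
  for (const { category, dependencies } of projects) {
    if (!counts.has(category)) counts.set(category, new Map());
    const perCategory = counts.get(category);
    // A Set avoids counting a dependency twice for the same project.
    for (const dep of new Set(dependencies)) {
      perCategory.set(dep, (perCategory.get(dep) ?? 0) + 1);
    }
  }
  const result = {};
  for (const [category, perCategory] of counts) {
    result[category] = [...perCategory.entries()]
      // Most used first; ties are broken alphabetically for stable output.
      .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
      .slice(0, limit)
      .map(([name, count]) => ({ name, count }));
  }
  return result;
}
```

Dependencies that rank highly in every category could then be moved into a separate overall list before display.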

Bots may be counted as external users

I noticed that some projects are listed as having an external contributor (pull request or issue where the author association is not OWNER or MEMBER) but the only external contributor is a bot.

Example: https://github.com/suptower/weather-cli which has one external contributor but it's dependabot[bot]

This should be fixable without needing any extra implementation in the issues service.
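One possible filter, sketched under the assumption that each issue or pull request record carries a `user` login and an `author_association` (both field names are guesses for illustration). The `[bot]` suffix check is a heuristic: it matches GitHub App accounts such as dependabot[bot], but not bots that run under regular user accounts.

```javascript
const INTERNAL_ASSOCIATIONS = new Set(["OWNER", "MEMBER"]);

// GitHub App accounts such as dependabot[bot] have logins ending in "[bot]".
function isBot(login) {
  return login.endsWith("[bot]");
}

// Return the distinct logins of external, non-bot authors.
// Assumed input shape: [{ user: string, author_association: string }, ...]
function externalHumanContributors(items) {
  const logins = new Set();
  for (const { user, author_association } of items) {
    if (!INTERNAL_ASSOCIATIONS.has(author_association) && !isBot(user)) {
      logins.add(user);
    }
  }
  return [...logins];
}
```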

normalize trailing slashes in urls

There are some duplicate projects in the db where one has a URL with a trailing slash and the other doesn't. These should be normalized if possible. Example: https://github.com/gavinsimpson/canadaHCD/
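A minimal sketch of the normalization, using the standard `URL` class. The function names and the `url` property on the project objects are assumptions for illustration.

```javascript
// Strip trailing slashes from the path so equivalent URLs compare equal.
function normalizeRepoUrl(rawUrl) {
  const url = new URL(rawUrl.trim());
  url.pathname = url.pathname.replace(/\/+$/, "");
  return url.toString();
}

// Keep the first project seen for each normalized URL.
function dedupeByUrl(projects) {
  const seen = new Map();
  for (const project of projects) {
    const key = normalizeRepoUrl(project.url);
    if (!seen.has(key)) seen.set(key, { ...project, url: key });
  }
  return [...seen.values()];
}

console.log(normalizeRepoUrl("https://github.com/gavinsimpson/canadaHCD/"));
```

Normalizing on write as well as in a one-off cleanup would stop new duplicates from appearing.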

Flag for determining if repository has new issues

Yo!

In the other "first issue" repo they included a handy flag to determine if a repository "has_new_issues". It was based on the repository having new issues opened in the last 7 days, I believe.

How about we add that to the repository object? What do you think?
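The flag could be derived from data already on each issue. A sketch, assuming each issue has an ISO 8601 `created_at` string and taking the 7-day window from the description above (which was itself a recollection, so the window is left configurable):

```javascript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// True when at least one issue was created within `windowMs` before `now`.
// Issues with an unparseable date are treated as not new.
function hasNewIssues(issues, now = new Date(), windowMs = SEVEN_DAYS_MS) {
  const cutoff = now.getTime() - windowMs;
  return issues.some(({ created_at }) => new Date(created_at).getTime() >= cutoff);
}
```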

Add dependencies as additional donations option for OpenClimate.fund.

As discussed yesterday in the OST meeting we can add the OST dependencies that are unique in this space. This information can also be used for our next publication.

Most of the data is already available here, based on issue #87: https://ost.ecosyste.ms/projects/dependencies

Andrew suggested that we remove the top 1% or 2% of project dependencies on each package manager, so that we get the dependencies that are unique to OST, based on programming language and category.
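One possible reading of that suggestion is sketched below: rank the dependencies within each package manager by how many projects use them and drop the top fraction. The input shape, the `ecosystem` and `count` field names and the choice to rank by usage inside the OST dataset (rather than by global popularity on the registry) are all assumptions.

```javascript
// Drop the most widely used `topFraction` of dependencies within each package manager.
// Assumed input shape: [{ name, ecosystem, count }, ...] where `count` is the number
// of projects using the dependency.
// Because of Math.ceil, at least one entry is dropped per non-empty ecosystem.
function dropMostCommon(dependencies, topFraction = 0.02) {
  const byEcosystem = new Map();
  for (const dep of dependencies) {
    if (!byEcosystem.has(dep.ecosystem)) byEcosystem.set(dep.ecosystem, []);
    byEcosystem.get(dep.ecosystem).push(dep);
  }
  const kept = [];
  for (const deps of byEcosystem.values()) {
    const sorted = [...deps].sort((a, b) => b.count - a.count);
    const dropCount = Math.ceil(sorted.length * topFraction);
    kept.push(...sorted.slice(dropCount));
  }
  return kept;
}
```

Whether small ecosystems should lose their single most used package is a judgment call; a minimum ecosystem size could be added before applying the cut.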

Clustering of projects using embeddings

To help automate discovery of new projects, I'd like to experiment with https://github.com/pgvector/pgvector and embeddings from a large language model to cluster projects together.

My plan is:

  • generate embeddings from the readme of each reviewed project
  • add the pgvector extension to postgresql
  • query the database for other projects closest to the embedding of the project
  • compare the "nearest" projects to their categories and topics
  • produce average vectors for each category of projects
  • produce an average of the vectors for all the reviewed projects
  • provide an interface (private for now due to API costs) for a newly proposed project to find out:
    • the closest existing projects
    • the closest categories
    • distance from each category average
    • distance from the total average
  • experiment with a selection of open source repositories (both climate related and totally unrelated) to find good distances to use as cut-off thresholds
  • experiment with including repo name, topics, description and other metadata when generating embeddings
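The averaging and distance steps in this plan can be sketched in plain JavaScript as below. In the actual setup the distance queries would presumably run inside PostgreSQL through pgvector, which offers a cosine distance operator (`<=>`); the function names and the shape of `embeddingsByCategory` here are assumptions for illustration.

```javascript
// Element-wise mean of equal-length vectors (a category or overall "average vector").
function meanVector(vectors) {
  if (vectors.length === 0) throw new Error("meanVector needs at least one vector");
  const sum = new Array(vectors[0].length).fill(0);
  for (const v of vectors) {
    v.forEach((x, i) => { sum[i] += x; });
  }
  return sum.map((x) => x / vectors.length);
}

// Cosine distance: 0 for vectors pointing the same way, 1 for orthogonal vectors.
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank categories by the distance between a candidate embedding and each category mean.
// `embeddingsByCategory` maps a category name to an array of embedding vectors.
function closestCategories(candidate, embeddingsByCategory) {
  return Object.entries(embeddingsByCategory)
    .map(([category, vectors]) => ({
      category,
      distance: cosineDistance(candidate, meanVector(vectors)),
    }))
    .sort((a, b) => a.distance - b.distance);
}
```

A cut-off threshold for "probably relevant" would then be a maximum value on `distance`, tuned against the mixed selection of related and unrelated repositories.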

Slimming down the data for the projects

Hey @andrew,

As promised, here's the mapping I've ended up using so far. It should give you insight into how we can slim down the projects endpoint for this.

const repositories = (await GetAllProjects()).map(
  ({ id, owner, name, url, language, repository, issues }) => {
    const { description, stargazers_count, license, last_synced_at } = repository;
    return {
      id: id.toString(),
      owner: owner.login,
      name,
      description,
      url,
      stars: stargazers_count,
      stars_display: formatStars(stargazers_count),
      license,
      last_modified: last_synced_at.toString(),
      language: { id: language, display: language },
      has_new_issues: false, // TODO: Keep this as is unless there's a way to determine the value
      issues: issues.map(
        ({ uuid, comments_count, created_at, number, title, labels, html_url }) => ({
          id: uuid,
          comments_count,
          created_at: created_at.toString(),
          number,
          title,
          labels: labels.map((label) => ({ id: label, display: label })),
          url: html_url
        })
      )
    };
  }
);

Increase the number of projects we can identify for good first issues

Here are some thoughts on how to increase the number of projects / issues we discover:

  1. Include other projects from the same namespace. Since many namespaces also host general projects outside the area of sustainability, this is more complex. We would first have to filter for the namespaces that have a clear reference to sustainability.

  2. Increase the time frame to last_updated in the last two years. I think that's an easy and valid way to go. Open source software development is often a slow, steady process, and older issues often remain relevant for years.

  3. Add popular dependencies of projects. We could add the X most popular / highly used first-level dependencies to the list of projects we investigate. We could still use the sustainability category of the main projects, so that a dependency gets various use case labels.

  4. Increase the total number of projects with automatic discovery based on NLP.

@andrew @Codeshark-NET

Missing projects in the API

Comparison of DOI to Citation APIs

The citation counts in the ecosystem repository API are all null. Therefore, as a workaround for our study, I implemented 3 different APIs in our notebook for mapping DOIs to citation counts. It might be interesting for ecosyste.ms to implement several of these APIs.

https://github.com/danielnsilva/semanticscholar
https://github.com/J535D165/pyalex
https://github.com/sckott/habanero

All Zenodo-created DOIs give me zero citations. It might be interesting to get the Zenodo metadata through its native API with a tool such as:
https://github.com/dvolgyes/zenodo_get
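For reference, a DOI lookup against OpenAlex (the service that pyalex wraps) can be sketched without a client library. The helper names are made up for this example, and the sketch does not escape unusual characters in DOIs. It returns `null` rather than 0 when the DOI is unknown, so "not indexed" can be told apart from "no citations".

```javascript
// Build the OpenAlex works URL for a DOI.
// Accepts "10.1234/abc", "doi:10.1234/abc" or a full https://doi.org/ URL.
function openAlexWorkUrl(doi) {
  const bare = doi.trim().replace(/^(https?:\/\/(dx\.)?doi\.org\/|doi:)/i, "");
  return `https://api.openalex.org/works/https://doi.org/${bare}`;
}

// Look up the citation count for a DOI.
// `fetchImpl` is injectable so the function can be tried without network access.
async function citationCount(doi, fetchImpl = fetch) {
  const response = await fetchImpl(openAlexWorkUrl(doi));
  if (response.status === 404) return null; // DOI not known to OpenAlex
  if (!response.ok) throw new Error(`OpenAlex request failed: ${response.status}`);
  const work = await response.json();
  return work.cited_by_count ?? null;
}
```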

Implementing categories for issues and projects

  • [high] As a developer, I'd like for the issues/projects data to include a "category" property, so that I can categorize/sort/filter on this in the client.

  • [low] As a developer, I'd like for the issues/projects to be filterable by "category", using a query param like ?cat= or similar, so that I can request a full category directly.

  • [low] As a developer, I'd like to be able to query the API for a list of all the categories available, not just the ones returned from issues/projects, so that I can use this to show empty categories & more.
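On the client side, the first and third stories might be consumed roughly as follows. This is a sketch assuming a `category` string on each item and a separately fetched list of all category names; both function names are hypothetical.

```javascript
// Group items by their `category` property; items without one go under "uncategorized".
function groupByCategory(items) {
  const groups = new Map();
  for (const item of items) {
    const key = item.category ?? "uncategorized";
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(item);
  }
  return groups;
}

// Combine the full category list with the grouped items so that
// empty categories still show up with a count of 0.
function categoryCounts(allCategories, items) {
  const groups = groupByCategory(items);
  return allCategories.map((name) => ({ name, count: groups.get(name)?.length ?? 0 }));
}
```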

Suggestion: Report feature

Howdy 🤠

As a user, if something is wrong with a project/repository, I would like to report that, adding a reason and an optional message.

This should help us in identifying potential issues or hiccups on the list.
This should be exposed via an endpoint.

^ Just a suggestion, adding it here for discussion
