ecosyste-ms / ost

A curated list of open technology projects to sustain a stable climate, energy supply, biodiversity and natural resources, based on data from https://opensustain.tech

Home Page: https://ost.ecosyste.ms

License: GNU Affero General Public License v3.0

Dockerfile 0.51% Ruby 67.66% Procfile 0.07% JavaScript 0.08% SCSS 0.06% HTML 31.62%

ost's Introduction

Ecosyste.ms: OST

This project is part of Ecosyste.ms: Tools and open datasets to support, sustain, and secure critical digital infrastructure.

API

Documentation for the REST API is available here: https://ost.ecosyste.ms/docs

The default rate limit for the API is 5,000 requests per hour, based on your IP address. Get in contact if you need to increase your rate limit.
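A client that wants to stay under the default limit can simply space out its requests. The following is a minimal sketch, assuming a fixed budget of 5,000 requests per hour; `minIntervalMs` and `runThrottled` are illustrative names, not part of the API:

```javascript
// Default limit quoted above: 5,000 requests per hour per IP address.
const DEFAULT_LIMIT_PER_HOUR = 5000;

// Smallest delay between requests (in ms) that keeps a client under the limit.
function minIntervalMs(limitPerHour = DEFAULT_LIMIT_PER_HOUR) {
  if (!(limitPerHour > 0)) throw new RangeError("limitPerHour must be positive");
  return Math.ceil((60 * 60 * 1000) / limitPerHour);
}

// Run async tasks one after another, pausing between them (hypothetical helper).
async function runThrottled(tasks, intervalMs = minIntervalMs()) {
  const results = [];
  for (const [index, task] of tasks.entries()) {
    if (index > 0) await new Promise((resolve) => setTimeout(resolve, intervalMs));
    results.push(await task());
  }
  return results;
}
```

With the default limit this works out to one request every 720 ms.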

Development

For development and deployment documentation, check out DEVELOPMENT.md

Contribute

Please do! The source code is hosted at GitHub. If you want something, open an issue or a pull request.

If you want to contribute but don't know where to start, take a look at the issues tagged as "Help Wanted".

You can also help triage issues. This can include reproducing bug reports, or asking for vital information such as version numbers or reproduction instructions.

Finally, this is an open source project. If you would like to become a maintainer, we will consider adding you if you contribute frequently to the project. Feel free to ask.

For other updates, follow the project on Twitter: @ecosyste_ms.

Note on Patches/Pull Requests

Fork the project.
Make your feature addition or bug fix.
Add tests for it. This is important so we don't break it in a future version unintentionally.
Send a pull request. Bonus points for topic branches.

Vulnerability disclosure

We support and encourage security research on Ecosyste.ms under the terms of our vulnerability disclosure policy.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Copyright

Code is licensed under the GNU Affero General Public License v3.0 © 2023 Andrew Nesbitt.

Data from the API is licensed under CC BY-SA 4.0.

ost's Issues

Add namespace support for curated lists

Some very modular projects are managed in an entire namespace created just for that project. It might be helpful to allow the listing of whole namespaces. Unfortunately, this might break many of the conventions we have so far.

How do we deal with projects in a namespace that are not open source?
Some projects in a namespace might just be bad.
Some projects might be very old.

So far we have tried to add the core repository, but in many cases this does not work when projects are very modular. Here is a well-known example: https://github.com/IPCC-WG1

I do not know of a solution for this yet.

Detect the spoken language of the project

Primarily via the readme, maybe also via the description.

https://github.com/jtoy/cld is good for this.

Missing projects in the download statistics

I compared the plots we did about Python dependence last year with the new dataset. Here are some projects that are missing from the statistics but should be visible:

https://pypi.org/project/xarray/ --> is listed but the download numbers are not correct: https://pypistats.org/packages/xarray
https://pypi.org/project/plantcv/
https://pypi.org/project/codecarbon/
https://pypistats.org/packages/tsam
https://github.com/powsybl/pypowsybl
https://pypi.org/project/climetlab/
https://pypi.org/project/intake-esm/
https://github.com/Unidata/siphon
https://github.com/oemof/oemof

I will also spot-check the other programming languages, filtering for the projects with the most stars.

'last activity' doesn't seem to be updating

@jamescrowley pointed out that some projects do not update 'last activity'. Here is an example:

ClimateTriage shows last activity: '4 months' for the following project:
https://github.com/Green-Software-Foundation/if

Avoid returning null values for certain fields

As we discussed briefly earlier, certain fields are returning null (sometimes).

This goes for:

"name" (see nsidc/earthaccess)
"language" (see softwareunderground/awesome-open-geoscience, metno/emep-ctm)
"category" (see nsidc/earthaccess)

If they could return "N/A" or some other static value when null (if that's supposed to happen) that'd be great.
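Until that is settled, a client can paper over the nulls itself. A minimal sketch, assuming "N/A" as the placeholder for the three fields listed above; `withFallbacks` is a hypothetical helper name:

```javascript
// Fields reported above as sometimes null, each with an assumed placeholder.
const FALLBACKS = { name: "N/A", language: "N/A", category: "N/A" };

// Return a copy of `project` where null or undefined fields get their placeholder.
function withFallbacks(project, fallbacks = FALLBACKS) {
  const result = { ...project };
  for (const [field, fallback] of Object.entries(fallbacks)) {
    if (result[field] == null) result[field] = fallback;
  }
  return result;
}
```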

Fetch latest activity on all branches

Sometimes there is no activity on the default branch of a repo, but there is activity on another branch.
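Once per-branch commit dates are available, picking the latest activity is a simple maximum. A rough sketch, assuming a hypothetical input shape of `{ name, lastCommitAt }` per branch with ISO 8601 dates (how those dates are fetched is left open):

```javascript
// `branches` is an assumed shape: [{ name, lastCommitAt }] with ISO 8601 dates.
function latestActivity(branches) {
  let latest = null;
  for (const branch of branches) {
    const time = Date.parse(branch.lastCommitAt);
    if (Number.isNaN(time)) continue; // skip branches without a usable date
    if (latest === null || time > latest.time) latest = { branch: branch.name, time };
  }
  // Returns null when no branch has a usable date.
  return latest && { branch: latest.branch, at: new Date(latest.time).toISOString() };
}
```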

Naming of timestamps is unclear

I took a deep dive into the metadata over the last few days and noticed that the naming of timestamps is often unclear, and that the same name is used multiple times across the API. It might be helpful for users if every value had a unique name across the API. For example:

created_at -> entry_created_at
last_sync -> repository_last_sync
pushed_at -> latest_push_to_repo_at

I would also expose the pushed_at value at the top level, as this is likely to be a very important value for most users, showing the last time the project had any active development activity.
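To make the proposal concrete, here is a sketch of the renaming applied on the client side. It assumes `created_at` sits on the project entry while `last_sync` and `pushed_at` sit on the nested `repository` object; that placement is a guess, and `renameTimestamps` is a hypothetical name:

```javascript
// Apply the proposed names and lift the latest push to the top level.
function renameTimestamps(project) {
  const { created_at, repository, ...rest } = project;
  const { last_sync, pushed_at, ...repoRest } = repository ?? {};
  return {
    ...rest,
    entry_created_at: created_at ?? null,
    latest_push_to_repo_at: pushed_at ?? null, // exposed at the top level
    repository: { ...repoRest, repository_last_sync: last_sync ?? null },
  };
}
```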

Add issues under Projects

Allow open issues, as returned by the /issues/ endpoint, to be listed under the individual projects when pulling the /projects/ endpoint.

Most popular dependencies of projects in each category

For each category, get a list of direct dependencies for each project and group them up to show the top 50 most used dependencies.

We may need to filter out some very popular dependencies that show up for every category, potentially making a top 20 overall that can include them instead.
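The grouping could look roughly like this. It is only a sketch: the input shape `{ category, dependencies }` and the `exclude` option for the ubiquitous packages are assumptions, not the actual data model:

```javascript
// `projects` is an assumed shape: [{ category, dependencies: ["name", ...] }].
function topDependenciesByCategory(projects, { limit = 50, exclude = [] } = {}) {
  const excluded = new Set(exclude);
  const counts = new Map(); // category -> Map(dependency -> number of projects)
  for (const { category, dependencies = [] } of projects) {
    if (!counts.has(category)) counts.set(category, new Map());
    const perCategory = counts.get(category);
    for (const dep of new Set(dependencies)) { // count each project only once
      if (excluded.has(dep)) continue;
      perCategory.set(dep, (perCategory.get(dep) ?? 0) + 1);
    }
  }
  const result = {};
  for (const [category, perCategory] of counts) {
    result[category] = [...perCategory.entries()]
      .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // most used first, then by name
      .slice(0, limit)
      .map(([name, count]) => ({ name, count }));
  }
  return result;
}
```

Passing the overall top 20 as `exclude` keeps those packages out of the per-category lists.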

Bots may be counted as external users

I noticed that some projects are listed as having an external contributor (pull request or issue where the author association is not OWNER or MEMBER) but the only external contributor is a bot.

Example: https://github.com/suptower/weather-cli which has one external contributor but it's dependabot[bot]

This should be fixable without needing any extra implementation in the issues service.
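One way to do that is to skip bot logins when collecting external authors. A minimal sketch, assuming each issue or pull request exposes the author login as `user` and the association as `author_association` (the field names are assumptions), and relying on the "[bot]" suffix that GitHub App accounts such as dependabot[bot] carry:

```javascript
// Associations that count as internal, as described above.
const INTERNAL_ASSOCIATIONS = new Set(["OWNER", "MEMBER"]);

// GitHub App accounts such as dependabot[bot] have a "[bot]" suffix in their login.
function isBot(login) {
  return login.endsWith("[bot]");
}

// `items` is an assumed shape: [{ user, author_association }] for issues and pull requests.
function externalContributors(items) {
  const logins = new Set();
  for (const { user, author_association } of items) {
    if (INTERNAL_ASSOCIATIONS.has(author_association)) continue;
    if (isBot(user)) continue;
    logins.add(user);
  }
  return [...logins];
}
```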

normalize trailing slashes in urls

We have some duplicate projects in the db where one has a URL with a trailing slash and the other doesn't. These should be normalized if possible. Example: https://github.com/gavinsimpson/canadaHCD/
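A sketch of the normalization and a de-duplication pass built on it. Comparing URLs case-insensitively is an extra assumption here, and both function names are illustrative:

```javascript
// Remove surrounding whitespace and any trailing slashes.
function stripTrailingSlashes(url) {
  return url.trim().replace(/\/+$/, "");
}

// Keep the first project seen for each normalized URL.
function dedupeByUrl(projects) {
  const seen = new Map();
  for (const project of projects) {
    const url = stripTrailingSlashes(project.url);
    const key = url.toLowerCase(); // assumption: URLs differing only in case are the same project
    if (!seen.has(key)) seen.set(key, { ...project, url });
  }
  return [...seen.values()];
}
```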

Flag for determining if repository has new issues

Yo!

In the other "first issue" repo they included a handy flag to determine if a repository "has_new_issues". It was based on the repository having new issues opened in the last 7 days, I believe.

How about we add that to the repository object? What do you think?
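For reference, the flag could be computed along these lines. This is a sketch assuming the 7-day window mentioned above and a `created_at` ISO 8601 timestamp on each issue; `hasNewIssues` is a hypothetical helper:

```javascript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// True when at least one issue was opened within the window ending at `now`.
function hasNewIssues(issues, now = new Date(), windowMs = SEVEN_DAYS_MS) {
  const end = now.getTime();
  const start = end - windowMs;
  return issues.some(({ created_at }) => {
    const created = Date.parse(created_at);
    return !Number.isNaN(created) && created >= start && created <= end;
  });
}
```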

Remove archived Projects

We still have some archived projects in the API. It would be great to filter them out. Here is one example:
https://github.com/cnumr/ecoCode

Add support for project urls that are from github pages

A GitHub Pages URL can be turned into a repository URL: foo.github.io/bar => github.com/foo/bar
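A minimal sketch of that conversion. It covers only the foo.github.io/bar pattern and returns null for everything else, including a Pages URL without a path; `pagesUrlToRepoUrl` is an illustrative name:

```javascript
// Convert https://<owner>.github.io/<repo>/... to https://github.com/<owner>/<repo>.
function pagesUrlToRepoUrl(pagesUrl) {
  let parsed;
  try {
    parsed = new URL(pagesUrl);
  } catch {
    return null; // not an absolute URL
  }
  const host = parsed.hostname.toLowerCase();
  if (!host.endsWith(".github.io")) return null;
  const owner = host.slice(0, -".github.io".length);
  const [repo] = parsed.pathname.split("/").filter(Boolean); // first path segment
  if (!owner || owner.includes(".") || !repo) return null;
  return `https://github.com/${owner}/${repo}`;
}
```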

Improve fetching of publications

Currently fetching all DOIs found in the readme that come from doi.org and exist in openalex.org.

List of DOIs that don't currently load: https://gist.github.com/andrew/ea895a3c7fe9173866ad2837abaef92b

Zenodo seems like the biggest source of missing data.
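For illustration, pulling doi.org links out of a readme can be sketched as below. This is not the code the service uses; the regular expression is an approximation of common DOI shapes and `extractDois` is a hypothetical name:

```javascript
// Matches doi.org links; the DOI itself starts with "10." followed by a registrant code.
const DOI_LINK = /https?:\/\/(?:dx\.)?doi\.org\/(10\.\d{4,9}\/[^\s"'<>)\]]+)/gi;

// Return the unique DOIs linked in `readme`, lowercased (DOIs are case-insensitive).
function extractDois(readme) {
  const dois = new Set();
  for (const match of readme.matchAll(DOI_LINK)) {
    // Drop trailing punctuation that belongs to the sentence, not the DOI.
    dois.add(match[1].replace(/[.,;:]+$/, "").toLowerCase());
  }
  return [...dois];
}
```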

Add dependencies as additional donations option for OpenClimate.fund.

As discussed yesterday in the OST meeting we can add the OST dependencies that are unique in this space. This information can also be used for our next publication.

Most of the data is already available here, based on issue #87: https://ost.ecosyste.ms/projects/dependencies

Andrew suggested that we remove the top 1% or 2% of project dependencies on each package manager, so that we get the unique dependencies for OST based on programming language and category.
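One possible reading of that suggestion, sketched under assumptions: each dependency carries an `ecosystem` (package manager) and a `dependents` count measuring how widely it is used, and the most used fraction per ecosystem is dropped. The field names and `dropMostPopular` are hypothetical:

```javascript
// `dependencies` is an assumed shape: [{ name, ecosystem, dependents }].
function dropMostPopular(dependencies, topFraction = 0.01) {
  const byEcosystem = new Map();
  for (const dep of dependencies) {
    if (!byEcosystem.has(dep.ecosystem)) byEcosystem.set(dep.ecosystem, []);
    byEcosystem.get(dep.ecosystem).push(dep);
  }
  const kept = [];
  for (const group of byEcosystem.values()) {
    const ranked = [...group].sort((a, b) => b.dependents - a.dependents);
    const dropCount = Math.ceil(ranked.length * topFraction); // at least one per non-empty group
    kept.push(...ranked.slice(dropCount));
  }
  return kept;
}
```

Because of the rounding up, small groups always lose their single most popular entry, which may or may not be what we want.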

Clustering of projects using embeddings

To help automate discovery of new projects, I'd like to experiment with https://github.com/pgvector/pgvector and embeddings from a large language model to cluster projects together.

My plan is:

Slimming down the data for the projects

As promised, here's the mapping I've ended up using so far. It should give you insight into how we can slim down the projects endpoint for this.

// Map each project from the API down to the fields the client actually uses.
// GetAllProjects and formatStars are helpers defined elsewhere in the client.
const repositories = (await GetAllProjects()).map(
  ({ id, owner, name, url, language, repository, issues }) => {
    // Only a handful of the nested repository fields are needed.
    const { description, stargazers_count, license, last_synced_at } = repository;
    return {
      id: id.toString(),
      owner: owner.login,
      name,
      description,
      url,
      stars: stargazers_count,
      stars_display: formatStars(stargazers_count),
      license,
      last_modified: last_synced_at.toString(),
      language: { id: language, display: language },
      has_new_issues: false, // TODO: Keep this as is unless there's a way to determine the value
      // Issues are reduced to what is needed to render a list entry.
      issues: issues.map(
        ({ uuid, comments_count, created_at, number, title, labels, html_url }) => ({
          id: uuid,
          comments_count,
          created_at: created_at.toString(),
          number,
          title,
          labels: labels.map((label) => ({ id: label, display: label })),
          url: html_url
        })
      )
    };
  }
);
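The mapping above calls a `formatStars` helper that is not shown. A hypothetical stand-in, assuming the goal is a compact label such as "1.2k" (the real helper may format differently):

```javascript
// Hypothetical stand-in for the `formatStars` helper referenced above.
function formatStars(count) {
  if (count == null) return "0";
  if (count < 1000) return String(count);
  const thousands = Math.floor(count / 100) / 10; // one decimal place, rounded down
  return `${thousands}k`;
}
```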

Add the education section as another category?

Would it be easy to add the education section just as another category? We could also fix this on the dataset side, but here I'm not sure how to deal with the subcategories.
https://github.com/protontypes/open-sustainable-technology/edit/main/education.md

As part of the API this data would already be part of ClimateTriage and further analytics.

Ignore bot activity in issue/pull request checks

One important piece of feedback on the review function: it looks like pull requests created by bots are also counted as activity.

Try to normalize homepage urls

i.e. add http to URLs that are missing it
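A small sketch of that normalization, assuming that any value without a scheme should get `http://` prepended; `ensureScheme` is an illustrative name:

```javascript
// Prepend a scheme to homepage values that lack one; leave complete URLs untouched.
function ensureScheme(homepage, scheme = "http") {
  const trimmed = homepage.trim();
  if (trimmed === "") return trimmed;
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)) return trimmed; // already has a scheme
  return `${scheme}://${trimmed.replace(/^\/+/, "")}`;
}
```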

Increase the number of projects we can identify for good first issues

Here are some thoughts on how to increase the number of projects / issues we discover:

Include other projects from the same namespace. Since many namespaces are also used for general projects outside the area of sustainability, this is more complex. We would first have to filter for the namespaces that have a clear reference to sustainability.
Increase the time frame to last_updated in the last two years. I think that's an easy and valid way to go. Open source software development is often a slow, steady process, and older issues often remain relevant for years.
Add popular dependencies of projects. We could add the X most popular / most highly used first-level dependencies to the list of projects we investigate. We could still use the sustainability category of the main projects, so that a dependency gets various use case labels.
Increase the total number of projects with automatic discovery based on NLP.

@andrew @Codeshark-NET

Missing projects in the API

I did a random check to see if any projects were missing, as the total number of projects seemed a bit low to me. Here are the projects I could find so far whose issues we should be getting, but which are not visible in the API:

Improve issue syncing

The following issues have not been updated since being closed:

OST and the issues service should be syncing and updating these on a daily basis.

Comparison of DOI to Citation APIs

The citation counts in the ecosyste.ms repository API are all null. Therefore, as a workaround for our study, I implemented 3 different APIs in our notebook for mapping DOIs to citation counts. It might be interesting for ecosyste.ms to implement several of these APIs.

https://github.com/danielnsilva/semanticscholar
https://github.com/J535D165/pyalex
https://github.com/sckott/habanero

All Zenodo-created DOIs give me zero citations. It might be interesting to get the Zenodo metadata via its native API with a tool such as:
https://github.com/dvolgyes/zenodo_get
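As an illustration of the DOI-to-citation lookup, here is a rough sketch against OpenAlex, the service that pyalex wraps. It relies on OpenAlex accepting a full doi.org URL as a work identifier and returning a `cited_by_count` field; the function names and the error handling are my own assumptions:

```javascript
// Build the OpenAlex "works" URL for a DOI given either bare or as a doi.org link.
function openAlexWorkUrl(doi) {
  const bare = doi.trim().replace(/^https?:\/\/(?:dx\.)?doi\.org\//i, "");
  return `https://api.openalex.org/works/https://doi.org/${bare}`;
}

// Resolve to the citation count, or null when OpenAlex has no record for the DOI.
// `fetchImpl` defaults to the global fetch and can be swapped out for testing.
async function citationCount(doi, fetchImpl = fetch) {
  const response = await fetchImpl(openAlexWorkUrl(doi));
  if (response.status === 404) return null;
  if (!response.ok) throw new Error(`OpenAlex request failed: ${response.status}`);
  const work = await response.json();
  return work.cited_by_count ?? null;
}
```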

Implementing categories for issues and projects

[high] As a developer, I'd like for the issues/projects data to include a "category" property, so that I can categorize/sort/filter on this in the client.
[low] As a developer, I'd like for the issues/projects to be filterable by "category", using a query param like ?cat= or similar, so that I can request a full category directly.
[low] As a developer, I'd like to be able to query the API for a list of all the categories available, not just the ones returned from issues/projects, so that I can use this to show empty categories & more.
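On the client, the three stories above would combine roughly as follows. This is a sketch assuming the proposed "category" property and a full category list are available; the "Uncategorized" bucket and `groupByCategory` are hypothetical:

```javascript
// `items` are issues or projects carrying the proposed "category" property;
// `allCategories` is the proposed full list, so empty categories still show up.
function groupByCategory(items, allCategories = []) {
  const groups = new Map(allCategories.map((category) => [category, []]));
  for (const item of items) {
    const category = item.category ?? "Uncategorized"; // assumed bucket for missing values
    if (!groups.has(category)) groups.set(category, []);
    groups.get(category).push(item);
  }
  return groups;
}
```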

Suggestion: Report feature

Howdy 🤠

As a user, if something is wrong with a project/repository, I would like to report that, adding a reason and an optional message.

This should help us in identifying potential issues or hiccups on the list.
This should be exposed via an endpoint.

^ Just a suggestion, adding it here for discussion
