xmichele / codemeta-harvester

This project forked from proycon/codemeta-harvester

Harvest and aggregate codemeta from source repositories and service endpoints, automatically converting known metadata schemes in the process. Improvements

License: GNU General Public License v3.0

Shell 95.55% Makefile 1.39% Dockerfile 3.06%

Codemeta Harvester

Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

This is a harvester for software metadata. It actively attempts to detect software metadata in source code repositories and converts it to a unified codemeta representation.

The tool is implemented as a simple POSIX shell script that in turn invokes a number of tools to do the actual work:

A few simple additional metadata extraction methods, written as simple shell scripts, have been implemented alongside the main script.

This harvester can be used for two purposes:

  1. to harvest a possibly large number of software projects, for instance to make them available in some kind of search portal.
  2. as a means to produce a codemeta.json file for your own project.

Installation

A Docker container image can be built as follows:

make docker

A pre-built container image can also be pulled from Docker Hub once the software is released:

docker pull proycon/codemeta-harvester

Alternatively, if you prefer not to use containers, you can install the software as follows:

  • Run make env to build a Python virtual environment in the env directory with the needed dependencies. This assumes you have a Python installation on your system.
  • Activate the environment with . env/bin/activate whenever you want to use it.
  • You will also need to install the following dependencies using your system's package manager:
    • git
    • curl
    • dasel
    • recode
    • coreutils or busybox
    • GNU Make
    • GNU awk

You can use make devenv if you want to rely on the latest development release of codemetapy rather than the latest stable version. This creates a devenv/ directory instead of env/.
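Before running the harvester outside a container, it can help to confirm that the system dependencies listed above are on your PATH. The following is a minimal sketch, not part of codemeta-harvester itself. The function name missing_commands is hypothetical, and the probed command names are assumptions: GNU awk is probed as gawk and GNU Make as make, which may differ on your system, and coreutils/busybox is not probed at all.

```shell
#!/bin/sh
# Print every command name given as an argument that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed command names; adjust them to how your distribution installs the tools.
missing=$(missing_commands git curl dasel recode make gawk)
if [ -n "$missing" ]; then
    printf 'Missing dependencies:\n%s\n' "$missing"
else
    echo "All probed dependencies were found."
fi
```

The script only reports what it could not find; it does not install anything and does not check versions.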

Usage: producing codemeta for your project

In your project directory, which ideally should be a git clone, you can just run codemeta-harvester to create a codemeta.json file based on the files in your repository:

codemeta-harvester

If you use the Docker container, the syntax is as follows:

docker run -v $(pwd):/data proycon/codemeta-harvester

The -v argument mounts your current working directory in the container; adapt it to your needs.

If you want to regenerate an existing codemeta.json rather than use it as input (the default behaviour), add the --regen parameter. This overwrites any existing codemeta.json.
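Because --regen overwrites the existing file, you may want to keep a copy first. The sketch below is one way to do that under stated assumptions: the helper name backup_codemeta and the .bak suffix are hypothetical and not something the harvester provides.

```shell
#!/bin/sh
# Copy an existing codemeta.json in the given directory (default: the current
# directory) to codemeta.json.bak. Succeeds silently if there is no codemeta.json.
backup_codemeta() {
    dir=${1:-.}
    if [ -f "$dir/codemeta.json" ]; then
        cp "$dir/codemeta.json" "$dir/codemeta.json.bak"
    fi
}

# Typical use, left commented out so that the sketch has no side effects:
# backup_codemeta . && codemeta-harvester --regen
```

If your project is under version control and codemeta.json is committed, git already serves the same purpose and the backup is unnecessary.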

The harvester can use the GitHub/GitLab API to query metadata from GitHub/GitLab, but only a limited number of anonymous requests is allowed. Please set the environment variable $GITHUB_TOKEN or $GITLAB_TOKEN to a GitHub personal access token or a GitLab personal access token (https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html). If you use Docker, pass it to the container using --env GITHUB_TOKEN=$GITHUB_TOKEN or --env GITLAB_TOKEN=$GITLAB_TOKEN.
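When running through Docker, the tokens can be forwarded only when they are actually set. The following sketch relies on the fact that docker run accepts --env NAME without a value and then passes the host's value of NAME into the container. The helper name token_env_args is hypothetical.

```shell
#!/bin/sh
# Print "--env NAME" on its own line for each token variable that is set and
# non-empty on the host. Passing "--env NAME" without "=value" lets Docker
# forward the host's value, so the token does not appear on the command line.
token_env_args() {
    for name in GITHUB_TOKEN GITLAB_TOKEN; do
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            printf '%s %s\n' --env "$name"
        fi
    done
}

# Typical use, left commented out so that the sketch has no side effects.
# The substitution is deliberately unquoted so it splits into separate arguments:
# docker run $(token_env_args) -v "$(pwd)":/data proycon/codemeta-harvester
```

For the variables to be forwarded this way they must be exported in the calling shell, for example with export GITHUB_TOKEN.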

Usage: harvesting metadata for various projects

To harvest and collect metadata from various projects, you need to create configuration files that tell the harvester where to look. These are simple YAML configuration files, one for each tool to harvest. They are put into a directory of your choice and take the following format:

source: https://github.com/user/repo
services:
    - https://example.org

The source property specifies a single source code repository where the source code of the tool lives. This must be a git repository that is publicly accessible. Note that you can specify only one repository here; choose the one that is representative of the software as a whole.

The services property lists zero or more URLs where the tool can be accessed as a service. This may be a web application, a simple webpage, or some other form of webservice. For webservices, rather than enumerating all service endpoints individually, point this to a URL that itself provides a specification of the endpoints, for example a URL serving an OpenAPI specification. The information provided here will be expressed in the resulting codemeta.json through the targetProduct schema.org property as described in issue codemeta/codemeta#271. This links the source code to specific instantiations of the software.

Additional properties you may specify:

  • root - The root path in the source code repository where to look for metadata. This can be set if the tool lives as a sub-part of a larger repository. Defaults to the repository root.
  • scandirs - Subdirectories to scan for metadata, in case not everything lives in the root directory.
  • ref - The git reference (a branch name or tag name) to use. You can set this if you want to harvest one particular version. If not set, codemeta-harvester will check out the latest version tag by default (this assumes you use some kind of semantic versioning for your tags). Only if no tags are present at all does it fall back to using the master or main branch directly.
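If you harvest many tools, writing the configuration files by hand gets tedious. The sketch below generates one file from a few arguments. It is an illustration only: the function name write_config, the .yml extension, and the placeholder values are assumptions, and it assumes that the services property may be left out when there are no service URLs. Only source, ref and services are covered, since the exact value format of root and scandirs is not spelled out above.

```shell
#!/bin/sh
# Write one harvester configuration file.
# Usage: write_config FILE SOURCE_URL [REF] [SERVICE_URL]
# REF and SERVICE_URL are optional; empty values are skipped.
write_config() {
    file=$1
    source_url=$2
    ref=${3:-}
    service=${4:-}
    {
        printf 'source: %s\n' "$source_url"
        if [ -n "$ref" ]; then
            printf 'ref: %s\n' "$ref"
        fi
        if [ -n "$service" ]; then
            printf 'services:\n    - %s\n' "$service"
        fi
    } > "$file"
}

# Example with placeholder values (tool name, URLs and tag are made up),
# left commented out so that the sketch has no side effects:
# mkdir -p configdir
# write_config configdir/mytool.yml https://github.com/user/repo v1.0.0 https://example.org
```

The values are written without quoting, which is fine for plain URLs and tag names but would need care for values containing YAML special characters.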

Pass the directory where you put your configurations (or a single configuration file) to codemeta-harvester as follows:

codemeta-harvester /path/to/your/configdir/

Or for Docker:

docker run -v /path/to/your/configdir/:/config -v $(pwd):/data proycon/codemeta-harvester /config

Acknowledgement

This software was funded in the scope of the CLARIAH-PLUS project.

Contributors

proycon
