
A Python package for generating and working with codemeta

Home Page: https://codemeta.github.io/

License: GNU General Public License v3.0

Languages: Python (100.00%)
Topics: metadata, metadata-extractor, scientific, codemeta, linked-data, schema-org

codemetapy's Introduction

Project Status: Active -- The project has reached a stable, usable state and is being actively developed.

Codemetapy

Codemetapy is a command-line tool to work with the codemeta software metadata standard. Codemeta builds upon schema.org and defines a vocabulary for describing software source code. It maps various existing metadata standards to a unified vocabulary.

For more general information about the CodeMeta Project for defining software metadata, see https://codemeta.github.io. In particular, new users might want to start with the User Guide, while those looking to learn more about JSON-LD and consuming existing codemeta files should see the Developer Guide.

Using codemetapy you can generate a codemeta.json file, which serialises as JSON-LD, for your software. At the moment it supports conversions from the following existing metadata specifications:

  • Python distutils/pip packages (setup.py/pyproject.toml)
  • Java/Maven packages (pom.xml)
  • NodeJS packages (package.json)
  • Debian package (apt show output)
  • Github API (when passed a github URL)
  • GitLab API (when passed a GitLab URL)
  • Web sites/services (see the section on software types and services below):
    • Simple metadata from HTML <meta> elements.
    • Script blocks using application/json+ld

It can also read and manipulate existing codemeta.json files as well as parse simple AUTHORS/CONTRIBUTORS files. One of the most notable features of codemetapy is that it allows chaining to successively update a metadata description based on multiple sources. Codemetapy is used in that way by the codemeta-harvester.

Note: If you are looking for an all-in-one solution to automatically generate a codemeta.json for your project, then codemeta-harvester is the best place to start. It is a higher-level tool that automatically invokes codemetapy on the various sources it can detect, and combines those into a single codemeta representation.

Installation

pip install codemetapy

Usage

Query and convert any installed python package:

$ codemetapy somepackage

Output is written to standard output by default. To write it to a file instead, either redirect:

$ codemetapy somepackage > codemeta.json

or use the -O parameter:

$ codemetapy -O codemeta.json somepackage

If you are in the current working directory of any python project and there is a setup.py or pyproject.toml, then you can simply call codemetapy without arguments to output codemeta for the project. Codemetapy will automatically run python setup.py egg_info if needed and parse its output to facilitate this:

$ codemetapy

The tool also supports adding properties through parameters:

$ codemetapy --developmentStatus active somepackage > codemeta.json

To read an existing codemeta.json and extend it:

$ codemetapy -O codemeta.json codemeta.json somepackage

or even:

$ codemetapy -O codemeta.json codemeta.json codemeta2.json codemeta3.json

This makes use of an important characteristic of codemetapy: composition. When you specify multiple input sources, they are interpreted as referring to the same resource. Properties (on schema:SoftwareSourceCode) from later sources overwrite those from earlier ones. So if codemeta3.json specifies authors, all authors that were specified in codemeta2.json are lost rather than merged, and the end result will have the authors from codemeta3.json. However, if codemeta2.json has a property that is absent from codemeta3.json, say developmentStatus, then it will make it into the end result. In other words, the latest source always takes precedence, and any non-overlapping properties are merged. This functionality is heavily relied upon by the higher-level tool codemeta-harvester.
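The precedence rule can be illustrated with a minimal Python sketch. Codemetapy itself operates on RDF graphs, so this is only an analogy at the level of JSON properties, not the actual implementation:

```python
# Sketch of the composition rule: later sources overwrite earlier ones per
# property; non-overlapping properties survive the merge.
def compose(*sources: dict) -> dict:
    result: dict = {}
    for source in sources:
        result.update(source)  # later values win per key
    return result

earlier = {"developmentStatus": "active", "author": ["Alice Example"]}
later = {"author": ["Bob Example"]}

merged = compose(earlier, later)
# author comes from the later source; developmentStatus survives from the earlier one
```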

If you want to start from scratch and build using command line parameters, use /dev/null as input, and make sure to pass some identifier and code repository:

$ codemetapy --identifier some-id --codeRepository https://github.com/my/code /dev/null > codemeta.json

This tool can also deal with debian packages by parsing the output of apt show (albeit limited):

$ apt show somepackage | codemetapy -i debian -

Here - represents standard input, which enables you to use piping solutions on a unix shell. The -i parameter denotes the input types; you can chain as many as you want, but the number of input types specified must correspond exactly to the number of input sources (the positional arguments).

Some notes on Vocabulary

For codemeta:developmentStatus, codemetapy attempts to assign full repostatus URIs whenever possible. For schema:license, full SPDX URIs are used where possible.

Identifiers

We distinguish two types of identifiers. First, there is the URI or IRI that identifies RDF resources; it is a globally unique identifier and often looks like a URL.

Codemetapy will assign new URIs to resources if and only if you pass a base URI using --baseuri. Moreover, if you set this, codemetapy will forcibly set URIs over any existing ones, effectively assigning new identifiers; the previous identifier is then recorded via the owl:sameAs property instead. This allows you to take ownership of all URIs. Internally, codemetapy creates URIs for everything even if you don't specify a base URI (even for blank nodes), but these URIs are stripped again upon serialisation to JSON-LD.

The second identifier is the schema:identifier, of which there may even be multiple. Codemetapy typically expects such an identifier to be a simple unspaced string holding a name for software. For example, a Python package name would make a good identifier. If this property is present, codemetapy will use it when generating URIs. The schema:identifier property can be contrasted with schema:name, which is the human readable form of the name and may be more elaborate. The identifier is typically also used for other identifiers (such as DOIs, ISBNs, etc), which should come in the following form:

"identifier": {
    "@type": "PropertyValue",
    "propertyID": "doi",
    "value": "10.5281/zenodo.6882966"
}

But short-hand forms such as doi:10.5281/zenodo.6882966 or as a URL like https://doi.org/10.5281/zenodo.6882966 are also recognised by this library.
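A hypothetical helper (the function name and behaviour are illustrative, not codemetapy's actual API) showing how such shorthand forms can be normalised into the PropertyValue structure above:

```python
# Hypothetical normaliser for the DOI shorthand forms mentioned above.
def normalise_doi(value: str) -> dict:
    if value.startswith("https://doi.org/"):
        value = "doi:" + value[len("https://doi.org/"):]
    if value.startswith("doi:"):
        return {
            "@type": "PropertyValue",
            "propertyID": "doi",
            "value": value[len("doi:"):],
        }
    raise ValueError(f"not a recognised DOI form: {value}")

record = normalise_doi("https://doi.org/10.5281/zenodo.6882966")
# record["value"] == "10.5281/zenodo.6882966"
```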

Software Types and services

Codemetapy (since 2.0) implements an extension to codemeta that allows linking the software source code to the actual instantiation of the software, with explicit regard for the interface type. This is done via the schema:targetProduct property, which takes as range a schema:SoftwareApplication, schema:WebAPI, schema:WebSite or any of the extra types defined in https://github.com/SoftwareUnderstanding/software_types/. This extension was proposed in this issue.

This extension is enabled by default and can be disabled by setting the --strict flag.

When you pass codemetapy a URL, it will assume this is where the software is run as a service, and attempt to extract metadata from the site and encode it via targetProduct. For example, here we read an existing codemeta.json and extend it with some place where it is instantiated as a service:

$ codemetapy codemeta.json https://example.org/

If served HTML, codemetapy will use your <script> block using application/json+ld if it provides a valid software type (as mentioned above). For other HTML, codemetapy will simply extract some metadata from HTML <meta> elements. Content negotiation is used, and we favour json+ld, json, and even yaml and XML over HTML.
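Extracting metadata from <meta> elements can be sketched with the standard library's html.parser. This is an illustration of the idea, not codemetapy's actual scraper:

```python
from html.parser import HTMLParser

# Collect <meta name="..." content="..."> pairs from an HTML document.
class MetaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta: dict = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"]] = d["content"]

parser = MetaExtractor()
parser.feed('<html><head><meta name="description" content="A tool"></head></html>')
# parser.meta == {"description": "A tool"}
```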

(Note: the older Entrypoint Extension from before codemetapy 2.0 is now deprecated)

Graph

You can use codemetapy to generate one big knowledge graph expressing multiple codemeta resources using the --graph parameter:

$ codemetapy --graph resource1.json resource2.json

This will produce JSON-LD output with multiple resources in the graph.

Github API

Codemetapy can make use of the Github API to query metadata from GitHub, but this allows only a limited number of anonymous requests before you hit a rate limit. To allow more requests, set the environment variable $GITHUB_TOKEN to a personal access token.

GitLab API

Codemetapy can make use of the GitLab API to query metadata from GitLab, but this allows only a limited number of anonymous requests before you hit a rate limit. To allow more requests, set the environment variable $GITLAB_TOKEN to a personal access token.

Integration in setup.py

You can integrate codemeta.json generation in your project's setup.py. This adds an extra python setup.py codemeta command that will generate a new metadata file or update an already existing one. Note that this must be run after python setup.py install (or python setup.py develop).

To integrate this, add the following to your project's setup.py:

try:
    from codemeta.codemeta import CodeMetaCommand
    cmdclass={
        'codemeta': CodeMetaCommand,
    }
except ImportError:
    cmdclass={}

And in your setup() call add the parameter:

cmdclass=cmdclass

This will ensure your setup.py works in all cases, even if codemetapy is not installed, and that the command will be available if codemetapy is available.

If you want to ship your package with the generated codemeta.json, then simply add a line saying codemeta.json to the file MANIFEST.in in the root of your project.

Acknowledgements

This work is conducted at the KNAW Humanities Cluster's Digital Infrastructure department in the scope of the CLARIAH project (CLARIAH-PLUS, NWO grant 184.034.023) as part of the FAIR Tool Discovery track of the Shared Development Roadmap.

codemetapy's People

Contributors

proycon


codemetapy's Issues

codemetapy fails on pyproject.toml if some other tool section comes before tool.poetry

[tool.something]
this = "does not work"

[tool.poetry]
name = "dummy-project"
version = "0.1.0"
description = ""
authors = ["John Doe <[email protected]>"]
readme = "README.md"
packages = [{include = "dummy_project"}]

include = [
  "CHANGELOG.md",
  { path = "tests", format = "sdist" },
]

[tool.poetry.dependencies]
python = "^3.8"
pyproject-parser = "^0.9.0"
tomli = "^2.0.1"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Result:

Passed 1 files/sources but specified 0 input types! Automatically guessing types...
Detected input types: [('pyproject.toml', 'python')]
Note: You did not specify a --baseuri so we will not provide identifiers (IRIs) for your SoftwareSourceCode resources (and others)
Initial URI automatically generated, may be overriden later: file:///pyproject-toml
Processing source #1 of 1
Obtaining python package metadata for: pyproject.toml
Loading metadata from pyproject.toml via pyproject-parser
Failed to find complete enough metadata in pyproject.toml
Fallback: Loading metadata from pyproject.toml via PEP517
...
(fails)

Expected behavior: codemetapy finds the tool.poetry section regardless of its position. TOML, like JSON, has no intrinsic order for objects, so position should make no difference.

Solution:

Fix these lines: https://github.com/proycon/codemetapy/blob/master/codemeta/parsers/python.py#L155-L156

The object tool is a dict, so why not just look it up directly using pyproject.tool["poetry"]?

I don't think it makes sense to assume that other tools would provide equivalent metadata, so keeping it that "open" does not work (e.g. we are building a new tool to specify exactly that kind of metadata, and this is how I stumbled on this bug). Assuming that the first section is the right one is incorrect anyway.
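For illustration, the parsed pyproject.toml (as returned by e.g. tomllib.loads) is just a nested dict, so a direct lookup is order-independent. A hypothetical sketch of the suggested fix, not the actual codemetapy code:

```python
# Stand-in for the nested dict that a TOML parser returns for the
# pyproject.toml quoted above; note [tool.poetry] is not the first section.
pyproject = {
    "tool": {
        "something": {"this": "comes first"},  # some other tool's section
        "poetry": {"name": "dummy-project", "version": "0.1.0"},
    },
}

# Direct dict lookup finds the section regardless of its position in the file.
poetry = pyproject.get("tool", {}).get("poetry")
# poetry == {"name": "dummy-project", "version": "0.1.0"}
```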

Fails if Python package has no requirements: TypeError: 'NoneType' object is not iterable

I'm trying to run codemetapy on my rlacalc Python package, and ran into two problems. Here they are, with workarounds for both of them.

Based on this from the README:

If you are in the current working directory of any python project, i.e. there is a setup.py, then you can simply call codemetapy without arguments to output codemeta for the project.

I just tried running it in the root directory, which didn't work:

Exception: No input files specified (use - for stdin)

Since modern Python development often doesn't involve the dangerous process of running setup.py, I suggest adding the ability to parse a pyproject.toml file, and in the meantime, changing the README to say something more like "If you are in the current working directory of a python project with a setup.py, ..."

So I switched to adding a package name, and that also failed:

$ codemetapy rlacalc
Passed 1 files/sources but specified 0 input types! Automatically guessing types...
Detected input types: [('rlacalc', 'python')]
Downloading context from https://raw.githubusercontent.com/codemeta/codemeta/2.0/codemeta.jsonld
Downloading context from https://raw.githubusercontent.com/schemaorg/schemaorg/main/data/releases/13.0/schemaorgcontext.jsonld
Downloading context from https://w3id.org/software-types
Downloading context from https://raw.githubusercontent.com/proycon/repostatus.org/ontology/badges/latest/ontology.jsonld
URI automatically generated, may be overriden later: /rlacalc
Processing source #1 of 1
Obtaining python package metadata for: rlacalc
Found metadata in /home/neal/.local/lib/python3.8/site-packages/rlacalc-0.3.0.dist-info
WARNING: No translation for distutils key Metadata-Version
WARNING: No translation for distutils key Requires-Python
Traceback (most recent call last):
  File "/home/neal/.local/bin/codemetapy", line 8, in <module>
    sys.exit(main())
  File "/home/neal/.local/lib/python3.8/site-packages/codemeta/codemeta.py", line 128, in main
    output = build(**args.__dict__)
  File "/home/neal/.local/lib/python3.8/site-packages/codemeta/codemeta.py", line 290, in build
    prefuri = codemeta.parsers.python.parse_python(g, res, source, crosswalk, args)
  File "/home/neal/.local/lib/python3.8/site-packages/codemeta/parsers/python.py", line 121, in parse_python
    for value in pkg.requires:
TypeError: 'NoneType' object is not iterable

That seems to be because I don't have a requirements file, nor any requirements beyond Python....

It works to guard the for loop in line 121
like this:

if pkg.requires is not None:
    for value in pkg.requires:
        ...

But I also wonder if importlib shouldn't make that an empty array instead of None. So I'll leave the fix up to you.
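The suggested guard can also be written with an `or []` idiom. A small self-contained sketch, where FakePkg is a stand-in for the importlib.metadata distribution object whose .requires attribute may be None:

```python
# FakePkg mimics a distribution whose .requires is None because the package
# declares no dependencies; `or []` turns that into an empty iteration.
class FakePkg:
    requires = None

pkg = FakePkg()
dependencies = []
for value in pkg.requires or []:
    dependencies.append(value)
# dependencies stays [] instead of the loop raising TypeError
```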

Error when parsing python projects

Hi there. First, thanks for your work. So far I could not make it work for any of my python projects that use pyproject.toml with poetry as a build system; I always end up with the error below. So I tried whether I can do it for codemetapy itself after a simple pip install of it, and it fails in the same way.

I am on linux. Python 3.8.10

$codemetapy codemetapy > codemeta.json
Passed 1 files/sources but specified 0 input types! Automatically guessing types...
Detected input types: []
Traceback (most recent call last):
  File "/work/envs/db/bin/codemetapy", line 8, in <module>
    sys.exit(main())
  File "/work/envs/db/lib/python3.8/site-packages/codemeta/codemeta.py", line 136, in main
    g, res, args, contextgraph = build(**args.__dict__)
  File "/work/envs/db/lib/python3.8/site-packages/codemeta/codemeta.py", line 288, in build
    identifier = os.path.basename(inputsources[0][0]).lower()
IndexError: list index out of range

Add support for .zenodo.jsons

Many projects publish their code to Zenodo and put rich manual metadata into a .zenodo.json file in the repository.
One could also harvest this file.

Add support for ORCIDs

Authors are best identified by their ORCID. We ideally need a way of resolving user emails to ORCIDs automatically (does their API offer such a function?).

Equivalent to `--no-extras` in v2.0.0+?

In v0.3.3 there was a --no-extras parameter that seems to have been removed from codemetapy v2.0.0+.

$ docker run --rm -ti python:3.10 /bin/bash
root@127fa490e5df:/# python -m venv venv && . venv/bin/activate
(venv) root@127fa490e5df:/# python -m pip install --upgrade pip setuptools wheel
(venv) root@127fa490e5df:/# python -m pip --quiet install pyhf
(venv) root@127fa490e5df:/# python -m pip --quiet install 'codemetapy==0.3.3'
(venv) root@127fa490e5df:/# codemetapy --no-extras pyhf
(venv) root@127fa490e5df:/# python -m pip --quiet install 'codemetapy==2.0.0'
(venv) root@127fa490e5df:/# codemetapy --no-extras pyhf
usage: codemetapy [-h] [-t] [--exact-python-version] [--single-author] [-b BASEURI] [-B BASEURL] [-o OUTPUT] [-O OUTPUTFILE] [-i INPUTTYPES] [-g] [-s SELECT] [--css CSS]
                  [--no-cache] [--toolstore] [--strict] [--released] [--title TITLE] [--address ADDRESS] [--affiliation AFFILIATION] [--applicationCategory APPLICATIONCATEGORY]
                  [--applicationSubCategory APPLICATIONSUBCATEGORY] [--author AUTHOR] [--buildInstructions BUILDINSTRUCTIONS] [--citation CITATION] [--codeRepository CODEREPOSITORY]
                  [--contIntegration CONTINTEGRATION] [--contributor CONTRIBUTOR] [--copyrightHolder COPYRIGHTHOLDER] [--copyrightYear COPYRIGHTYEAR] [--dateCreated DATECREATED]
                  [--dateModified DATEMODIFIED] [--datePublished DATEPUBLISHED] [--description DESCRIPTION] [--developmentStatus DEVELOPMENTSTATUS] [--downloadUrl DOWNLOADURL]
                  [--editor EDITOR] [--email EMAIL] [--embargoDate EMBARGODATE] [--encoding ENCODING] [--familyName FAMILYNAME] [--fileFormat FILEFORMAT] [--fileSize FILESIZE]
                  [--funder FUNDER] [--funding FUNDING] [--givenName GIVENNAME] [--hasPart HASPART] [--id ID] [--identifier IDENTIFIER] [--installUrl INSTALLURL]
                  [--isAccessibleForFree ISACCESSIBLEFORFREE] [--isPartOf ISPARTOF] [--issueTracker ISSUETRACKER] [--keywords KEYWORDS] [--license LICENSE] [--maintainer MAINTAINER]
                  [--memoryRequirements MEMORYREQUIREMENTS] [--name NAME] [--operatingSystem OPERATINGSYSTEM] [--permissions PERMISSIONS] [--position POSITION]
                  [--processorRequirements PROCESSORREQUIREMENTS] [--producer PRODUCER] [--programmingLanguage PROGRAMMINGLANGUAGE] [--provider PROVIDER] [--publisher PUBLISHER]
                  [--readme README] [--referencePublication REFERENCEPUBLICATION] [--relatedLink RELATEDLINK] [--releaseNotes RELEASENOTES] [--runtimePlatform RUNTIMEPLATFORM]
                  [--sameAs SAMEAS] [--softwareHelp SOFTWAREHELP] [--softwareRequirements SOFTWAREREQUIREMENTS] [--softwareSuggestions SOFTWARESUGGESTIONS]
                  [--softwareVersion SOFTWAREVERSION] [--sponsor SPONSOR] [--storageRequirements STORAGEREQUIREMENTS] [--supportingData SUPPORTINGDATA]
                  [--targetProduct TARGETPRODUCT] [--type TYPE] [--url URL] [--version VERSION]
                  [inputsources ...]
codemetapy: error: unrecognized arguments: --no-extras

In codemetapy v2.2.0 is there a set of options that can get a similar effect as --no-extras?

Adapt to codemeta v3 release

Codemeta 3 was released recently. I will need to investigate what has changed and how these changes affect codemetapy, so that we can be compatible with the latest release (but hopefully also retain compatibility with codemeta 2).

add ORCIDs

It would be nice if this added ORCIDs, or at least a placeholder for them to be filled in manually later

Error on Windows only?

After creating a fresh venv on a windows system with python 3.10, installing codemetapy and executing it on a cloned repo I get the following error:

No input files specified, but found python project (pyproject.toml) in current dir, using that...
Downloading context from https://raw.githubusercontent.com/codemeta/codemeta/2.0/codemeta.jsonld
Traceback (most recent call last):
  File "C:\git\codemetapytest\.venv\Scripts\codemetapy-script.py", line 33, in <module>
    sys.exit(load_entry_point('CodeMetaPy==2.2.2', 'console_scripts', 'codemetapy')())
  File "C:\git\codemetapytest\.venv\lib\site-packages\codemeta\codemeta.py", line 136, in main
    g, res, args, contextgraph = build(**args.__dict__)
  File "C:\git\codemetapytest\.venv\lib\site-packages\codemeta\codemeta.py", line 263, in build
    g, contextgraph = init_graph(args.no_cache)
  File "C:\git\codemetapytest\.venv\lib\site-packages\codemeta\common.py", line 275, in init_graph
    init_context(no_cache)
  File "C:\git\codemetapytest\.venv\lib\site-packages\codemeta\common.py", line 257, in init_context
    with open(localfile, 'wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp\\codemeta.jsonld'

The last line makes me think that this is an error about combining unix and windows filepath delimiters?

Implement a validation component using SHACL

In CLARIAH we're working on a SHACL graph (CLARIAH/tool-discovery#2, see also CLARIAH/clariah-plus#50) to allow validation of codemeta.json files and feedback to the providers.

A validation component needs to be implemented in codemetapy that can take an any SHACL file (because not everybody will agree on one across institutes/projects). I'm planning on using https://github.com/RDFLib/pySHACL for the implementation.

Though most will be in codemetapy, adaptions will also need to be made in codemeta-harvester and possibly codemeta-server .

Error on windows usage

When I use codemetapy on Windows, the error below occurs.

file://C:\Users\MUSTAF~1\AppData\Local\Temp\codemeta.jsonld does not look like a valid URI, trying to serialize this will break.
file://C:\Users\MUSTAF~1\AppData\Local\Temp\schemaorgcontext.jsonld does not look like a valid URI, trying to serialize this will break.
file://C:\Users\MUSTAF~1\AppData\Local\Temp\stype.jsonld does not look like a valid URI, trying to serialize this will break.
file://C:\Users\MUSTAF~1\AppData\Local\Temp\iodata.jsonld does not look like a valid URI, trying to serialize this will break.
file://C:\Users\MUSTAF~1\AppData\Local\Temp\repostatus.jsonld does not look like a valid URI, trying to serialize this will break.

I checked where the error occurs, and it looks like when creating the file path you add 'file://' to the string, as below.

SCHEMA_LOCAL_SOURCE = "file://" + os.path.join(TMPDIR, "schemaorgcontext.jsonld")
CODEMETA_LOCAL_SOURCE = "file://" + os.path.join(TMPDIR, "codemeta.jsonld")
STYPE_LOCAL_SOURCE = "file://" + os.path.join(TMPDIR, "stype.jsonld")
IODATA_LOCAL_SOURCE = "file://" + os.path.join(TMPDIR, "iodata.jsonld")
REPOSTATUS_LOCAL_SOURCE = "file://" + os.path.join(TMPDIR, "repostatus.jsonld")

If you check the issue I opened in rdflib, you can see that rdflib makes a string replacement like the below:

    if absolute_location.startswith("file:///"):
        filename = url2pathname(absolute_location.replace("file:///", "/"))
        file = open(filename, "rb")
    else:
        input_source = URLInputSource(absolute_location, format)

Since Windows paths do not start with '/', when you only add 'file://' to the beginning, rdflib cannot read the files and the errors mentioned above occur.
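A portable way to build such URIs is pathlib's as_uri(), which emits file:///C:/... on Windows and file:///tmp/... on Unix. This is a sketch of the idea, not necessarily the fix applied in codemetapy:

```python
from pathlib import Path

# as_uri() requires an absolute path and handles drive letters correctly,
# unlike hand-prefixing "file://" to os.path.join(...).
uri = Path("/tmp/codemeta.jsonld").absolute().as_uri()
# e.g. "file:///tmp/codemeta.jsonld" on Unix, "file:///C:/tmp/codemeta.jsonld" on Windows
```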

Implement filters in HTML visualisation

A simple form of faceted search should be implemented in the html visualisation. The query backend for this is already in place, it only needs some convenient access from the frontend.

Feature request: pre-commit hook to update codemeta.json

It would be great if codemetapy could be used as a pre-commit hook to automatically synchronize e.g. pyproject.toml to the codemeta.json.

Currently I have to add it to my dev dependencies and do this:

  - repo: local
    hooks:
      - id: codemetapy
        name: codemetapy
        language: system
        entry: poetry run codemetapy -O codemeta.json
        files: ^pyproject.toml$

Ideally, I could just point it to this repository like any other hook

  - repo: https://github.com/proycon/codemetapy
    rev: 'X.Y.Z'
    hooks:
      - id: codemetapy

and it should do something "smart" by default 🙂

Improve reconciliation algorithm

The reconciliation algorithm that merges data from multiple sources (= multiple RDF graphs) needs some further work:

  • Detect and remove duplicate people in authors/maintainers/contributors
  • Merging blank nodes from multiple graphs (#12) may still lead to some incorrect results: maybe we can tackle this and the above one by assigning some temporary IDs which we strip later
  • Expand the list of properties that may only take a single value
  • Handle merging of dates (favouring the earlier/latest date in case of a conflict)

Dealing with ordered lists

People express authors as lists in JSON-LD, but if this is done without the @list semantics (or without schema:position), the order remains undefined in RDF (unlike in JSON itself). We currently support schema:position in codemetapy, but we may want actual support for @list so people can express, for instance, authorship like:

"author": [ "First Author", "Second Author" ]

This is in line with discussions in codemeta/codemeta#272. The codemeta context indeed has "@container": "@list" for properties like author, as discussed in that issue. In our current implementation, however, we explicitly load (and even forcibly inject) the whole schema.org context as well, after the codemeta one, and the regular schema.org context does not use "@container": "@list".
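For reference, an author list with explicit @list semantics would be expressed with a context like this (a sketch of the idea discussed above, not a context codemetapy currently emits):

```json
{
  "@context": {
    "schema": "https://schema.org/",
    "author": {"@id": "schema:author", "@container": "@list"}
  },
  "author": ["First Author", "Second Author"]
}
```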

Context parsed wrong?

Currently codemetapy often generates a context for the JSON-LD like this:

    "@context": [
        "https://raw.githubusercontent.com/codemeta/codemeta/2.0/codemeta.jsonld",
        "https://raw.githubusercontent.com/schemaorg/schemaorg/main/data/releases/13.0/schemaorgcontext.jsonld",
        "https://w3id.org/software-types",
        "https://w3id.org/software-iodata"
    ],

While codemeta expects a context like this:

"@context": "https://doi.org/10.5063/schema/codemeta-2.0",

Is there any good reason for the first context list?
I know that contexts can usually be nested and complex, and codemeta.json would do well to allow for easy extensions or embedding.
While it does not, the harvester should put in the right context in my opinion, or at least the right reference to the codemeta schema.

Implement support for pyproject.toml

This is a continuation of #16

Currently either a setup.py is still needed, or the package must be installed. Newer projects, however, use a pyproject.toml and may use other build systems such as poetry. We want to be able to extract metadata (standardized in PEP-621) directly from pyproject.toml. It seems the current method using importlib.metadata does not provide any solution for this, so some additional logic is needed.

Allow support for multiple author libraries

It seems (though I haven't looked too deeply yet) that CodeMetaPy currently assumes that there is only a single author:

if key == "Author":
    humanname = HumanName(value.strip())
    author = {"@type": "Person", "givenName": humanname.first, "familyName": " ".join((humanname.middle, humanname.last)).strip()}

which results in

(codemeta-example) $ python -m pip install -q --upgrade pip setuptools wheel
(codemeta-example) $ python -m pip install -q codemetapy "pyhf==0.5.2"
(codemeta-example) $ python -m pip list | grep CodeMetaPy
CodeMetaPy         0.3.4
$ codemetapy --no-extras pyhf > codemeta.json
$ head -n 19 codemeta.json
{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "http://schema.org"
    ],
    "@type": "SoftwareSourceCode",
    "identifier": "pyhf",
    "name": "pyhf",
    "version": "0.5.2",
    "description": "(partial) pure Python HistFactory implementation",
    "license": "Apache, OSI Approved :: Apache Software License",
    "author": [
        {
            "@type": "Person",
            "givenName": "Matthew",
            "familyName": "Feickert Lukas Heinrich",
            "email": "[email protected], [email protected], [email protected]"
        }
    ],

Is it possible to try to determine whether there are multiple authors present? We could alternatively try to reformat the way the authors are listed in the metadata, though I think having them simply be comma separated isn't too uncommon.

Please let me know if you'd like any help on this in any way.
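A sketch of the comma-splitting approach the issue suggests (a hypothetical helper; the names in the example are invented, not taken from any package's metadata):

```python
import re

# Split a flattened multi-author string on commas and " and " before building
# per-person entries, instead of treating the whole string as one name.
def split_authors(value: str) -> list:
    return [name.strip() for name in re.split(r",| and ", value) if name.strip()]

authors = split_authors("Jane Doe, John Smith and Alice Roe")
# authors == ["Jane Doe", "John Smith", "Alice Roe"]
```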

Possible bug: Serialization to JSON is not deterministic

I'm using codemetapy as a linter with pre-commit and noticed that for some reason it fails sometimes because the codemeta.json file changed.

In my case, this is caused by my classifiers:

classifiers = [
    "Operating System :: POSIX :: Linux",
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Science/Research",
    "Intended Audience :: Developers",
]

Then codemetapy will randomly order them, sometimes resulting in:

    "audience": [
        {
            "@type": "Audience",
            "audienceType": "Science/Research"
        },
        {
            "@type": "Audience",
            "audienceType": "Developers"
        }
    ],

and sometimes in:

    "audience": [
        {
            "@type": "Audience",
            "audienceType": "Developers"
        },
        {
            "@type": "Audience",
            "audienceType": "Science/Research"
        }
    ],

Would it be possible to make sure that a consistent order is ensured?

As a workaround I currently add a global exclude: '^codemeta.json$' to my hook, which is not nice because it prevents my JSON linter from formatting the JSON.

Expected behavior

On the same input files exactly the same output file should be generated.
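Deterministic output can be approximated by canonicalising the data before dumping. A sketch, with an important caveat: order-significant lists (such as @list-style author lists) would need to be exempted from sorting:

```python
import json

# Recursively sort object keys and order value lists so repeated runs over
# the same input produce byte-identical JSON.
def canonicalise(value):
    if isinstance(value, dict):
        return {k: canonicalise(v) for k, v in sorted(value.items())}
    if isinstance(value, list):
        return sorted((canonicalise(v) for v in value), key=json.dumps)
    return value

a = {"audience": [{"audienceType": "Science/Research"}, {"audienceType": "Developers"}]}
b = {"audience": [{"audienceType": "Developers"}, {"audienceType": "Science/Research"}]}
assert json.dumps(canonicalise(a)) == json.dumps(canonicalise(b))
```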

json-ld parsing error

When a resource with an explicit URI occurs multiple times in the input, parsing of embedded structures goes wrong:

Example input. This entire structure occurs multiple times in the input:

{
            "@type": "Person",
            "affiliation": {
                "@id": "https://www.ru.nl/clst",
                "@type": "Organization",
                "name": "Centre for Language and Speech Technology",
                "parentOrganization": {
                    "@id": "https://www.ru.nl/cls",
                    "@type": "Organization",
                    "name": "Centre for Language Studies",
                    "parentOrganization": {
                        "@id": "https://www.ru.nl",
                        "@type": "Organization",
                        "location": {
                            "@type": "Place",
                            "name": "Nijmegen"
                        },
                        "name": "Radboud University",
                        "url": "https://www.ru.nl"
                    },
                    "url": "https://www.ru.nl/cls"
                },
                "url": "https://www.ru.nl/clst"
            }
}

But when parsing and reserialising it, the item that did NOT have a URI (location) gets a new stub one on each parse (which is hidden away again when serialised), and therefore we end up with something like:

{
            "@type": "Person",
            "affiliation": {
                "@id": "https://www.ru.nl/clst",
                "@type": "Organization",
                "name": "Centre for Language and Speech Technology",
                "parentOrganization": {
                    "@id": "https://www.ru.nl/cls",
                    "@type": "Organization",
                    "name": "Centre for Language Studies",
                    "parentOrganization": {
                        "@id": "https://www.ru.nl",
                        "@type": "Organization",
                        "location": [{
                            "@type": "Place",
                            "name": "Nijmegen"
                        },
                       {
                            "@type": "Place",
                            "name": "Nijmegen"
                        },
                        {
                            "@type": "Place",
                            "name": "Nijmegen"
                        }],
                        "name": "Radboud University",
                        "url": "https://www.ru.nl"
                    },
                    "url": "https://www.ru.nl/cls"
                },
                "url": "https://www.ru.nl/clst"
            }
}
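One hedged way to prevent these duplicates (a sketch, not how codemetapy actually works internally): instead of minting a fresh blank node on every parse, derive the stub identifier deterministically from the node's content, so repeated parses of the same Place resolve to one and the same node.

```python
import hashlib
import json

def stable_stub_id(node):
    """Derive a deterministic @id for a node that lacks one by hashing
    its canonical (key-sorted) JSON form. Re-parsing identical content
    then reuses the same stub instead of minting a new blank node."""
    digest = hashlib.sha1(
        json.dumps(node, sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    return "stub:" + digest

# The Place node from the example above always maps to the same stub:
place_id = stable_stub_id({"@type": "Place", "name": "Nijmegen"})
```

Key order does not matter, so two parses of the same JSON always agree on the identifier.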

Incorrect parsing of versions from dependencies if e.g. extras are stated to be installed

pyproject.toml:

[tool.poetry]
name = "dummy-project"
version = "0.1.0"
description = ""
authors = ["John Doe <[email protected]>"]
readme = "README.md"
packages = [{include = "dummy_project"}]

[tool.poetry.dependencies]
python = "^3.8"
pyproject-parser = "^0.9.0"
typer = {extras = ["all"], version = "^0.7.0"}
tomli = "^2.0.1"
codemetapy = "^2.5.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Log:

No input files specified, but found python project (pyproject.toml) in current dir, using that...
Note: You did not specify a --baseuri so we will not provide identifiers (IRIs) for your SoftwareSourceCode resources (and others)
Initial URI automatically generated, may be overriden later: file:///pyproject-toml
Processing source #1 of 1
Obtaining python package metadata for: pyproject.toml
Loading metadata from pyproject.toml via pyproject-parser
WARNING: No translation for distutils or pyproject.toml key readme
WARNING: No translation for distutils or pyproject.toml key packages
Found dependency python ^3.8
Found dependency pyproject-parser ^0.9.0
Found dependency typer {'extras': ['all'
Found dependency 'version': '^0.7.0'}
Found dependency tomli ^2.0.1
Found dependency codemetapy ^2.5.0
[CODEMETA COMPOSITION (dummy-project)] processed 48 new triples, total is now 49
[CODEMETA VALIDATION (dummy-project)] codeRepository not set
[CODEMETA VALIDATION (dummy-project)] license not set
[CODEMETA VALIDATION (dummy-project)] done
{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "https://w3id.org/software-iodata",
        "https://raw.githubusercontent.com/jantman/repostatus.org/master/badges/latest/ontology.jsonld",
        "https://schema.org",
        "https://w3id.org/software-types"
    ],
    "@type": "SoftwareSourceCode",
    "author": [
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Doe",
            "givenName": "John"
        }
    ],
    "description": "",
    "identifier": "dummy-project",
    "name": "dummy-project",
    "runtimePlatform": "Python 3",
    "softwareRequirements": [
        {
            "@type": "SoftwareApplication",
            "identifier": "'version':",
            "name": "'version':",
            "runtimePlatform": "Python 3",
            "version": "'^0.7.0'}"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "codemetapy",
            "name": "codemetapy",
            "runtimePlatform": "Python 3",
            "version": "^2.5.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pyproject-parser",
            "name": "pyproject-parser",
            "runtimePlatform": "Python 3",
            "version": "^0.9.0"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "python",
            "name": "python",
            "runtimePlatform": "Python 3",
            "version": "^3.8"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "tomli",
            "name": "tomli",
            "runtimePlatform": "Python 3",
            "version": "^2.0.1"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "typer",
            "name": "typer",
            "runtimePlatform": "Python 3",
            "version": "{'extras': ['all'"
        }
    ],
    "version": "0.1.0"
}

So apparently the parsing does not work if the dependencies are not simply strings but, e.g., have extras stated.

The solution would be to check whether each version specification is a string; if it is not, check whether it is a dict and use its version key.
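The suggested check could look roughly like this (a sketch, not codemetapy's actual parser code; the function name is hypothetical):

```python
def normalize_dependency(name, spec):
    """Return (name, version, extras) for a Poetry dependency spec,
    which may be a plain version string or a dict carrying
    'version' and/or 'extras' keys."""
    if isinstance(spec, str):
        return name, spec, []
    if isinstance(spec, dict):
        return name, spec.get("version", ""), spec.get("extras", [])
    raise TypeError("unsupported dependency spec for %s: %r" % (name, spec))

# The two forms from the pyproject.toml above:
plain = normalize_dependency("tomli", "^2.0.1")
with_extras = normalize_dependency("typer", {"extras": ["all"], "version": "^0.7.0"})
```

With this check the typer entry yields version "^0.7.0" instead of the stringified dict fragments seen in the log.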

Feature request: codemetapy does not extract maintainer from package.json or pyproject.toml

package.json has a maintainers list, and both the PEP 621 and Poetry flavours of pyproject.toml support listing project maintainers, which is quite important information.

codemeta apparently also has a maintainer field: https://codemeta.github.io/user-guide/

It would be nice if it would at least use the first maintainer from the list and add it as the codemeta maintainer (I don't know what codemeta allows, but actually being able to list all maintainers would be even better).
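Poetry lists maintainers as "Full Name <email>" strings, so a converter would first have to split those into Person objects. A rough sketch (the field mapping is an assumption, not codemetapy's actual crosswalk):

```python
import re

def parse_poetry_person(entry):
    """Turn a Poetry-style 'Full Name <email>' string into a schema.org
    Person dict; each result could then populate codemeta's maintainer
    property."""
    m = re.match(r"\s*(?P<name>[^<]+?)\s*(?:<(?P<email>[^>]+)>)?\s*$", entry)
    parts = m.group("name").split()
    person = {
        "@type": "Person",
        "givenName": " ".join(parts[:-1]) or parts[0],
        "familyName": parts[-1] if len(parts) > 1 else "",
    }
    if m.group("email"):
        person["email"] = m.group("email")
    return person
```

Example names and email here are placeholders, not the real project maintainers.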

codemetapy output is not static across identical runs

It seems that in codemetapy v2.0+ the output of subsequent runs is not deterministic. This makes it impossible to diff the output between runs (without sorting it manually yourself).

Example:

> docker run --rm -ti python:3.10 /bin/bash
root@68d255b7e087:/# python -m venv venv && . venv/bin/activate
(venv) root@68d255b7e087:/# python -m pip --quiet install --upgrade pip setuptools wheel
(venv) root@68d255b7e087:/# python -m pip --quiet install --pre 'pyhf==0.7.0rc4'
(venv) root@68d255b7e087:/# python -m pip --quiet install 'codemetapy==2.2.1'
(venv) root@68d255b7e087:/# codemetapy --inputtype python --no-extras pyhf > codemeta_run1.json
...
(venv) root@68d255b7e087:/# codemetapy --inputtype python --no-extras pyhf > codemeta_run2.json
...
(venv) root@68d255b7e087:/# apt update && apt install -y jq
(venv) root@68d255b7e087:/# diff <(jq -S .softwareRequirements codemeta_run1.json) <(jq -S .softwareRequirements codemeta_run2.json)
11,18d10
<     "@id": "/dependency/pyyaml-ge-5.1",
<     "@type": "SoftwareApplication",
<     "identifier": "pyyaml",
<     "name": "pyyaml",
<     "runtimePlatform": "Python 3",
<     "version": ">=5.1"
<   },
<   {
26a19,26
>     "@id": "/dependency/importlib-resources-ge-1.4.0",
>     "@type": "SoftwareApplication",
>     "identifier": "importlib-resources",
>     "name": "importlib-resources",
>     "runtimePlatform": "Python 3",
>     "version": ">=1.4.0"
>   },
>   {
35c35
<     "@id": "/dependency/scipy-ge-1.1.0",
---
>     "@id": "/dependency/tqdm-ge-4.56.0",
37,38c37,38
<     "identifier": "scipy",
<     "name": "scipy",
---
>     "identifier": "tqdm",
>     "name": "tqdm",
40c40
<     "version": ">=1.1.0"
---
>     "version": ">=4.56.0"
43c43
<     "@id": "/dependency/jsonschema-ge-4.15.0",
---
>     "@id": "/dependency/pyyaml-ge-5.1",
45,46c45,46
<     "identifier": "jsonschema",
<     "name": "jsonschema",
---
>     "identifier": "pyyaml",
>     "name": "pyyaml",
48c48
<     "version": ">=4.15.0"
---
>     "version": ">=5.1"
51c51
<     "@id": "/dependency/importlib-resources-ge-1.4.0",
---
>     "@id": "/dependency/scipy-ge-1.1.0",
53,54c53,54
<     "identifier": "importlib-resources",
<     "name": "importlib-resources",
---
>     "identifier": "scipy",
>     "name": "scipy",
56c56
<     "version": ">=1.4.0"
---
>     "version": ">=1.1.0"
59c59
<     "@id": "/dependency/tqdm-ge-4.56.0",
---
>     "@id": "/dependency/jsonschema-ge-4.15.0",
61,62c61,62
<     "identifier": "tqdm",
<     "name": "tqdm",
---
>     "identifier": "jsonschema",
>     "name": "jsonschema",
64c64
<     "version": ">=4.56.0"
---
>     "version": ">=4.15.0"
(venv) root@68d255b7e087:/#

In codemetapy v0.3.5 the output was byte-for-byte reproducible across runs. Is this something that could be supported again? Or should users sort the JSON manually if they want that?

For an example of how this affects workflows, cf. scikit-hep/pyhf#2002
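Until the tool itself emits stable output, a post-processing workaround is to canonicalise the JSON before diffing: sort object keys (what jq -S does) and additionally sort list-valued properties such as softwareRequirements, which jq -S leaves in arrival order. A sketch:

```python
import json

def stable_dump(doc):
    """Serialise a codemeta document deterministically: sort object
    keys and order list values by their canonical JSON form, so
    identical runs produce identical bytes."""
    def canonical(value):
        if isinstance(value, dict):
            return {k: canonical(v) for k, v in value.items()}
        if isinstance(value, list):
            return sorted((canonical(v) for v in value),
                          key=lambda v: json.dumps(v, sort_keys=True))
        return value
    return json.dumps(canonical(doc), indent=2, sort_keys=True)
```

Note that sorting all lists is only safe for set-like properties; order-significant lists would need to be exempted.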

Incorrect parsing of versions from dependencies if e.g. extras are stated to be installed

The fix in #42 apparently did not work, the problem still persists with codemetapy 2.5.1:

[tool.poetry]
name = "somesy"
version = "0.1.0"
description = "A CLI tool for synchronizing software project metadata."
authors = ["Mustafa Soylu <[email protected]>", "Anton Pirogov <[email protected]>"]
maintainers = ["Mustafa Soylu <[email protected]>"]
license = "MIT"

include = [
  "*.md", "LICENSE", "LICENSES", ".reuse/dep5", "CITATION.cff", "codemeta.json",
  { path = "mkdocs.yml", format = "sdist" },
  { path = "docs", format = "sdist" },
  { path = "tests", format = "sdist" },
]

[tool.poetry.dependencies]
python = "^3.8"
pydantic = {extras = ["email"], version = "^1.9.2"}
typer = {extras = ["all"], version = "^0.7.0"}

[tool.poetry.group.docs]
optional = true

[tool.poetry.group.docs.dependencies]
mkdocstrings = {extras = ["python"], version = "^0.21.2"}
markdown-exec = {extras = ["ansi"], version = "^1.6.0"}

Output:

Passed 1 files/sources but specified 0 input types! Automatically guessing types...
Detected input types: [('pyproject.toml', 'python')]
Note: You did not specify a --baseuri so we will not provide identifiers (IRIs) for your SoftwareSourceCode resources (and others)
Initial URI automatically generated, may be overriden later: file:///pyproject-toml
Processing source #1 of 1
Obtaining python package metadata for: pyproject.toml
Loading metadata from pyproject.toml via pyproject-parser
WARNING: No translation for distutils or pyproject.toml key include
Found dependency python ^3.8
Found dependency pydantic {'extras': ['email'
Found dependency 'version': '^1.9.2'}
Found dependency typer {'extras': ['all'
Found dependency 'version': '^0.7.0'}
WARNING: No translation for distutils or pyproject.toml key group
[CODEMETA COMPOSITION (somesy)] processed 50 new triples, total is now 51
[CODEMETA VALIDATION (somesy)] codeRepository not set
[CODEMETA VALIDATION (somesy)] done
{
    "@context": [
        "https://doi.org/10.5063/schema/codemeta-2.0",
        "https://w3id.org/software-iodata",
        "https://raw.githubusercontent.com/jantman/repostatus.org/master/badges/latest/ontology.jsonld",
        "https://schema.org",
        "https://w3id.org/software-types"
    ],
    "@type": "SoftwareSourceCode",
    "author": [
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Pirogov",
            "givenName": "Anton"
        },
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Soylu",
            "givenName": "Mustafa"
        }
    ],
    "description": "A CLI tool for synchronizing software project metadata.",
    "identifier": "somesy",
    "license": "http://spdx.org/licenses/MIT",
    "maintainer": {
        "@type": "Person",
        "email": "[email protected]",
        "familyName": "Soylu",
        "givenName": "Mustafa"
    },
    "name": "somesy",
    "runtimePlatform": "Python 3",
    "softwareRequirements": [
        {
            "@type": "SoftwareApplication",
            "identifier": "'version':",
            "name": "'version':",
            "runtimePlatform": "Python 3",
            "version": "'^1.9.2'}"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "'version':",
            "name": "'version':",
            "runtimePlatform": "Python 3",
            "version": "'^0.7.0'}"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "pydantic",
            "name": "pydantic",
            "runtimePlatform": "Python 3",
            "version": "{'extras': ['email'"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "python",
            "name": "python",
            "runtimePlatform": "Python 3",
            "version": "^3.8"
        },
        {
            "@type": "SoftwareApplication",
            "identifier": "typer",
            "name": "typer",
            "runtimePlatform": "Python 3",
            "version": "{'extras': ['all'"
        }
    ],
    "version": "0.1.0"
}

I had no option to re-open the existing issue, so I'm creating a new one.

graph creation of many entries fails with recursion depth error

I am not sure whether this is related to the number of JSON-LD files or to their content. The number of JSON-LD files is >1200.
Generating subgraphs of this data works, so I assume the quantity is the issue, also because it is a recursion error. But I have not yet tried whether I can create a subgraph for really ALL of these files.

codemetapy --graph codemeta_results/git_*/*/*/codemeta_*.json > graph.json
...
  File "/home//work/git/codemetapy/codemeta/serializers/jsonld.py", line 182, in embed_items
    return embed_items(itemmap[data[idkey]], itemmap, copy(history))
  File "/usr/lib/python3.8/copy.py", line 72, in copy
    cls = type(x)
RecursionError: maximum recursion depth exceeded while calling a Python object

So the current graph serializer does not scale. I have seen this with different sets of JSON-LD files, where it fails after about 2000 files, and the error occurs at a different file each time.

The default Python recursion depth is around 1000.
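Raising sys.setrecursionlimit is only a stopgap; the structural fix would be to embed without Python-level recursion. A rough sketch of an iterative alternative, assuming itemmap maps @id to JSON-LD node dicts (this is not the actual codemeta/serializers/jsonld.py code):

```python
from copy import deepcopy

def embed_items(root_id, itemmap):
    """Iteratively inline @id references: wherever a value is a pure
    reference {'@id': X} and X is known, replace it with a copy of that
    node, using an explicit work stack instead of recursion and a
    per-path id set to break cycles."""
    result = deepcopy(itemmap[root_id])
    stack = [(result, frozenset({root_id}))]  # (node to scan, ids on current path)
    while stack:
        node, path = stack.pop()
        for key, value in list(node.items()):
            values = value if isinstance(value, list) else [value]
            for i, v in enumerate(values):
                if not isinstance(v, dict):
                    continue
                ref = v.get("@id")
                if ref in itemmap and ref not in path and len(v) == 1:
                    v = deepcopy(itemmap[ref])
                    if isinstance(value, list):
                        value[i] = v
                    else:
                        node[key] = v
                    stack.append((v, path | {ref}))
                else:
                    stack.append((v, path))
    return result

itemmap = {
    "a": {"@id": "a", "name": "A", "child": {"@id": "b"}},
    "b": {"@id": "b", "name": "B", "parent": {"@id": "a"}},
}
embedded = embed_items("a", itemmap)
```

Cyclic references (here b pointing back to a) are left as plain {'@id': ...} stubs instead of recursing forever.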

Person IDs wrong for GitLab

codemetapy usually generates wrong person URLs.

For example, it creates

https://iffgit.fz-juelich.de/fleur/fleur/person/ingo-heimbach

instead of

https://iffgit.fz-juelich.de/ingo-heimbach

For GitLab instances the username usually comes right after the base URL; for GitHub this is also the case.

Also, it sometimes fails to parse the person at all, as for https://gitlab.desy.de/benjamin.bastian/mpl_styles, where it ends up with:

https://gitlab.desy.de/benjamin.bastian/mpl_styles/person/unknown
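The correct profile URL can be built from the host of the repository URL rather than from the repository path. A sketch (person_url is a hypothetical helper, not codemetapy API):

```python
from urllib.parse import urlsplit

def person_url(repo_url, username):
    """Build a user profile URL from the instance root of a
    GitHub/GitLab repository URL, since profiles live directly
    under the host, not under the project path."""
    parts = urlsplit(repo_url)
    return "%s://%s/%s" % (parts.scheme, parts.netloc, username)

url = person_url("https://iffgit.fz-juelich.de/fleur/fleur", "ingo-heimbach")
```

This yields https://iffgit.fz-juelich.de/ingo-heimbach for the example above.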

codemetapy fails to merge triples for the same person

File in1.json:

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "author": [
    {
      "@id": "https://orcid.org/0000-1234-5678-9101",
      "@type": "Person",
      "familyName": "Doe",
      "givenName": "John"
    }
  ],
  "codeRepository": "https://github.com/example/repository",
  "description": "an example",
  "name": "example",
  "version": "0.1.0"
}

File in2.json:

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "author": [
    {
      "email": "[email protected]",
      "@type": "Person",
      "familyName": "Doe",
      "givenName": "John"
    }
  ],
  "codeRepository": "https://github.com/example/repository",
  "description": "an example",
  "name": "example",
  "version": "0.1.0"
}

Run codemetapy in1.json in2.json

Expected result:

Person will have both email and orcid

Actual result:

Person has only email (when passed in this order) or only orcid (when passing in2.json before in1.json)
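For reference, the merged author one would expect from in1.json and in2.json carries both identifiers:

```json
{
  "@id": "https://orcid.org/0000-1234-5678-9101",
  "@type": "Person",
  "email": "[email protected]",
  "familyName": "Doe",
  "givenName": "John"
}
```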

Bug: codemetapy fails to add orcids to all contributors listed in a package.json

Input package.json:

{
  "name": "somesy",
  "version": "0.1.0",
  "description": "A cli tool for synchronizing CITATION.CFF with project files.",
  "keywords": [
    "metadata",
    "FAIR"
  ],
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "https://github.com/Materials-Data-Science-and-Informatics/somesy"
  },
  "homepage": "https://materials-data-science-and-informatics.github.io/somesy",
  "author": {
    "name": "Mustafa Soylu",
    "email": "[email protected]",
    "url": "https://orcid.org/0000-0003-2637-0432"
  },
  "contributors": [
    {
      "name": "Mustafa Soylu",
      "email": "[email protected]",
      "url": "https://orcid.org/0000-0003-2637-0432"
    },
    {
      "name": "Anton Pirogov",
      "email": "[email protected]",
      "url": "https://orcid.org/0000-0002-5077-7497"
    },
    {
      "name": "Jens Br\u00f6der",
      "email": "[email protected]",
      "url": "https://orcid.org/0000-0001-7939-226X"
    }
  ],
  "main": "index.js",
  "scripts": {
    "test": "echo \"No tests available\""
  },
  "dependencies": {
    "lodash": "^4.17.21",
    "axios": "^0.21.1"
  },
  "devDependencies": {
    "jest": "^27.0.6"
  }
}

Resulting codemeta.json:

[...]
    "contributor": [
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Soylu",
            "givenName": "Mustafa",
            "url": "https://orcid.org/0000-0003-2637-0432"
        },
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Pirogov",
            "givenName": "Anton"
        },
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Bröder",
            "givenName": "Jens"
        }
    ],
[...]

So the ORCID / Url is not preserved except for the first listed contributor.

Metadata from pypi, or internal

Just an idea:

Pypi has an API for the basic metadata, for example see: https://pypi.org/pypi/codemetapy/json

One could use that as a backup if direct parsing fails (though in that case there is probably also nothing on PyPI).

You use the original metadata parser from pkg and importlib_metadata, right?

One could also request the metadata for each Python dependency that way, to fill in the metadata of the dependencies required for codemeta, or parse the metadata from the dependencies directly (which would only work if they are installed, right?).
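The PyPI JSON API endpoint is just https://pypi.org/pypi/<name>/json. A sketch of the proposed fallback (field selection is an assumption; real code would need error handling and caching):

```python
import json
from urllib.request import urlopen

PYPI_URL = "https://pypi.org/pypi/{name}/json"

def fetch_pypi(name):
    """Fetch the raw payload from PyPI's public JSON API (needs network)."""
    with urlopen(PYPI_URL.format(name=name)) as resp:
        return json.load(resp)

def extract_basic_metadata(payload):
    """Pull the handful of fields a codemeta fallback would need
    from a PyPI JSON payload."""
    info = payload["info"]
    return {
        "name": info.get("name"),
        "description": info.get("summary"),
        "version": info.get("version"),
        "license": info.get("license"),
        "url": info.get("home_page"),
    }
```

Usage would be extract_basic_metadata(fetch_pypi("codemetapy")) whenever the local parsers come up empty.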

Many keys from pyproject.toml are not parsed

@proycon Thanks for you work!

given pyproject.toml

[project]
name = "project"
description = "Description."
dynamic = ['version']
authors = [{name = "author1", email = "[email protected]"}, 
{name = "author2", email = "[email protected]"},
{name = "author3", email = "author3@e-mail"}]
readme = "README.md"
license = {file = "LICENSE.txt"}
classifiers = [
        "Development Status :: 4 - Beta",
        "Intended Audience :: Information Technology",
        "Intended Audience :: Science/Research",
        "License :: OSI Approved :: MIT License",
        "Natural Language :: English",
        "Operating System :: POSIX :: Linux",
        "Operating System :: MacOS :: MacOS X",
        "Programming Language :: Python",
        "Programming Language :: Python :: 3.8",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "Programming Language :: Python :: 3.11",
        "Programming Language :: Python :: 3.12",
        "Topic :: Database :: Front-Ends",
        "Topic :: Education",
        "Topic :: Scientific/Engineering",
        "Topic :: Scientific/Engineering :: Information Analysis",
        "Topic :: Scientific/Engineering :: Visualization",

]
keywords = ["dashboard", "data visualization", "survey data", "categorical data", 
"interactive visualization", "bokeh", "panel"]
....
#requirements in poetry

This results in:

{
    "@context": [
        "https://raw.githubusercontent.com/codemeta/codemeta/2.0/codemeta.jsonld",
        "https://w3id.org/software-iodata",
        "https://raw.githubusercontent.com/schemaorg/schemaorg/main/data/releases/13.0/schemaorgcontext.jsonld",
        "https://w3id.org/software-types"
    ],
    "@id": "/project",
    "@type": "SoftwareSourceCode",
    "author": {
        "@id": "/person/author1",
        "@type": "Person",
        "email": "[email protected]",
        "familyName": "",
        "givenName": "author1"
    },
    "description": "Description",
    "identifier": "project",
    "license": "http://spdx.org/licenses/MIT",
    "name": "project",
    "runtimePlatform": [
        "Python 3",
        "Python 3.10",
        "Python 3.8",
        "Python 3.9"
    ],
    "softwareRequirements": [
        {
            "@id": "/dependency/bokeh-ge-2.4.3,-lt-3.0.0",
            "@type": "SoftwareApplication",
            "identifier": "bokeh",
            "name": "bokeh",
            "runtimePlatform": "Python 3",
            "version": ">=2.4.3,<3.0.0"
        },
        {
            "@id": "/dependency/pandas-ge-1.4.1,-lt-2.0.0",
            "@type": "SoftwareApplication",
            "identifier": "pandas",
            "name": "pandas",
            "runtimePlatform": "Python 3",
            "version": ">=1.4.1,<2.0.0"
        },
        {
            "@id": "/dependency/panel-ge-0.13.1,-lt-0.14.0",
            "@type": "SoftwareApplication",
            "identifier": "panel",
            "name": "panel",
            "runtimePlatform": "Python 3",
            "version": ">=0.13.1,<0.14.0"
        },
        {
            "@id": "/dependency/wordcloud-ge-1.8.2.2,-lt-2.0.0.0",
            "@type": "SoftwareApplication",
            "identifier": "wordcloud",
            "name": "wordcloud",
            "runtimePlatform": "Python 3",
            "version": ">=1.8.2.2,<2.0.0.0"
        }
    ],
    "version": "1.0.0"
}

So it missed the following keys:

  • most of the classifiers
  • the other authors and the names
  • the readme
  • the keywords
    (maybe it depends on the importlib_metadata version?)

I was not expecting it to get the requirements, but it nicely did. But it missed all optional requirements.

Do you plan to integrate features of codemetar, like parsing the README or CITATION.cff files?
The best data for authors will be in a CITATION.cff.

Is there a merge strategy? I.e., one could generate the codemeta.json from several sources (local repo code, GitHub API, different builds, etc.) and merge them in case a certain source yields more metadata. From the docs this seems to work when providing these sources at once, but can it also merge with an existing file (for the use case that the file was adapted manually because something does not work automatically, while everything else should be updated automatically on pre-commit)?

Bug: codemetapy incorrectly expands a url into a nested Person inside of a Person

Inputs:

package.json:

{
  "name": "somesy",
  "version": "0.1.0",
  "description": "A cli tool for synchronizing CITATION.CFF with project files.",
  "keywords": [
    "metadata",
    "FAIR"
  ],
  "license": "MIT",
  "repository": {
    "type": "git",
    "url": "https://github.com/Materials-Data-Science-and-Informatics/somesy"
  },
  "homepage": "https://materials-data-science-and-informatics.github.io/somesy",
  "author": {
    "name": "Mustafa Soylu",
    "email": "[email protected]",
    "url": "https://orcid.org/0000-0003-2637-0432"
  },
  "contributors": [
    {
      "name": "Mustafa Soylu",
      "email": "[email protected]",
      "url": "https://orcid.org/0000-0003-2637-0432"
    },
    {
      "name": "Anton Pirogov",
      "email": "[email protected]",
      "url": "https://orcid.org/0000-0002-5077-7497"
    },
    {
      "name": "Jens Br\u00f6der",
      "email": "[email protected]",
      "url": "https://orcid.org/0000-0001-7939-226X"
    }
  ],
  "main": "index.js",
  "scripts": {
    "test": "echo \"No tests available\""
  },
  "dependencies": {
    "lodash": "^4.17.21",
    "axios": "^0.21.1"
  },
  "devDependencies": {
    "jest": "^27.0.6"
  }
}

extra_codemeta.json:

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "author": [
    {
      "@id": "https://orcid.org/0000-0003-2637-0432",
      "@type": "Person",
      "familyName": "Soylu",
      "givenName": "Mustafa"
    },
    {
      "@id": "https://orcid.org/0000-0002-5077-7497",
      "@type": "Person",
      "familyName": "Pirogov",
      "givenName": "Anton"
    }
  ],
  "codeRepository": "https://github.com/Materials-Data-Science-and-Informatics/somesy",
  "description": "A cli tool for synchronizing CITATION.CFF with project files.",
  "keywords": [
    "metadata",
    "FAIR"
  ],
  "license": "https://spdx.org/licenses/MIT",
  "name": "somesy",
  "url": "https://materials-data-science-and-informatics.github.io/somesy",
  "version": "0.1.0"
}

Output (codemetapy package.json extra_codemeta.json):

[...]
    "contributor": [
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Soylu",
            "givenName": "Mustafa",
            "url": {
                "@id": "https://orcid.org/0000-0003-2637-0432",
                "@type": "Person",
                "familyName": "Soylu",
                "givenName": "Mustafa"
            }
        },
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Pirogov",
            "givenName": "Anton"
        },
        {
            "@type": "Person",
            "email": "[email protected]",
            "familyName": "Bröder",
            "givenName": "Jens"
        }
    ],
[...]

The fact that the Orcid / URL field vanishes for other contributors I reported in #45

but the other independent problem is that apparently something in the merge goes wrong - the url is expanded into another person object, leading to this weird and incorrect nested Person entry.

Expected: The url is not expanded into a nested Person; it should remain a string. Either map it to @id, or keep it as url. (A url could also be a homepage, which makes it a bad Person @id when merging from different sources, unless it is an ORCID.)
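The "map it to @id when it is an ORCID" option could look like this (a sketch of the proposed mapping, not codemetapy's actual behaviour):

```python
def person_with_id(person):
    """If a Person's url is an ORCID, promote it to @id (ORCIDs are
    stable person identifiers); otherwise keep url as a plain string
    so it is never expanded into a nested node."""
    person = dict(person)  # don't mutate the caller's dict
    url = person.get("url")
    if isinstance(url, str) and url.startswith("https://orcid.org/"):
        person["@id"] = person.pop("url")
    return person
```

With a shared @id, the JSON-LD merge would then unify the Person from both sources instead of nesting one inside the other.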

installation on macOS fails

pip install codemetapy fails with following message

Collecting codemetapy
  Downloading https://files.pythonhosted.org/packages/ad/7f/5b1c63961441b77cb0a1b266a2bed616697c89af785281371e8be072793c/CodeMetaPy-0.2.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/tmp/pip-install-bT1rfX/codemetapy/setup.py", line 22, in <module>
        long_description=read('README.rst'),
      File "/private/tmp/pip-install-bT1rfX/codemetapy/setup.py", line 10, in read
        return open(os.path.join(os.path.dirname(__file__), fname),'r',encoding='utf-8').read()
    TypeError: 'encoding' is an invalid keyword argument for this function
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/tmp/pip-install-bT1rfX/codemetapy/
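The TypeError indicates the install ran under Python 2, whose built-in open() does not accept an encoding argument (Python 3's does). A sketch of a version-agnostic read helper for setup.py, under the assumption that this is the cause (base_dir is an added parameter for illustration; the original helper uses the directory of setup.py):

```python
import io
import os

def read(fname, base_dir=None):
    """Read a file with an explicit encoding. io.open accepts
    encoding= on both Python 2 and 3, unlike the Python 2 built-in
    open that raised the TypeError above."""
    if base_dir is None:
        base_dir = os.path.dirname(os.path.abspath(__file__))
    with io.open(os.path.join(base_dir, fname), "r", encoding="utf-8") as f:
        return f.read()
```

Alternatively, running pip under Python 3 avoids the problem entirely.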

Add ability to filter out extras from setup.py

Hi. Thanks very much for making codemetapy; it has been quite useful. We're using it for our project pyhf, but we have quite a few extras in our setup.py to make it easy for us to set up different environments in CI and elsewhere. At the moment codemetapy tries to include information on all of our extras, which is not necessarily the desired output.

For example, in a new Python virtual environment if you do

(codemeta-example) $ python -m pip install -q --upgrade pip setuptools wheel
(codemeta-example) $ python -m pip install -q codemetapy "pyhf==0.5.2"
(codemeta-example) $ python -m pip list
Package            Version
------------------ -------
attrs              20.2.0
click              7.1.2
CodeMetaPy         0.3.2
importlib-metadata 2.0.0
jsonpatch          1.26
jsonpointer        2.0
jsonschema         3.2.0
nameparser         1.0.6
numpy              1.19.2
pip                20.2.3
pkg-resources      0.0.0
pyhf               0.5.2
pyrsistent         0.17.3
PyYAML             5.3.1
scipy              1.5.2
setuptools         50.3.0
six                1.15.0
tqdm               4.50.0
wheel              0.35.1
zipp               3.3.0
(codemeta-example) $ codemetapy pyhf > codemeta.json
Processing source #1 of 1
Obtaining python package metadata for: pyhf
Found metadata in  /home/feickert/.venvs/codemeta-example/lib/python3.7/site-packages/pyhf-0.5.2.dist-info
WARNING: No translation for distutils key Metadata-Version
WARNING: No translation for distutils key Project-URL
WARNING: No translation for distutils key Project-URL
WARNING: No translation for distutils key Project-URL
WARNING: No translation for distutils key Platform
WARNING: No translation for distutils key Requires-Python
WARNING: No translation for distutils key Description-Content-Type
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra
WARNING: No translation for distutils key Provides-Extra

You end up with a very long codemeta.json that contains redundancies, due to it not parsing ~= correctly and trying to deal with all the extras

(codemeta-example) $ wc -l codemeta.json
1600 codemeta.json

For comparison, the codemeta.json that I ended up creating after removing the entries from "softwareRequirements" that were from extras is

$ wc -l codemeta.json 
127 codemeta.json

If it would be possible to add a --no-extras flag or something along those lines of functionality that would be fantastic. Regardless, thank you for helping make software better!
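In the package metadata that codemetapy reads, extras show up as Requires-Dist entries carrying an 'extra == "..."' environment marker, so a --no-extras style filter could simply skip those. A heuristic sketch (not the actual implementation codemetapy later shipped):

```python
def is_extra_requirement(requires_dist_line):
    """True if a Requires-Dist entry only applies when an extra is
    requested, i.e. its environment marker (after ';') tests 'extra'."""
    _, _, marker = requires_dist_line.partition(";")
    return "extra ==" in marker

deps = [
    'scipy (>=1.1.0)',
    'tensorflow (>=2.0) ; extra == "tensorflow"',
]
core = [d for d in deps if not is_extra_requirement(d)]
```

A robust version would parse the marker properly (e.g. with the packaging library) instead of substring matching.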

Implement a proper test suite

Thus far we've done without a test suite, but that's clearly suboptimal and not sustainable. Unit and integration tests need to be written, and a continuous integration pipeline needs to be set up.

(Rough hour estimate: 20 hours)

Pyproject-based parser fails on some valid input files

Here is a valid pyproject.toml on which codemetapy fails, even though it should not:

[tool.poetry]
name = "dummy-project"
version = "0.1.0"
description = ""
authors = ["John Doe <[email protected]>"]
readme = "README.md"
packages = [{include = "dummy_project"}]

include = [
  # having both a string and an object here seems to trigger the problem:
  "CHANGELOG.md",
  { path = "tests", format = "sdist" },
  # ----
]

[tool.poetry.dependencies]
python = "^3.8"


[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Stacktrace for codemetapy pyproject.toml:

Passed 1 files/sources but specified 0 input types! Automatically guessing types...
Detected input types: [('pyproject.toml', 'python')]
Note: You did not specify a --baseuri so we will not provide identifiers (IRIs) for your SoftwareSourceCode resources (and others)
Initial URI automatically generated, may be overriden later: file:///pyproject.toml
Processing source #1 of 1
Obtaining python package metadata for: pyproject.toml
Loading metadata from pyproject.toml via pyproject-parser
Failed to process pyproject.toml via pyproject-parser: list index out of range
Fallback: Loading metadata from pyproject.toml via PEP517
Traceback (most recent call last):
  File "/home/a.pirogov/.local/bin/codemetapy", line 8, in <module>
    sys.exit(main())
  File "/local/home/a.pirogov/.local/pipx/venvs/codemetapy/lib/python3.8/site-packages/codemeta/codemeta.py", line 148, in main
    g, res, args, contextgraph = build(**args.__dict__)
  File "/local/home/a.pirogov/.local/pipx/venvs/codemetapy/lib/python3.8/site-packages/codemeta/codemeta.py", line 390, in build
    codemeta.parsers.python.parse_python(newgraph, res, source, crosswalk, args)
  File "/local/home/a.pirogov/.local/pipx/venvs/codemetapy/lib/python3.8/site-packages/codemeta/parsers/python.py", line 152, in parse_python
    packagename = pkg.name
AttributeError: 'PathDistribution' object has no attribute 'name'
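The crash site is parse_python's packagename = pkg.name: importlib.metadata's Distribution only grew a .name property in newer Python versions, so on older interpreters a PathDistribution has none, but the Name field of its metadata is always available. A sketch of a possible compatibility shim (not the actual fix):

```python
def distribution_name(pkg):
    """Return the distribution name even on importlib.metadata
    versions whose PathDistribution lacks the .name property,
    falling back to the Name field of the package metadata."""
    name = getattr(pkg, "name", None)
    return name if name is not None else pkg.metadata["Name"]
```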
