mapping-commons / sssom-py

Python toolkit for SSSOM mapping format

Home Page: https://mapping-commons.github.io/sssom-py/index.html#

License: MIT License

Topics: sssom, obofoundry, python, mappings, ontology-mappings, linkml

sssom-py's Introduction

Python Utilities for SSSOM


SSSOM (Simple Standard for Sharing Ontology Mappings) is a TSV and RDF/OWL standard for ontology mappings.

WARNING: The export formats (JSON, RDF) of sssom-py are not yet finalised! Please expect changes in future releases!

See https://github.com/OBOFoundry/SSSOM

This is a Python library and command-line toolkit for working with SSSOM. It also defines a schema for SSSOM.

Documentation

See the documentation: https://mapping-commons.github.io/sssom-py/index.html#

Deploy documentation

make sphinx
make deploy-docs

Schema

See the schema/ folder for the source schema in YAML, plus derivations to JSON-Schema, ShEx, etc.

Testing

tox is similar to make, but specific to Python software projects. Its configuration is stored in tox.ini in different "environments" whose headers look like [testenv:...]. All tests can be run with:

$ pip install tox
$ tox

A specific environment can be run using the -e flag, such as tox -e lint to run the linting environment.

Outstanding Contributors

Outstanding contributors are groups and institutions that have helped organise the SSSOM Python package's development, providing funding, advice, and infrastructure. We are very grateful for all your contributions - the project would not exist without you!

Harvard Medical School

Harvard Medical School Logo

The INDRA Lab, a part of the Laboratory of Systems Pharmacology and the Harvard Program in Therapeutic Science (HiTS), is interested in natural language processing and large-scale knowledge assembly. Their work on SSSOM is funded by the DARPA Young Faculty Award W911NF2010255 (PI: Benjamin M. Gyori).

https://indralab.github.io

sssom-py's People

Contributors

anitacaron, bgyori, cmungall, cthoyt, dependabot[bot], github-actions[bot], glass-ships, hrshdhgd, joeflack4, matentzn, syphax-bouazzouni


sssom-py's Issues

Create user docs with Sphinx or similar

I don't think the docstring of a command-line tool is really the place where people show examples. Most CLI-only tools have manpages, and SSSOM should also have user documentation with Sphinx. This is where usage and examples should go.

Originally posted by @cthoyt in #50 (comment)

Add a diff command

Given two mapping files, it would be useful to know which mappings are unique to each file and which are shared in common

Different criteria could be applied to determine when mappings count as unique. The most useful may be to compare distinct subject-object pairs, so that even if two files differ on predicate, confidence, or other mapping fields, we get a broad picture of how they differ.

It would be useful to get results back as an SSSOM file that is the merger of the two files, with each row annotated as to whether it occurs in file 1, file 2, or both.
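A minimal pandas sketch of this pair-based comparison; the column names follow the SSSOM TSV format, and the comment annotation column is illustrative, not part of sssom-py.

# Sketch: merge two mapping tables and annotate each distinct
# (subject_id, object_id) pair as occurring in file 1, file 2, or both.
import pandas as pd

def diff_mappings(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    pairs1 = set(zip(df1["subject_id"], df1["object_id"]))
    pairs2 = set(zip(df2["subject_id"], df2["object_id"]))
    merged = pd.concat([df1, df2], ignore_index=True).drop_duplicates(
        subset=["subject_id", "object_id"]
    )

    def label(row):
        pair = (row["subject_id"], row["object_id"])
        if pair in pairs1 and pair in pairs2:
            return "both"
        return "1" if pair in pairs1 else "2"

    merged["comment"] = merged.apply(label, axis=1)  # illustrative annotation column
    return merged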

What was the original idea to pass around contexts?

All throughout SSSOM we see this context parameter to functions. I have always kinda thought of it as the prefixmap. Is there anything else a user should be able to pass? The context should be coming from SSSOM itself, while the prefixes can be passed in - or am I missing something?

Fix the remaining black issues

You can run:

pip install tox
tox

To get the list of errors, currently:

sssom/util.py:301:17: W503 line break before binary operator
sssom/util.py:558:109: W291 trailing whitespace
sssom/util.py:604:74: BLK100 Black would make changes.
sssom/util.py:612:17: W503 line break before binary operator
sssom/util.py:613:17: W503 line break before binary operator
sssom/util.py:614:17: W503 line break before binary operator
sssom/util.py:632:13: W503 line break before binary operator
sssom/util.py:633:13: W503 line break before binary operator
sssom/util.py:644:13: W503 line break before binary operator
sssom/util.py:645:13: W503 line break before binary operator
sssom/util.py:646:13: W503 line break before binary operator
sssom/util.py:752:29: F541 f-string is missing placeholders
sssom/sparql_util.py:61:33: W291 trailing whitespace
sssom/sparql_util.py:64:21: W291 trailing whitespace
sssom/sparql_util.py:67:48: W291 trailing whitespace
sssom/__init__.py:1:1: F401 '.sssom_datamodel.Mapping' imported but unused
sssom/__init__.py:1:1: F401 '.sssom_datamodel.MappingSet' imported but unused
sssom/__init__.py:2:1: F401 '.sssom_datamodel.slots' imported but unused
sssom/__init__.py:3:1: F401 '.util.parse' imported but unused
sssom/__init__.py:3:1: F401 '.util.collapse' imported but unused
sssom/__init__.py:3:1: F401 '.util.dataframe_to_ptable' imported but unused
sssom/__init__.py:3:1: F401 '.util.filter_redundant_rows' imported but unused
sssom/__init__.py:3:1: F401 '.util.group_mappings' imported but unused
sssom/__init__.py:3:1: F401 '.util.compare_dataframes' imported but unused
sssom/parsers.py:445:9: W503 line break before binary operator
sssom/parsers.py:446:9: W503 line break before binary operator
sssom/parsers.py:600:21: W503 line break before binary operator
sssom/parsers.py:601:21: W503 line break before binary operator
sssom/cli.py:568:11: BLK100 Black would make changes.
sssom/cliques.py:20:5: F841 local variable 'm' is assigned to but never used
sssom/cliques.py:163:13: BLK100 Black would make changes.
sssom/cliques.py:163:17: E126 continuation line over-indented for hanging indent
tests/test_collapse.py:13:1: F401 'logging' imported but unused
tests/test_collapse.py:44:9: F841 local variable 'mappings' is assigned to but never used
tests/test_cli.py:157:13: F841 local variable 'out_file' is assigned to but never used

If we can clear these, that would be great. Also, I can see quite a few warnings just using the PyCharm code inspection that relate to some of the code in util.py; probably worth reviewing at some point.

Need write access

  • Need write access to the repo
  • Need instructions on how to publish on PyPI.

Error with linkml in `sssom validate`

I'm working on exporting Biomappings to SSSOM (biopragmatics/biomappings#45) but before I'm done, I want to use the sssom validate command to check my TSV is properly formatted. Unfortunately, it's causing the following error due to upstream issues in linkml:

(indra) cthoyt@galvac ~/d/biomappings> sssom validate docs/_data/biomappings.sssom.tsv 
Traceback (most recent call last):
  File "/Users/cthoyt/.virtualenvs/indra/bin/sssom", line 33, in <module>
    sys.exit(load_entry_point('sssom', 'console_scripts', 'sssom')())
  File "/Users/cthoyt/.virtualenvs/indra/bin/sssom", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/local/Cellar/[email protected]/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/local/Cellar/[email protected]/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 855, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/Users/cthoyt/dev/sssom-py/sssom/__init__.py", line 1, in <module>
    from .util import (
  File "/Users/cthoyt/dev/sssom-py/sssom/util.py", line 11, in <module>
    from sssom.datamodel_util import MappingSetDiff, EntityPair
  File "/Users/cthoyt/dev/sssom-py/sssom/datamodel_util.py", line 5, in <module>
    from sssom.sssom_document import MappingSetDocument
  File "/Users/cthoyt/dev/sssom-py/sssom/sssom_document.py", line 1, in <module>
    from .sssom_datamodel import MappingSet, Mapping, Entity
  File "/Users/cthoyt/dev/sssom-py/sssom/sssom_datamodel.py", line 16, in <module>
    from linkml.utils.slot import Slot
ModuleNotFoundError: No module named 'linkml.utils.slot'

I think sssom validate can accept a URL, so you could also do this to verify:

$ sssom validate https://github.com/biomappings/biomappings/raw/add-sssom-export/docs/_data/biomappings.sssom.tsv

unclosed stream causes many warnings

def read_csv(filename, comment='#', sep=','):
    lines = "".join([line for line in open(filename)
                     if not line.startswith(comment)])
    return pd.read_csv(StringIO(lines), sep=sep)

This should be rewritten (a possible fix is sketched below) so that we don't get:

/Users/matentzn/ws/sssom-py/sssom/datamodel_util.py:198: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/matentzn/ws/sssom-py/tests/data/basic.tsv' mode='r' encoding='UTF-8'>
  lines = "".join([line for line in open(filename)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
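A possible rewrite using a context manager so the file handle is always closed; this is a sketch, not the current implementation.

# Sketch of a fix: the "with" block closes the file handle, which silences the
# ResourceWarning, while keeping the leading-'#' comment filtering.
from io import StringIO
import pandas as pd

def read_csv(filename, comment="#", sep=","):
    with open(filename) as fh:
        lines = "".join(line for line in fh if not line.startswith(comment))
    return pd.read_csv(StringIO(lines), sep=sep)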

Clean up cli.py

Check this for some chaotic brain dump: https://github.com/mapping-commons/sssom-py/blob/master/cli.md

  • All methods apart from sssom parse should take embedded SSSOM as input, read properly using the internal model (fromTsv). No filetype inference is needed at all; the input MUST conform to embedded SSSOM (embedded means the metadata sits on top of the data frame).
  • Clean all CLI arguments to be uniform according to what is in cli.md above, or whatever you think is best and most understandable.
  • Add click metadata for help text (description of methods etc)

Potential problem with read_pandas: # in rows

Currently we are using

pd.read_csv(filename, comment='#', sep=sep).fillna("")

to read a pandas data frame. This can be risky due to the potential of hash symbols in the middle of rows. It would be more robust to manually skip rows that start with # and read the rest, as in the sketch below.
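A sketch of that manual skipping: only lines that begin with '#' are dropped, so a '#' inside a field value is preserved (passing comment='#' to pandas would truncate the row there). The tab separator default is an assumption based on SSSOM being TSV.

# Sketch: drop only lines that *start* with '#' rather than relying on
# pandas' comment='#', which would also cut rows at a mid-row '#'.
from io import StringIO
import pandas as pd

def read_pandas(filename, sep="\t"):  # tab default assumed (SSSOM is TSV)
    with open(filename) as fh:
        data = "".join(line for line in fh if not line.startswith("#"))
    return pd.read_csv(StringIO(data), sep=sep).fillna("")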

Create function to clean prefix map according to actual prefix usage in dataframe

Sometimes, there are more prefixes than necessary in the prefix map and we may want to clean them out.

I propose to add a method like this to MappingSetDataFrame:

    def clean_prefix_map(self):
        """Remove prefixes from the prefix map that are not used in the data frame."""
        prefixes_in_table = get_prefixes_used_in_table(self.df)
        new_prefixes = dict()
        missing_prefix = False
        for prefix in prefixes_in_table:
            if prefix in self.prefixmap:
                new_prefixes[prefix] = self.prefixmap[prefix]
            else:
                logging.warning(f"{prefix} is used in the data frame but does not exist in the prefix map")
                missing_prefix = True
        if not missing_prefix:
            self.prefixmap = new_prefixes
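The proposal above relies on a get_prefixes_used_in_table helper that is not shown; a possible sketch, assuming CURIEs in the subject_id, object_id, and predicate_id columns:

# Hypothetical helper assumed by the proposal above: collect the CURIE prefixes
# actually used in the mapping table's entity columns.
def get_prefixes_used_in_table(df, columns=("subject_id", "object_id", "predicate_id")):
    prefixes = set()
    for column in columns:
        if column in df.columns:
            for value in df[column].dropna():
                if ":" in str(value):
                    prefixes.add(str(value).split(":", 1)[0])
    return prefixes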

New command: diff

discussing with @matentzn on mapping call now

For each subject-object tuple in each source:

  • present in one but not the other
  • present in both
    • present in both and consistent values for other columns
    • present in both but with different values; e.g. confidence may be low in one and high in the other

Maybe think of this in the context of other set-wise operations: union, intersection.

Add a de-dupe command

Frequently an automated process will yield different edges for the same S,P pair. It could be convenient to collapse these. Rather than trying to compress all evidence into one row (not possible), simply ditch the redundant lines with lower confidence. Potentially preserve the information somehow, e.g. in comments. A sketch of this follows.
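A minimal pandas sketch, reading the pair in question as (subject_id, object_id) and assuming a numeric confidence column; both are assumptions for illustration.

# Sketch: for each (subject_id, object_id) pair keep only the
# highest-confidence row (assumes a numeric "confidence" column).
import pandas as pd

def dedupe(df: pd.DataFrame) -> pd.DataFrame:
    best = df.groupby(["subject_id", "object_id"])["confidence"].idxmax()
    return df.loc[best].reset_index(drop=True)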

Apply `black` code style

Even though it takes a bit of getting used to, the black code formatter is proving to be a godsend on projects with multiple authors that all write python with slightly different styles.

I'm using it in all of my collaborative projects now, and I'd suggest setting it up here. It's also possible to use a pre-commit hook so people can write code however they want, and it gets fixed automatically just before pushing to GitHub. Then you can also add a CI action to check that black has been applied before accepting PRs.

Let me know if you want me to add the config/docs that would help you get started. In the meantime, you could always see what happens when you do:

$ pip install black
$ black sssom/ tests/

New method: separate

Break an SSSOM file into multiple SSSOM files such that any individual SSSOM file contains only mappings between two distinct ontologies. For example:

mapping.sssom.tsv:

A:001 skos:exactMatch B:001
A:002 skos:exactMatch B:005
A:001 skos:exactMatch C:001

should be broken into:

mapping_a_b.sssom.tsv:

A:001 skos:exactMatch B:001
A:002 skos:exactMatch B:005

mapping_a_c.sssom.tsv:

A:001 skos:exactMatch C:001
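A minimal pandas sketch of such a split, grouping rows by the prefix pair of subject and object; the separate name and the file naming scheme are illustrative, not existing sssom-py API.

# Sketch (illustrative, not existing sssom-py API): split a mapping table
# into one sub-table per (subject prefix, object prefix) pair.
import pandas as pd

def separate(df: pd.DataFrame) -> dict:
    subj_prefix = df["subject_id"].str.split(":").str[0]
    obj_prefix = df["object_id"].str.split(":").str[0]
    return {
        pair: sub.reset_index(drop=True)
        for pair, sub in df.groupby([subj_prefix, obj_prefix])
    }

# Illustrative file naming for the split results:
# for (s, o), sub in separate(df).items():
#     sub.to_csv(f"mapping_{s.lower()}_{o.lower()}.sssom.tsv", sep="\t", index=False)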

Customizing mapping of ontologies to SSSOM, including making nodes for literals

When making an SSSOM file from an ontology, there are multiple points of configuration.

Rather than a complex configuration, I think it makes sense to have a standard ontology representation and a deterministic mapping. We would then do either pre-processing on the ontology (e.g. robot/sparql) or on the sssom file.

Another thing that would be very useful for mapping is to translate literals into nodes, potentially performing stemming etc. Then lexical matching becomes transitive closure over, e.g. T1 -> label -> T2. This has advantages for boomer as we get a combined lexical/synonym analysis, cc @balhoff -- however, this could be considered scope creep for sssom-py

MappingSetDataFrame should have a "merge" method

MappingSetDataFrame should have a "merge" method with two parameters:

def merge(self, msdf, reconcile=True):
    """Merge msdf into self.

    If reconcile=True, then dedupe (remove redundant lower-confidence mappings)
    and reconcile: if msdf contains a higher-confidence negative mapping,
    remove the lower-confidence positive one; if confidence is the same,
    prefer HumanCurated; if both are human curated, prefer the negative mapping.
    """
    ...

Enable piping/stdout streaming in sssom-py

@hrshdhgd and I are working on the SSSOM CLI and putting on the finishing touches. As a last major effort I would like to implement piping/stdout streaming so I can do stuff like:

sssom dedupe -i input.sssom.tsv | grep OMIM | sssom convert --output-format json | jq some-jq-filter > output.json.

Imagine these CLI methods:

def dedupe(input, output):
    ...

def convert(input, output, output_format):
    ...

What would be the right way to design dedupe and convert to enable that kind of piping?
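A minimal click sketch of one way to enable this (not the actual sssom-py CLI): defaulting the input and output file options to "-" makes click read from stdin and write to stdout, so the commands can be chained with pipes.

# Sketch only: click treats a default of "-" for File parameters as
# stdin (read mode) / stdout (write mode), which enables piping.
import click

@click.command()
@click.option("-i", "--input", "input_file", type=click.File("r"), default="-",
              help="Input SSSOM TSV; defaults to stdin.")
@click.option("-o", "--output", type=click.File("w"), default="-",
              help="Output stream; defaults to stdout.")
def dedupe(input_file, output):
    """Hypothetical dedupe command: real logic would parse, dedupe, and re-serialize."""
    for line in input_file:  # placeholder pass-through instead of real dedupe logic
        output.write(line)

if __name__ == "__main__":
    dedupe()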

Some missing msdf returns

  • from_alignment_minidom
  • from_dataframe
  • from_obographs
  • from_owl_graph
  • from_rdf_graph

I think these are all :) Some of these are not even implemented yet, but let's be proactive! :)

New command: filter

We need a filter command that allows us to easily subset a mapping table for the more common filter operations. The alternative is a powerful filter language that would end up reimplementing SQL, which we already have in DOSQL. So for the moment, these are my most important reasons to filter:

  • I want to get rid of unused prefixes in the prefix map
  • I want to get mappings only between mondo and do terms
  • I only want exact and broad matches, but not oboInOwl:hasDbXref

There are many more things I want, but I can do the rest with DOSQL.

I am thinking of something like this:

sssom filter input.sssom.tsv --prefixes "MP HP MONDO" --predicates "skos:exactMatch skos:broadMatch"

I am also ok with

sssom filter input.sssom.tsv --prefix MP --prefix HP --prefix MONDO --predicate_id skos:exactMatch --predicate_id skos:broadMatch

if that feels cleaner, but the selection logic in the prefix case should be conjunctive, i.e. the mapping should not have any prefix in the subject or object IDs other than the ones provided.

There is some case to be made for this design as well:

sssom filter input.sssom.tsv --subject_id MP:* --subject_id MP:* --object_id MP:* --predicate_id skos:exactMatch --predicate_id skos:broadMatch

Then the subject_id parameter (multivalued) should be conjunctive. Actually, I kinda like this last proposal now. Should it follow the same dynamic parameter logic as #280? A sketch of such a filter follows.
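A pandas sketch of the prefix filter combined with a predicate filter; the column names follow the SSSOM TSV format, and filter_mappings is a hypothetical helper, not an existing sssom-py function.

# Hypothetical sketch: keep rows whose subject and object prefixes are both in
# the allowed set and whose predicate is one of the allowed predicates.
import pandas as pd

def filter_mappings(df: pd.DataFrame, prefixes, predicates) -> pd.DataFrame:
    subject_ok = df["subject_id"].str.split(":").str[0].isin(prefixes)
    object_ok = df["object_id"].str.split(":").str[0].isin(prefixes)
    predicate_ok = df["predicate_id"].isin(predicates)
    return df[subject_ok & object_ok & predicate_ok].reset_index(drop=True)

# e.g. filter_mappings(df, {"MP", "HP", "MONDO"}, {"skos:exactMatch", "skos:broadMatch"})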

New method "reconcile"

We need a method that calls dedupe and then removes higher confidence negative mappings; see also #56.
