mapping-commons / sssom-py

Python toolkit for SSSOM mapping format

Home Page: https://mapping-commons.github.io/sssom-py/index.html#

License: MIT License

Topics: sssom, obofoundry, python, mappings, ontology-mappings, linkml

sssom-py's Introduction

Python Utilities for SSSOM


SSSOM (Simple Standard for Sharing Ontology Mappings) is a TSV and RDF/OWL standard for ontology mappings.

WARNING: The export formats (JSON, RDF) of sssom-py are not yet finalised! Please expect changes in future releases!

See https://github.com/OBOFoundry/SSSOM

This is a Python library and command-line toolkit for working with SSSOM. It also defines a schema for SSSOM.

Documentation

See the documentation: https://mapping-commons.github.io/sssom-py/index.html#

Deploy documentation

make sphinx
make deploy-docs

Schema

See the schema/ folder for the source schema in YAML, plus derivations to JSON-Schema, ShEx, etc.

Testing

tox is similar to make, but specific to Python software projects. Its configuration is stored in tox.ini in different "environments" whose headers look like [testenv:...]. All tests can be run with:

$ pip install tox
$ tox

A specific environment can be run using the -e flag, such as tox -e lint to run the linting environment.

Outstanding Contributors

Outstanding contributors are groups and institutions that have helped organise the SSSOM Python package's development, providing funding, advice, and infrastructure. We are very grateful for all your contributions - the project would not exist without you!

Harvard Medical School

Harvard Medical School Logo

The INDRA Lab, a part of the Laboratory of Systems Pharmacology and the Harvard Program in Therapeutic Science (HiTS), is interested in natural language processing and large-scale knowledge assembly. Their work on SSSOM is funded by the DARPA Young Faculty Award W911NF2010255 (PI: Benjamin M. Gyori).

https://indralab.github.io

sssom-py's People

Contributors

anitacaron, bgyori, cmungall, cthoyt, dependabot[bot], github-actions[bot], glass-ships, hrshdhgd, joeflack4, matentzn, syphax-bouazzouni


sssom-py's Issues

Create user docs with Sphinx or similar

I don't think the docstring of a command-line tool is really the place where people show examples. Most CLI-only tools have manpages, and SSSOM should also have user documentation with Sphinx. This is where usage and examples should go.

Originally posted by @cthoyt in #50 (comment)

Add a diff command

Given two mapping files, it would be useful to know which mappings are unique to each file and which are shared in common

Different criteria could be applied to determine when mappings count as unique. The most useful may be to compare distinct subject-object pairs, so that even if two files differ on predicate, confidence, or other mapping fields, we get a broad picture of how they differ.

It would be useful to get results back as an SSSOM file that is the merger of the two files, with each row annotated as to whether it occurs in file 1, file 2, or both.
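A minimal pandas sketch of this pair-based comparison; the column names follow the SSSOM TSV format, and the comment annotation column is illustrative, not part of sssom-py.

# Sketch: merge two mapping tables and annotate each distinct
# (subject_id, object_id) pair as occurring in file 1, file 2, or both.
import pandas as pd

def diff_mappings(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    pairs1 = set(zip(df1["subject_id"], df1["object_id"]))
    pairs2 = set(zip(df2["subject_id"], df2["object_id"]))
    merged = pd.concat([df1, df2], ignore_index=True).drop_duplicates(
        subset=["subject_id", "object_id"]
    )

    def label(row):
        pair = (row["subject_id"], row["object_id"])
        if pair in pairs1 and pair in pairs2:
            return "both"
        return "1" if pair in pairs1 else "2"

    merged["comment"] = merged.apply(label, axis=1)  # illustrative annotation column
    return merged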

What was the original idea to pass around contexts?

All throughout SSSOM we see this context parameter to functions. I have always kinda thought of it as the prefixmap. Is there anything else a user should be able to pass? The context should be coming from SSSOM itself, while the prefixes can be passed in - or am I missing something?

Fix the remaining black issues

You can run:

pip install tox
tox

To get the list of errors, currently:

sssom/util.py:301:17: W503 line break before binary operator
sssom/util.py:558:109: W291 trailing whitespace
sssom/util.py:604:74: BLK100 Black would make changes.
sssom/util.py:612:17: W503 line break before binary operator
sssom/util.py:613:17: W503 line break before binary operator
sssom/util.py:614:17: W503 line break before binary operator
sssom/util.py:632:13: W503 line break before binary operator
sssom/util.py:633:13: W503 line break before binary operator
sssom/util.py:644:13: W503 line break before binary operator
sssom/util.py:645:13: W503 line break before binary operator
sssom/util.py:646:13: W503 line break before binary operator
sssom/util.py:752:29: F541 f-string is missing placeholders
sssom/sparql_util.py:61:33: W291 trailing whitespace
sssom/sparql_util.py:64:21: W291 trailing whitespace
sssom/sparql_util.py:67:48: W291 trailing whitespace
sssom/__init__.py:1:1: F401 '.sssom_datamodel.Mapping' imported but unused
sssom/__init__.py:1:1: F401 '.sssom_datamodel.MappingSet' imported but unused
sssom/__init__.py:2:1: F401 '.sssom_datamodel.slots' imported but unused
sssom/__init__.py:3:1: F401 '.util.parse' imported but unused
sssom/__init__.py:3:1: F401 '.util.collapse' imported but unused
sssom/__init__.py:3:1: F401 '.util.dataframe_to_ptable' imported but unused
sssom/__init__.py:3:1: F401 '.util.filter_redundant_rows' imported but unused
sssom/__init__.py:3:1: F401 '.util.group_mappings' imported but unused
sssom/__init__.py:3:1: F401 '.util.compare_dataframes' imported but unused
sssom/parsers.py:445:9: W503 line break before binary operator
sssom/parsers.py:446:9: W503 line break before binary operator
sssom/parsers.py:600:21: W503 line break before binary operator
sssom/parsers.py:601:21: W503 line break before binary operator
sssom/cli.py:568:11: BLK100 Black would make changes.
sssom/cliques.py:20:5: F841 local variable 'm' is assigned to but never used
sssom/cliques.py:163:13: BLK100 Black would make changes.
sssom/cliques.py:163:17: E126 continuation line over-indented for hanging indent
tests/test_collapse.py:13:1: F401 'logging' imported but unused
tests/test_collapse.py:44:9: F841 local variable 'mappings' is assigned to but never used
tests/test_cli.py:157:13: F841 local variable 'out_file' is assigned to but never used

If we can clear these, that would be great. Also, I can see quite a few warnings just using the PyCharm code inspection that relate to some of the code in util.py; probably worth reviewing at some point.

Need write access

  • Need write access to the repo
  • Need instructions on how to publish on PyPI.

Error with linkml in `sssom validate`

I'm working on exporting Biomappings to SSSOM (biopragmatics/biomappings#45) but before I'm done, I want to use the sssom validate command to check my TSV is properly formatted. Unfortunately, it's causing the following error due to upstream issues in linkml:

(indra) cthoyt@galvac ~/d/biomappings> sssom validate docs/_data/biomappings.sssom.tsv 
Traceback (most recent call last):
  File "/Users/cthoyt/.virtualenvs/indra/bin/sssom", line 33, in <module>
    sys.exit(load_entry_point('sssom', 'console_scripts', 'sssom')())
  File "/Users/cthoyt/.virtualenvs/indra/bin/sssom", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/local/Cellar/[email protected]/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/local/Cellar/[email protected]/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 855, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/Users/cthoyt/dev/sssom-py/sssom/__init__.py", line 1, in <module>
    from .util import (
  File "/Users/cthoyt/dev/sssom-py/sssom/util.py", line 11, in <module>
    from sssom.datamodel_util import MappingSetDiff, EntityPair
  File "/Users/cthoyt/dev/sssom-py/sssom/datamodel_util.py", line 5, in <module>
    from sssom.sssom_document import MappingSetDocument
  File "/Users/cthoyt/dev/sssom-py/sssom/sssom_document.py", line 1, in <module>
    from .sssom_datamodel import MappingSet, Mapping, Entity
  File "/Users/cthoyt/dev/sssom-py/sssom/sssom_datamodel.py", line 16, in <module>
    from linkml.utils.slot import Slot
ModuleNotFoundError: No module named 'linkml.utils.slot'

I think sssom validate can accept a URL, so you could also do this to verify:

$ sssom validate https://github.com/biomappings/biomappings/raw/add-sssom-export/docs/_data/biomappings.sssom.tsv

unclosed stream causes many warnings

def read_csv(filename, comment='#', sep=','):
    lines = "".join([line for line in open(filename)
                     if not line.startswith(comment)])
    return pd.read_csv(StringIO(lines), sep=sep)

This should be rewritten (a possible fix is sketched below) so that we don't get:

/Users/matentzn/ws/sssom-py/sssom/datamodel_util.py:198: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/matentzn/ws/sssom-py/tests/data/basic.tsv' mode='r' encoding='UTF-8'>
  lines = "".join([line for line in open(filename)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
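A possible rewrite using a context manager so the file handle is always closed; this is a sketch, not the current implementation.

# Sketch of a fix: the "with" block closes the file handle, which silences the
# ResourceWarning, while keeping the leading-'#' comment filtering.
from io import StringIO
import pandas as pd

def read_csv(filename, comment="#", sep=","):
    with open(filename) as fh:
        lines = "".join(line for line in fh if not line.startswith(comment))
    return pd.read_csv(StringIO(lines), sep=sep)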

Clean up cli.py

Check this for some chaotic brain dump: https://github.com/mapping-commons/sssom-py/blob/master/cli.md

  • All methods apart from sssom parse should take embedded SSSOM as input, read properly using the internal model (fromTsv). No filetype inference is needed at all; the input MUST conform to embedded SSSOM (embedded means the metadata sits on top of the data frame).
  • Clean all CLI arguments to be uniform according to what is in cli.md above, or whatever you think is best and most understandable.
  • Add click metadata for help text (description of methods etc)

Potential problem with read_pandas: # in rows

Currently we are using

pd.read_csv(filename, comment='#', sep=sep).fillna("")

to read a pandas data frame. This can be risky due to the potential of hash symbols in the middle of rows. It would be more robust to manually skip rows that start with # and read the rest, as in the sketch below.
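A sketch of that manual skipping: only lines that begin with '#' are dropped, so a '#' inside a field value is preserved (passing comment='#' to pandas would truncate the row there). The tab separator default is an assumption based on SSSOM being TSV.

# Sketch: drop only lines that *start* with '#' rather than relying on
# pandas' comment='#', which would also cut rows at a mid-row '#'.
from io import StringIO
import pandas as pd

def read_pandas(filename, sep="\t"):  # tab default assumed (SSSOM is TSV)
    with open(filename) as fh:
        data = "".join(line for line in fh if not line.startswith("#"))
    return pd.read_csv(StringIO(data), sep=sep).fillna("")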

Create function to clean prefix map according to actual prefix usage in dataframe

Sometimes, there are more prefixes than necessary in the prefix map and we may want to clean them out.

I propose to add a method like this to MappingSetDataFrame:

    def clean_prefix_map(self):
        """Remove prefixes from the prefix map that are not used in the data frame."""
        prefixes_in_table = get_prefixes_used_in_table(self.df)
        new_prefixes = dict()
        missing_prefix = False
        for prefix in prefixes_in_table:
            if prefix in self.prefixmap:
                new_prefixes[prefix] = self.prefixmap[prefix]
            else:
                logging.warning(f"{prefix} is used in the data frame but does not exist in the prefix map")
                missing_prefix = True
        if not missing_prefix:
            self.prefixmap = new_prefixes
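The proposal above relies on a get_prefixes_used_in_table helper that is not shown; a possible sketch, assuming CURIEs in the subject_id, object_id, and predicate_id columns:

# Hypothetical helper assumed by the proposal above: collect the CURIE prefixes
# actually used in the mapping table's entity columns.
def get_prefixes_used_in_table(df, columns=("subject_id", "object_id", "predicate_id")):
    prefixes = set()
    for column in columns:
        if column in df.columns:
            for value in df[column].dropna():
                if ":" in str(value):
                    prefixes.add(str(value).split(":", 1)[0])
    return prefixes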

New command: diff

discussing with @matentzn on mapping call now

For each subject-object tuple in each source:

  • present in one but not the other
  • present in both
    • present in both and consistent values for other columns
    • present in both but with different values; e.g. confidence may be low in one and high in the other

Maybe think of this in the context of other set-wise operations: union, intersection.

Add a de-dupe command

Frequently an automated process will yield different edges for the same S,P pair. It could be convenient to collapse these. Rather than trying to compress all evidence into one row (not possible), simply ditch the redundant lines with lower confidence. Potentially preserve the information somehow, e.g. in comments. A sketch of this follows.
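A minimal pandas sketch, reading the pair in question as (subject_id, object_id) and assuming a numeric confidence column; both are assumptions for illustration.

# Sketch: for each (subject_id, object_id) pair keep only the
# highest-confidence row (assumes a numeric "confidence" column).
import pandas as pd

def dedupe(df: pd.DataFrame) -> pd.DataFrame:
    best = df.groupby(["subject_id", "object_id"])["confidence"].idxmax()
    return df.loc[best].reset_index(drop=True)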

Apply `black` code style

Even though it takes a bit of getting used to, the black code formatter is proving to be a godsend on projects with multiple authors that all write python with slightly different styles.

I'm using it in all of my collaborative projects now, and I'd suggest setting it up here. It's also possible to use a pre-commit hook so people can write code however they want, and it gets fixed automatically just before pushing to GitHub. Then you can also add a CI action to check that black has been applied before accepting PRs.

Let me know if you want me to add the config/docs that would help you get started. In the meantime, you could always see what happens when you do:

$ pip install black
$ black sssom/ tests/

New method: separate

Break an SSSOM file into multiple SSSOM files such that any individual SSSOM file contains only mappings between two distinct ontologies. For example:

mapping.sssom.tsv:

A:001 skos:exactMatch B:001
A:002 skos:exactMatch B:005
A:001 skos:exactMatch C:001

should be broken into:

mapping_a_b.sssom.tsv:

A:001 skos:exactMatch B:001
A:002 skos:exactMatch B:005

mapping_a_c.sssom.tsv:

A:001 skos:exactMatch C:001
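A minimal pandas sketch of such a split, grouping rows by the prefix pair of subject and object; the separate name and the file naming scheme are illustrative, not existing sssom-py API.

# Sketch (illustrative, not existing sssom-py API): split a mapping table
# into one sub-table per (subject prefix, object prefix) pair.
import pandas as pd

def separate(df: pd.DataFrame) -> dict:
    subj_prefix = df["subject_id"].str.split(":").str[0]
    obj_prefix = df["object_id"].str.split(":").str[0]
    return {
        pair: sub.reset_index(drop=True)
        for pair, sub in df.groupby([subj_prefix, obj_prefix])
    }

# Illustrative file naming for the split results:
# for (s, o), sub in separate(df).items():
#     sub.to_csv(f"mapping_{s.lower()}_{o.lower()}.sssom.tsv", sep="\t", index=False)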

Customizing mapping of ontologies to SSSOM, including making nodes for literals

When making an SSSOM file from an ontology, there are multiple points of configuration.

Rather than a complex configuration, I think it makes sense to have a standard ontology representation and a deterministic mapping. We would then do either pre-processing on the ontology (e.g. robot/sparql) or on the sssom file.

Another thing that would be very useful for mapping is to translate literals into nodes, potentially performing stemming etc. Then lexical matching becomes transitive closure over, e.g. T1 -> label -> T2. This has advantages for boomer as we get a combined lexical/synonym analysis, cc @balhoff -- however, this could be considered scope creep for sssom-py

MappingSetDataFrame should have a "merge" method

MappingSetDataFrame should have a "merge" method with two parameters:

def merge(self, msdf, reconcile=True):
    """Merge msdf into self.

    If reconcile=True, then dedupe (remove redundant lower-confidence mappings)
    and reconcile: if msdf contains a higher-confidence negative mapping,
    remove the lower-confidence positive one; if confidence is the same,
    prefer HumanCurated; if both are human curated, prefer the negative mapping.
    """
    ...

Enable piping/stdout streaming in sssom-py

@hrshdhgd and I are working on the SSSOM CLI and putting on the finishing touches. As a last major effort I would like to implement piping/stdout streaming so I can do stuff like:

sssom dedupe -i input.sssom.tsv | grep OMIM | sssom convert --output-format json | jq some-jq-filter > output.json.

Imagine these CLI methods:

def dedupe(input, output):
    ...

def convert(input, output, output_format):
    ...

What would be the right way to design dedupe and convert to enable that kind of piping?
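A minimal click sketch of one way to enable this (not the actual sssom-py CLI): defaulting the input and output file options to "-" makes click read from stdin and write to stdout, so the commands can be chained with pipes.

# Sketch only: click treats a default of "-" for File parameters as
# stdin (read mode) / stdout (write mode), which enables piping.
import click

@click.command()
@click.option("-i", "--input", "input_file", type=click.File("r"), default="-",
              help="Input SSSOM TSV; defaults to stdin.")
@click.option("-o", "--output", type=click.File("w"), default="-",
              help="Output stream; defaults to stdout.")
def dedupe(input_file, output):
    """Hypothetical dedupe command: real logic would parse, dedupe, and re-serialize."""
    for line in input_file:  # placeholder pass-through instead of real dedupe logic
        output.write(line)

if __name__ == "__main__":
    dedupe()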

Some missing msdf returns

  • from_alignment_minidom
  • from_dataframe
  • from_obographs
  • from_owl_graph
  • from_rdf_graph

I think these are all :) Some of these are not even implemented yet, but let's be proactive! :)

New command: filter

We need a filter command that allows us to easily subset a mapping table for the more common filter operations. The alternative is a powerful filter language that would end up reimplementing SQL, which we already have in DOSQL. So for the moment, these are my most important reasons to filter:

  • I want to get rid of unused prefixes in the prefix map
  • I want to get mappings only between mondo and do terms
  • I only want exact and broad matches, but not oboInOwl:hasDbXref

There are many more things I want, but I can do the rest with DOSQL.

I am thinking of something like this:

sssom filter input.sssom.tsv --prefixes "MP HP MONDO" --predicates "skos:exactMatch skos:broadMatch"

I am also ok with

sssom filter input.sssom.tsv --prefix MP --prefix HP --prefix MONDO --predicate_id skos:exactMatch --predicate_id skos:broadMatch

if that feels cleaner, but the selection logic in the prefix case should be conjunctive, i.e. the mapping should not have any prefix in the subject or object IDs other than the ones provided.

There is some case to be made for this design as well:

sssom filter input.sssom.tsv --subject_id MP:* --subject_id MP:* --object_id MP:* --predicate_id skos:exactMatch --predicate_id skos:broadMatch

Then the subject_id parameter (multivalued) should be conjunctive. Actually, I kinda like this last proposal now. Should it follow the same dynamic parameter logic as #280? A sketch of such a filter follows.
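A pandas sketch of the prefix filter combined with a predicate filter; the column names follow the SSSOM TSV format, and filter_mappings is a hypothetical helper, not an existing sssom-py function.

# Hypothetical sketch: keep rows whose subject and object prefixes are both in
# the allowed set and whose predicate is one of the allowed predicates.
import pandas as pd

def filter_mappings(df: pd.DataFrame, prefixes, predicates) -> pd.DataFrame:
    subject_ok = df["subject_id"].str.split(":").str[0].isin(prefixes)
    object_ok = df["object_id"].str.split(":").str[0].isin(prefixes)
    predicate_ok = df["predicate_id"].isin(predicates)
    return df[subject_ok & object_ok & predicate_ok].reset_index(drop=True)

# e.g. filter_mappings(df, {"MP", "HP", "MONDO"}, {"skos:exactMatch", "skos:broadMatch"})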

New method "reconcile"

We need a method that calls dedupe and then removes higher confidence negative mappings; see also #56.
