mapping-commons / sssom-py
Python toolkit for SSSOM mapping format
Home Page: https://mapping-commons.github.io/sssom-py/index.html#
License: MIT License
To run the code checks locally, you can run:
pip install tox
tox
This currently produces the following list of errors:
sssom/util.py:301:17: W503 line break before binary operator
sssom/util.py:558:109: W291 trailing whitespace
sssom/util.py:604:74: BLK100 Black would make changes.
sssom/util.py:612:17: W503 line break before binary operator
sssom/util.py:613:17: W503 line break before binary operator
sssom/util.py:614:17: W503 line break before binary operator
sssom/util.py:632:13: W503 line break before binary operator
sssom/util.py:633:13: W503 line break before binary operator
sssom/util.py:644:13: W503 line break before binary operator
sssom/util.py:645:13: W503 line break before binary operator
sssom/util.py:646:13: W503 line break before binary operator
sssom/util.py:752:29: F541 f-string is missing placeholders
sssom/sparql_util.py:61:33: W291 trailing whitespace
sssom/sparql_util.py:64:21: W291 trailing whitespace
sssom/sparql_util.py:67:48: W291 trailing whitespace
sssom/__init__.py:1:1: F401 '.sssom_datamodel.Mapping' imported but unused
sssom/__init__.py:1:1: F401 '.sssom_datamodel.MappingSet' imported but unused
sssom/__init__.py:2:1: F401 '.sssom_datamodel.slots' imported but unused
sssom/__init__.py:3:1: F401 '.util.parse' imported but unused
sssom/__init__.py:3:1: F401 '.util.collapse' imported but unused
sssom/__init__.py:3:1: F401 '.util.dataframe_to_ptable' imported but unused
sssom/__init__.py:3:1: F401 '.util.filter_redundant_rows' imported but unused
sssom/__init__.py:3:1: F401 '.util.group_mappings' imported but unused
sssom/__init__.py:3:1: F401 '.util.compare_dataframes' imported but unused
sssom/parsers.py:445:9: W503 line break before binary operator
sssom/parsers.py:446:9: W503 line break before binary operator
sssom/parsers.py:600:21: W503 line break before binary operator
sssom/parsers.py:601:21: W503 line break before binary operator
sssom/cli.py:568:11: BLK100 Black would make changes.
sssom/cliques.py:20:5: F841 local variable 'm' is assigned to but never used
sssom/cliques.py:163:13: BLK100 Black would make changes.
sssom/cliques.py:163:17: E126 continuation line over-indented for hanging indent
tests/test_collapse.py:13:1: F401 'logging' imported but unused
tests/test_collapse.py:44:9: F841 local variable 'mappings' is assigned to but never used
tests/test_cli.py:157:13: F841 local variable 'out_file' is assigned to but never used
If we can clear these, that would be great. Also, I can see quite a few warnings just using the PyCharm code inspection that relate to some of the code in util.py; probably worth reviewing at some point.
I think these are all :) Some of these are not even implemented yet, but let's be proactive! :)
It would be useful to be able to filter mappings. Rather than lots of specific commands, this could be done generically with SQL or pandas filters.
Given two mapping files, it would be useful to know which mappings are unique to each file and which are shared in common
Different criteria could be applied to determine what counts as a unique mapping. The most useful may be to compare distinct (subject, object) pairs, so that even if two files differ on predicate, confidence, or other mapping fields, we get a broad picture of how they differ.
It would be useful to get the results back as an SSSOM file that is the merger of the two files, with each row annotated as to whether it occurs in file 1, file 2, or both.
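A minimal sketch of this comparison with pandas (the helper name `diff_mappings` and the column choices are assumptions; pandas' merge indicator column gives exactly the file-1 / file-2 / both annotation):

```python
import pandas as pd

def diff_mappings(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    """Outer-merge two mapping frames on distinct (subject, object) pairs
    and annotate each row with where it occurs (1, 2, or both)."""
    keys = ["subject_id", "object_id"]
    merged = df1[keys].drop_duplicates().merge(
        df2[keys].drop_duplicates(), on=keys, how="outer", indicator="comment"
    )
    # translate pandas' indicator values into the 1 / 2 / both annotation
    merged["comment"] = merged["comment"].map(
        {"left_only": "1", "right_only": "2", "both": "both"}
    )
    return merged

df1 = pd.DataFrame({"subject_id": ["A:001", "A:002"],
                    "object_id": ["B:001", "B:005"]})
df2 = pd.DataFrame({"subject_id": ["A:001"], "object_id": ["B:001"]})
diff = diff_mappings(df1, df2)  # A:001-B:001 is shared, A:002-B:005 is file-1-only
```

The merged frame is itself a valid starting point for the annotated SSSOM output described above.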
No functional code should be in cli.py (apart from file extension inference if necessary). Everything else should go into io.py for now, ideally using the exact same name.
Break an SSSOM file into multiple SSSOM files such that any individual SSSOM file contains only mappings between two distinct ontologies. For example:
mapping.sssom.tsv:
A:001 skos:exactMatch B:001
A:002 skos:exactMatch B:005
A:001 skos:exactMatch C:001
should be broken into:
mapping_a_b.sssom.tsv:
A:001 skos:exactMatch B:001
A:002 skos:exactMatch B:005
mapping_a_c.sssom.tsv:
A:001 skos:exactMatch C:001
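A pandas sketch of this split (assuming the prefix is everything before the first colon of a CURIE; the helper name is made up):

```python
import pandas as pd

def split_by_prefix_pair(df: pd.DataFrame) -> dict:
    """Split a mapping frame into one frame per (subject prefix, object prefix) pair."""
    # the prefix is the CURIE text before the first colon
    subj_prefix = df["subject_id"].str.split(":", n=1).str[0]
    obj_prefix = df["object_id"].str.split(":", n=1).str[0]
    return {pair: group for pair, group in df.groupby([subj_prefix, obj_prefix])}

df = pd.DataFrame({
    "subject_id": ["A:001", "A:002", "A:001"],
    "predicate_id": ["skos:exactMatch"] * 3,
    "object_id": ["B:001", "B:005", "C:001"],
})
parts = split_by_prefix_pair(df)  # keys: ("A", "B") and ("A", "C")
```

Each resulting frame could then be written out as e.g. mapping_a_b.sssom.tsv.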
Check this for some chaotic brain dump: https://github.com/mapping-commons/sssom-py/blob/master/cli.md
sssom parse
should take as input embedded SSSOM which is read properly using the internal model (fromTsv). No filetype inference is needed at all; the input MUST conform to embedded SSSOM (embedded means: the metadata is on top of the data frame). The current reader is:

def read_csv(filename, comment='#', sep=','):
    lines = "".join([line for line in open(filename)
                     if not line.startswith(comment)])
    return pd.read_csv(StringIO(lines), sep=sep)
This should be rewritten so that we don't get:
/Users/matentzn/ws/sssom-py/sssom/datamodel_util.py:198: ResourceWarning: unclosed file <_io.TextIOWrapper name='/Users/matentzn/ws/sssom-py/tests/data/basic.tsv' mode='r' encoding='UTF-8'>
lines = "".join([line for line in open(filename)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
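One possible rewrite that keeps the behavior but closes the file deterministically (a with block releases the handle, which silences the ResourceWarning):

```python
from io import StringIO
import pandas as pd

def read_csv(filename, comment="#", sep=","):
    # Opening the file in a context manager guarantees it is closed,
    # unlike the bare open() inside the original list comprehension.
    with open(filename) as fh:
        lines = "".join(line for line in fh if not line.startswith(comment))
    return pd.read_csv(StringIO(lines), sep=sep)
```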
There are currently 4 or more ways to read a dataframe. We should reconcile them into one.
Sometimes, there are more prefixes than necessary in the prefix map and we may want to clean them out.
I propose to add a method like this to MappingSetDataFrame:
def clean_prefix_map(self):
    prefixes_in_map = get_prefixes_used_in_table(self.df)
    new_prefixes = dict()
    missing_prefix = False
    for prefix in prefixes_in_map:
        if prefix in self.prefixmap:
            new_prefixes[prefix] = self.prefixmap[prefix]
        else:
            logging.warning(f"{prefix} is used in the data frame but does not exist in prefix map")
            missing_prefix = True
    if not missing_prefix:
        self.prefixmap = new_prefixes
I'm working on exporting Biomappings to SSSOM (biopragmatics/biomappings#45), but before I'm done, I want to use the sssom validate command to check my TSV is properly formatted. Unfortunately, it's causing the following error due to upstream issues in linkml:
(indra) cthoyt@galvac ~/d/biomappings> sssom validate docs/_data/biomappings.sssom.tsv
Traceback (most recent call last):
File "/Users/cthoyt/.virtualenvs/indra/bin/sssom", line 33, in <module>
sys.exit(load_entry_point('sssom', 'console_scripts', 'sssom')())
File "/Users/cthoyt/.virtualenvs/indra/bin/sssom", line 25, in importlib_load_entry_point
return next(matches).load()
File "/usr/local/Cellar/[email protected]/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/metadata.py", line 77, in load
module = import_module(match.group('module'))
File "/usr/local/Cellar/[email protected]/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 855, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/Users/cthoyt/dev/sssom-py/sssom/__init__.py", line 1, in <module>
from .util import (
File "/Users/cthoyt/dev/sssom-py/sssom/util.py", line 11, in <module>
from sssom.datamodel_util import MappingSetDiff, EntityPair
File "/Users/cthoyt/dev/sssom-py/sssom/datamodel_util.py", line 5, in <module>
from sssom.sssom_document import MappingSetDocument
File "/Users/cthoyt/dev/sssom-py/sssom/sssom_document.py", line 1, in <module>
from .sssom_datamodel import MappingSet, Mapping, Entity
File "/Users/cthoyt/dev/sssom-py/sssom/sssom_datamodel.py", line 16, in <module>
from linkml.utils.slot import Slot
ModuleNotFoundError: No module named 'linkml.utils.slot'
I think sssom validate can accept a URL, so you could also do this to verify:
$ sssom validate https://github.com/biomappings/biomappings/raw/add-sssom-export/docs/_data/biomappings.sssom.tsv
Consider merging utils and datamodel utils.
Frequently an automated process will yield different edges for a given S,P pair. It could be convenient to collapse these. Rather than trying to compress all the evidence into one row (not possible), simply drop the redundant lines with lower confidence. Potentially preserve the information somehow, e.g. in comments.
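A rough pandas sketch of the collapse, assuming deduplication on (subject_id, object_id) pairs and a numeric confidence column (helper name hypothetical):

```python
import pandas as pd

def collapse_redundant(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the highest-confidence row per (subject_id, object_id) pair;
    lower-confidence duplicates are simply dropped."""
    return (
        df.sort_values("confidence", ascending=False)
          .drop_duplicates(subset=["subject_id", "object_id"], keep="first")
          .reset_index(drop=True)
    )
```

Preserving the dropped rows, e.g. serialized into a comment column, could be layered on top of this.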
SSSOMDocument
This would be very useful, but is it scope creep? Does this belong instead in robot? cc @balhoff
Given an sssom triple A superclass_of B, this should be turned into B SubClassOf A
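A pandas sketch of this inversion (the predicate spellings owl:subClassOf and inverseOf(owl:subClassOf) are used for illustration, and the helper name is made up):

```python
import pandas as pd

def invert_superclass_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Rewrite 'A superclass_of B' rows as 'B subClassOf A'."""
    out = df.copy()
    mask = out["predicate_id"] == "inverseOf(owl:subClassOf)"
    # swap subject and object for the matching rows, then fix the predicate
    out.loc[mask, ["subject_id", "object_id"]] = out.loc[
        mask, ["object_id", "subject_id"]
    ].values
    out.loc[mask, "predicate_id"] = "owl:subClassOf"
    return out
```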
@matentzn : we may have prematurely replaced the hacky but functional perl code I had in the COB repo...
mdutils is a bit clunky but the table generation stuff works ok
this will replace https://github.com/cmungall/obo-scripts/blob/master/tbl2ghwiki
We need a command that uses the canonical column order to sort an entire mapping set data frame correctly.
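A sketch of such a sort, with a deliberately abbreviated stand-in for the canonical order (the real list would come from the SSSOM schema):

```python
import pandas as pd

# illustrative stand-in for the canonical SSSOM column order
CANONICAL_ORDER = ["subject_id", "predicate_id", "object_id", "confidence"]

def sort_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Reorder columns canonically; columns not in the canonical list
    are kept, appended after the known ones in their original order."""
    known = [c for c in CANONICAL_ORDER if c in df.columns]
    extra = [c for c in df.columns if c not in CANONICAL_ORDER]
    return df[known + extra]
```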
When making an sssom from an ontology there are multiple points of configuration
Rather than a complex configuration, I think it makes sense to have a standard ontology representation and a deterministic mapping. We would then do either pre-processing on the ontology (e.g. robot/sparql) or on the sssom file.
Another thing that would be very useful for mapping is to translate literals into nodes, potentially performing stemming etc. Then lexical matching becomes transitive closure over, e.g. T1 -> label -> T2. This has advantages for boomer as we get a combined lexical/synonym analysis, cc @balhoff -- however, this could be considered scope creep for sssom-py
Has to do with how extract_global_metadata is written.
Some py files talk about owl:subClassOf and inverseOf(owl:subClassOf). These don't seem to be implemented as part of the convert operation though, or am I wrong?
Please move sssom-py to https://github.com/mapping-commons and make me co-owner :)
@kshefchek later in our meeting will ask you something
and make reading \t the default.
I think this makes sense, but we should make it explicit in the spec.
discussing with @matentzn on mapping call now
For each subject-object tuple in each source
maybe think in the context of other set-wise operations: union, intersection
Currently we are using
pd.read_csv(filename, comment='#', sep=sep).fillna("")
to read a pandas data frame. This can be risky due to the potential of hash symbols in the middle of rows. It would be more robust to manually skip rows which start with # and read the rest.
MappingSetDataFrame should have a "merge" method with two parameters:
def merge(self, msdf, reconcile=True):
    # Merge msdf into self. If reconcile=True, then dedupe (remove redundant
    # lower-confidence mappings) and reconcile: if msdf contains a
    # higher-confidence _negative_ mapping, remove the lower-confidence
    # positive one. If confidence is the same, prefer HumanCurated. If both
    # are human-curated, prefer the negative mapping.
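A heavily reduced sketch of the dedupe half of such a merge, on plain frames (the negative-mapping and HumanCurated tie-breaking described above is deliberately left out; the names here are hypothetical):

```python
import pandas as pd

def merge_frames(df1: pd.DataFrame, df2: pd.DataFrame,
                 reconcile: bool = True) -> pd.DataFrame:
    """Concatenate two mapping frames; if reconcile is set, keep only the
    highest-confidence row per (subject_id, object_id) pair."""
    merged = pd.concat([df1, df2], ignore_index=True)
    if reconcile:
        merged = (merged.sort_values("confidence", ascending=False)
                        .drop_duplicates(subset=["subject_id", "object_id"]))
    return merged.reset_index(drop=True)
```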
@hrshdhgd and I are working on the SSSOM CLI and putting on the finishing touches. As a last major effort I would like to implement piping/stdout streaming so I can do stuff like:
sssom dedupe -i input.sssom.tsv | grep OMIM | sssom convert --output-format json | jq some-jq-filter > output.json.
Imagine these CLI methods:
def dedupe(input, output):
    ...
def convert(input, output, output_format):
    ...
What would be the right way to design dedupe and convert to enable that kind of piping?
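One common answer, assuming the CLI is click-based (as sssom/cli.py appears to be): give --input and --output the click.File type with a default of '-', so they fall back to stdin/stdout when omitted. A toy pass-through command to illustrate the pattern:

```python
import click

@click.command()
@click.option("--input", "-i", "input_", type=click.File("r"), default="-",
              help="Input file; '-' (the default) reads from stdin.")
@click.option("--output", "-o", type=click.File("w"), default="-",
              help="Output file; '-' (the default) writes to stdout.")
def dedupe(input_, output):
    """Toy stand-in for the real command: copy input to output unchanged."""
    for line in input_:
        output.write(line)
```

With every command written this way, each stage of a pipeline reads stdin and writes stdout by default, so shell composition with | works out of the box.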
This is for the CLI version of merge_msdf
All of these should probably be in a separate python file just for static variables.
Originally posted by @matentzn in #72 (comment)
Even though it takes a bit of getting used to, the black code formatter is proving to be a godsend on projects with multiple authors that all write python with slightly different styles.
I'm using it in all of my collaborative projects now, I'd suggest setting it up here. It's also possible to use a pre-commit hook so people can write code however they want, then it automatically fixes it just before pushing to github. Then you can also add a CI action to check that black has been applied before accepting PRs.
Let me know if you want me to add the config/docs that would help you get started. In the meantime, you could always see what happens when you do:
$ pip install black
$ black sssom/ tests/
All throughout SSSOM we see this context parameter to functions. I have always kinda thought of it as the prefixmap. Is there anything else a user should be able to pass? The context should be coming from SSSOM itself, while the prefixes can be passed in - or am I missing something?
We need a filter command that allows us to subset a mapping table for more common filter operations easily. The alternative is a powerful filter language that will end up reimplementing SQL, which we already have in DOSQL. So for the moment, these are my most important reasons to filter:
There are many more things I want, but I can do the rest with dosql
I am thinking of something like this:
sssom filter input.sssom.tsv --prefixes "MP HP MONDO" --predicates "skos:exactMatch skos:broadMatch"
I am also ok with
sssom filter input.sssom.tsv --prefix MP --prefix HP --prefix MONDO --predicate_id skos:exactMatch --predicate_id skos:broadMatch
If that feels cleaner, but the selection logic in the prefix case should be conjunctive, i.e. the Mapping should not have any other prefix in the subject or object ids other than the ones provided.
There is some case to be made for this design as well:
sssom filter input.sssom.tsv --subject_id MP:* --subject_id MP:* --object_id MP:* --predicate_id skos:exactMatch --predicate_id skos:broadMatch
then the subject_id parameter (multivalued) should be conjunctive. Actually, I kinda like this last proposal now. It should follow the same dynamic parameter logic as #280?
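A pandas sketch of the filter logic (helper name invented; the prefix check is conjunctive as described above, i.e. both the subject and the object prefix must be among the ones provided):

```python
import pandas as pd

def filter_mappings(df: pd.DataFrame, prefixes=None, predicates=None) -> pd.DataFrame:
    """Subset a mapping frame by allowed prefixes and/or predicates."""
    if prefixes:
        # keep a row only if BOTH subject and object prefixes are allowed
        subj_ok = df["subject_id"].str.split(":", n=1).str[0].isin(prefixes)
        obj_ok = df["object_id"].str.split(":", n=1).str[0].isin(prefixes)
        df = df[subj_ok & obj_ok]
    if predicates:
        df = df[df["predicate_id"].isin(predicates)]
    return df.reset_index(drop=True)
```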
So far we have focussed on sssom imports, but here we need to make sure the export formats work well.
For some reason we have two methods to write sssom files:
write_sssom and write_tsv.
@hrshdhgd can you figure out how they are different?
Simple variation of the parser that returns true or false and warnings.
It would be great if --input could generally accept file paths and URLs. For example, from_tsv(tsv_path).
I don't think the docstring of a command line is really the place where people show examples. You have manpages for most CLI-only tools, and SSSOM should also have user documentation with Sphinx. This is where usage and examples should go.
Originally posted by @cthoyt in #50 (comment)
We need a method that calls dedupe and then removes higher-confidence negative mappings, see also: