laminlabs / lnschema-core
Registries for basic data management.
License: Apache License 2.0
I spent a lot of time looking for a method to get the path to the local cache of a DObject when building a PyTorch workflow with lamindb, only to realize that this is accomplished by load().
I knew that load() returned the path to the local cache, but it seemed unintuitive and wrong (inefficient) to use it just to access the path, since the "load" semantics conveys the idea that a lot more is happening under the hood.
Should there be a less ambiguous, single-responsibility method to get the path to the local cache of a DObject, say local_path()?
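To make the proposal concrete, here is a minimal sketch of how the two responsibilities could be separated. CachedObject, its constructor arguments, and both method bodies are hypothetical stand-ins written for illustration; this is not the lamindb DObject class or its actual behavior.

```python
from pathlib import Path


class CachedObject:
    """Hypothetical stand-in for a DObject; not the lamindb class."""

    def __init__(self, cache_dir, key):
        self._cache_dir = Path(cache_dir)
        self._key = key

    def local_path(self) -> Path:
        """Return the location in the local cache without reading any data."""
        return self._cache_dir / self._key

    def load(self) -> bytes:
        """Read the cached file into memory, building on local_path()."""
        return self.local_path().read_bytes()
```

With such a split, a PyTorch dataset could call local_path() to hand a file path to its own reader, while load() stays reserved for bringing data into memory.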
https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py#L1089
Ran into this during the hackathon and am running into it again for the Nextflow RNA-seq use case. File names that heavily rely on . for separation exceed the 30 character limit.
IIRC, during our discussions we decided to always use the last suffix, and if there are multiple suffixes, to take the last two.
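The suffix rule described above could be sketched as follows. The function name extract_suffix and the example file names are made up for illustration; this is one possible reading of the rule, not the code in lnschema-core.

```python
from pathlib import Path


def extract_suffix(filename: str) -> str:
    """Keep the last suffix, or the last two if the name has several."""
    suffixes = Path(filename).suffixes  # e.g. ['.R1', '.trimmed', '.fastq', '.gz']
    return "".join(suffixes[-2:])


print(extract_suffix("counts.csv"))                    # .csv
print(extract_suffix("sample_1.R1.trimmed.fastq.gz"))  # .fastq.gz
```

One caveat of this reading: a name like report.v2.csv would yield .v2.csv, so a rule based purely on position cannot tell a version tag from a compound extension.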
The storage_root string can get very long and will need many bytes and hamper readability. The below is from the current lamindb docs (e.g. https://lamin.ai/docs/db/tutorials/introspect)
@fredericenard, maybe we should switch to a more lightweight key (integer or a rather short base62) sooner rather than later.
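For reference, a short base62 key can be generated with the standard library alone. The function name base62_id and the default length of 8 are assumptions chosen for illustration, not what lamindb uses.

```python
import secrets
import string

# 0-9, a-z, A-Z: 62 symbols in total
BASE62 = string.digits + string.ascii_letters


def base62_id(n_char: int = 8) -> str:
    """Return a random identifier made of n_char base62 characters."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


print(base62_id())
```

Eight characters give 62^8 (about 2.2 x 10^14) possible keys, far shorter than a full storage root string while still leaving a large key space.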
import lnschema_bionty as bt
reference = bt.Gene(species="mouse")
features = ln.Features.from_iterable(adata.var["ensemble_id"], Gene.ensembl_gene_id)
features.save()
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq")
file.save()
file.featuresets.add(featureset)
file.featuresets.add(featureset)
featureset is undefined here, right? Maybe it's too early in the morning
Snake case is typically preferred over camel case for naming tables in relational DBs because SQL is case insensitive.
Additionally, column/field names can have a direct correspondence with table names. For example, a column dobject_id refers to the id column in the table dobject. If the table were to be called DObject, both dobject and d_object would be legitimate snake case conversions. Indeed, there is a wide variety in how typical snake case conversion functions deal with grouped capital letters.
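To illustrate that variety, here is a sketch comparing two commonly seen regex-based converters. Both function names are made up for this example, and neither is the converter used by any particular library.

```python
import re


def snake_split_every_capital(name: str) -> str:
    """Insert "_" before every capital letter except the first character."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_group_capitals(name: str) -> str:
    """Keep a run of capitals together, splitting only where a new word starts."""
    step = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step).lower()


for name in ["PipelineRun", "DObject", "SQLModel"]:
    print(name, snake_split_every_capital(name), snake_group_capitals(name))
# PipelineRun pipeline_run pipeline_run
# DObject d_object d_object
# SQLModel s_q_l_model sql_model
```

The two agree on PipelineRun but disagree on SQLModel, and neither produces dobject for DObject.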
On the other hand, we prefer camel case for Python class definitions.
One solution we investigated was auto-generating snake case names for tables from camel case class names (see appendix). However, we didn't find an approach that yields visually appealing names in all cases (important when interacting directly with the database).
Given the far-reaching consequences of defining table names, and the importance of having them directly be identified with the corresponding Python classes, we decided to define Python classes that represent SQL tables in snake case.
On the other hand, we still want to provide the Python user a CamelCase experience. Auto-generating camel case from snake case comes with ambiguities at least in cases with adjacent capital letters (sql_model -> SQLModel or SqlModel?). Hence, we decided to re-export snake case class definitions from a submodule at the core schema module level as
from ._core import dobject as DObject
from ._core import dtransform as DTransform
...
This means a small amount of manual work (compared to defining the schema) to define the snake_case to CamelCase mapping, but we feel that is justified. Also, we feel that making changes to these names (which might be desired because of the aforementioned ambiguity) is less tragic than making changes to the underlying "SQL table ground truth" names, as these would require a migration.
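A small sketch of what such a re-export does at the Python level. The class below is a plain stand-in with a hypothetical __tablename__ attribute; no SQL library is involved, and the assignment merely mimics the effect of the import alias shown above.

```python
class dobject:
    """Stand-in for a snake_case table class (hypothetical, no SQL involved)."""

    __tablename__ = "dobject"


# Same effect as: from ._core import dobject as DObject
DObject = dobject

print(DObject is dobject)     # True
print(DObject.__name__)       # dobject
print(DObject.__tablename__)  # dobject
```

The alias is only a second name for the same class object, so renaming it later leaves the class name and the table name untouched.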
We investigated many potential auto-generated solutions for generating snake case names from camel case class names starting out with this suggestion.
For auto-generated snake case we always found camel case examples that didn't appeal to us. The most appealing auto-generated solution we found is the following - but even this one has at least one case that we feel uncomfortable to hard-code:
# from https://codereview.stackexchange.com/questions/185966/
import re

def _jl_match(match):
    group = match.group()
    prefix = bool(match.start() and not group.startswith('_'))
    return '_' * prefix + group.lower()

REGEX = r'(.(?:[^a-z_]+(?=[A-Z_]|$)|[^A-Z_]+))'

def snake_case(string):
    return re.subn(REGEX, _jl_match, string)[0].lower()
Applied to these test cases,
tests = [
    ("Dobject", "dobject"),
    ("DObject", "d_object"),  # we'd prefer dobject here!
    ("SQLModel", "sql_model"),  # we'd prefer sqlmodel here!
    ("PipelineRun", "pipeline_run"),
    ("DobjectBiometa", "dobject_biometa"),
    ('CamelCASERules', 'camel_case_rules'),
    ('IndexID', 'index_id'),
    ('CamelCASE', 'camel_case'),
    ('AnID', 'an_id'),
    ('An_ID', 'an_id'),
    ('AnmmIDnn', 'anmmi_dnn'),  # this seems hard to accept, we'd expect anm_i_dnn
    ('AnIDInn', 'an_id_inn'),
    ('theIDForUSGovAndDOD', 'the_id_for_us_gov_and_dod'),
    ('TheID_', 'the_id_'),
]
for test in tests:
    print(snake_case(test[0]), snake_case(test[0]) == test[1])
it produces the following results:
dobject True
d_object True
sql_model True
pipeline_run True
dobject_biometa True
camel_case_rules True
index_id True
camel_case True
an_id True
an_id True
anmmi_dnn True
an_id_inn True
the_id_for_us_gov_and_dod True
the_id_ True
Adapt lnschema-bionty as well
There is a small performance penalty because a query is needed.
Should we consider using the semantic accessor as the primary key to avoid that penalty?
Transform.parents: https://github.com/laminlabs/laminhub/issues/35
Might be very important unless we find another solution:
Currently, we only track the title.
lnschema-core/lnschema_core/dev/sqlmodel.py
Lines 17 to 23 in ee8919f
They're clearly active at time of import...
Currently, we are not encoding the uids of Feature and ULabel records, which causes FeatureSet records in different instances to have different uids (a FeatureSet uid is a hash of the feature uids) even though they contain the same features.
This is a broader discussion about how to make uid immutable and name mutable while also tracking the identity of a record. For bionty records, this can be done via hashing ontology_id instead of name.
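A minimal sketch of the idea, assuming uids are derived by hashing a stable identifier. The helper names hash_to_uid and feature_set_uid, the choice of SHA-256, the truncation lengths, and the example identifiers are all assumptions for illustration and do not reflect the actual lamindb implementation.

```python
import base64
import hashlib


def hash_to_uid(text: str, n_char: int = 12) -> str:
    """Derive a deterministic short uid from a stable identifier."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:n_char]


def feature_set_uid(feature_uids) -> str:
    """Hash the sorted member uids so that insertion order does not matter."""
    return hash_to_uid(",".join(sorted(feature_uids)), n_char=20)


# Example identifier strings; two instances register them in different order.
ontology_ids = ["ENSMUSG00000000001", "ENSMUSG00000000028"]
instance_a = [hash_to_uid(i) for i in ontology_ids]
instance_b = [hash_to_uid(i) for i in reversed(ontology_ids)]

print(feature_set_uid(instance_a) == feature_set_uid(instance_b))  # True
```

Because each feature uid depends only on the identifier being hashed, two instances arrive at the same feature set uid without exchanging any state; with randomly generated per-instance uids they would not.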
Should dfolder also have a storage column?
How do we prevent users from creating duplicate dfolder records?