laminlabs / lnschema-core

Registries for basic data management.

License: Apache License 2.0

Python 100.00%

lnschema-core's Issues

How to access the local cache of a `DObject`?

I spent a lot of time looking for a method to get the path to the local cache of a DObject when building a PyTorch workflow with lamindb, only to realize that this is accomplished by load().

I knew that load() returned the path to the local cache, but it seemed unintuitive and wrong (inefficient) to use it just to access the path, since the "load" semantics conveys the idea that a lot more is happening under the hood.

Should there be a less ambiguous, single-responsibility method to get the path to the local cache of a DObject, say local_path()?

`using` has no example docstring

https://lamin.ai/docs/lamindb.core.Registry.html#lamindb.core.Registry.using

File suffix not flexible enough

https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py#L1089

Ran into this during the hackathon and am running into it again for the nextflow RNA-seq use-case. File names that heavily rely on . for separation exceed the 30 character limit.

IIRC, during our discussions we decided to always use the last suffix and, if there are multiple suffixes, to take the last two.
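
The rule described above could be sketched as follows. This is only an illustration built on `pathlib`; the helper name `extract_suffix` and its `max_suffixes` parameter are hypothetical and not part of lnschema-core.

```python
from pathlib import PurePosixPath


def extract_suffix(filename: str, max_suffixes: int = 2) -> str:
    """Return at most the last `max_suffixes` dot-separated suffixes, joined."""
    # PurePosixPath.suffixes splits the file name on "." and returns
    # every suffix, e.g. [".v1", ".2", ".fastq", ".gz"].
    suffixes = PurePosixPath(filename).suffixes
    return "".join(suffixes[-max_suffixes:])


print(extract_suffix("sample.v1.2.fastq.gz"))  # .fastq.gz
print(extract_suffix("data.csv"))              # .csv
print(extract_suffix("README"))                # prints an empty line
```

Note that a purely positional rule like this one would also treat a name such as `report.2023.csv` as having the suffix `.2023.csv`, so a real implementation would probably need an allow-list of known compound suffixes.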

Consider light-weight primary key for storage table

The storage_root string can get very long, which takes up many bytes and hampers readability. The below is from the current lamindb docs (e.g. https://lamin.ai/docs/db/tutorials/introspect)

@fredericenard, maybe we should switch to a more lightweight key (an integer or a rather short base62 string) sooner rather than later. 😇
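
For illustration, a short random base62 key could be generated as in the sketch below. The function name `base62_id` and the default length of 8 characters are assumptions made for this example, not the scheme lnschema-core uses.

```python
import secrets
import string

# The 62 characters 0-9, a-z, A-Z.
BASE62_ALPHABET = string.digits + string.ascii_letters


def base62_id(n_char: int = 8) -> str:
    """Return a random base62 string of length `n_char` (hypothetical helper)."""
    return "".join(secrets.choice(BASE62_ALPHABET) for _ in range(n_char))


storage_id = base62_id()
print(len(storage_id))  # 8
```

With 8 characters there are 62^8 (about 2.2 × 10^14) possible keys. The sketch does not check for collisions; a unique constraint on the column would still be needed.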

Rename `_filekey` to `_objectkey`

Featureset example broken?

import lnschema_bionty as bt
reference = bt.Gene(species="mouse")
features = ln.Features.from_iterable(adata.var["ensemble_id"], Gene.ensembl_gene_id)
features.save()
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq")
file.save()
file.featuresets.add(featureset)

In `file.featuresets.add(featureset)`, `featureset` is undefined, right? Maybe it's too early in the morning.

Camel case vs. snake case in SQL table vs. corresponding Python class definitions

Snake case is typically preferred over camel case for naming tables in relational DBs because SQL identifiers are generally case-insensitive unless quoted.

Additionally, column/field names can have a direct correspondence with table names. Say dobject_id as a column refers to the id column in the table dobject. If the table were to be called DObject, both dobject and d_object would be legitimate snake case conversions. Indeed, there is a wide variety in how typical snake case conversion functions deal with grouped capital letters.
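
To illustrate that variety, here is a small sketch comparing two regex-based conversions that are commonly seen in the wild. Both function names are made up for this example; neither is used in lnschema-core.

```python
import re


def snake_every_capital(name: str) -> str:
    """Insert an underscore before every capital letter except the first."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def snake_grouped_capitals(name: str) -> str:
    """Two-pass conversion that keeps runs of capitals together."""
    # First pass: split before a capitalized word ("SQLModel" -> "SQL_Model").
    step = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Second pass: split between a lowercase letter or digit and a capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", step).lower()


for name in ["DObject", "SQLModel"]:
    print(name, snake_every_capital(name), snake_grouped_capitals(name))
# DObject d_object d_object
# SQLModel s_q_l_model sql_model
```

Both variants produce d_object for DObject, and neither yields the plain lowercase dobject.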

On the other hand, we prefer camel case for Python class definitions.

One solution we investigated was auto-generating snake case names for tables from camel case class names (see appendix). However, we didn't find an approach that yields visually appealing names (important when interacting directly with the database).

Given the far-reaching consequences of defining table names, and the importance of having them map directly to the corresponding Python classes, we decided to define Python classes that represent SQL tables in snake case.

On the other hand, we still want to provide the Python user a CamelCase experience. Auto-generating camel case from snake case comes with ambiguities at least in cases with adjacent capital letters (sql_model -> SQLModel or SqlModel?). Hence, we decided to re-export snake case class definitions from a submodule at the core schema module level as

from ._core import dobject as DObject
from ._core import dtransform as DTransform
...

This requires a small amount of extra work (compared to defining the schema) to maintain the snake case to camel case mapping, but we feel that is justified. Also, we feel that changing these names (which might be desired because of the aforementioned ambiguity) is less tragic than changing the underlying "SQL table ground truth" names, as the latter would require a migration.
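
The ambiguity of the reverse direction can be seen in a minimal sketch. The function `naive_camel_case` is hypothetical and only serves to show why an explicit mapping was chosen over auto-generation.

```python
def naive_camel_case(name: str) -> str:
    """Capitalize each underscore-separated part and join the parts."""
    return "".join(part.capitalize() for part in name.split("_"))


print(naive_camel_case("pipeline_run"))  # PipelineRun
print(naive_camel_case("sql_model"))     # SqlModel, not SQLModel
print(naive_camel_case("dobject"))       # Dobject, not DObject
```

An automatic rule cannot know that `sql` is an acronym or that the `d` in `dobject` should stay a separate capital, whereas the explicit re-exports state the intended name directly.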

Appendix

We investigated many potential auto-generated solutions for generating snake case names from camel case class names starting out with this suggestion.

For every auto-generation scheme, we found camel case examples whose snake case result didn't appeal to us. The most appealing solution we found is the following, but even this one has at least one case that we would be uncomfortable hard-coding:

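# from https://codereview.stackexchange.com/questions/185966/
import re

def _jl_match(match):
    # Prefix an underscore unless the group sits at the start of the string
    # or already begins with an underscore.
    group = match.group()
    prefix = bool(match.start() and not group.startswith('_'))
    return '_' * prefix + group.lower()

# One character followed by either a run of non-lowercase characters that
# precedes an uppercase letter, an underscore, or the end of the string,
# or a run of non-uppercase characters.
REGEX = r'(.(?:[^a-z_]+(?=[A-Z_]|$)|[^A-Z_]+))'

def snake_case(string):
    # The final lower() also covers characters the regex skips,
    # such as the "D" in "DObject".
    return re.subn(REGEX, _jl_match, string)[0].lower()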

Applied to these test cases,

tests = [
    ("Dobject", "dobject"),
    ("DObject", "d_object"),  # we'd prefer dobject here!
    ("SQLModel", "sql_model"),  # we'd prefer sqlmodel here!
    ("PipelineRun", "pipeline_run"),
    ("DobjectBiometa", "dobject_biometa"),
    ("CamelCASERules", "camel_case_rules"),
    ("IndexID", "index_id"),
    ("CamelCASE", "camel_case"),
    ("AnID", "an_id"),
    ("An_ID", "an_id"),
    ("AnmmIDnn", "anmmi_dnn"),  # this seems hard to accept, we'd expect anm_i_dnn
    ("AnIDInn", "an_id_inn"),
    ("theIDForUSGovAndDOD", "the_id_for_us_gov_and_dod"),
    ("TheID_", "the_id_"),
]
for camel, expected in tests:
    result = snake_case(camel)
    print(result, result == expected)

it produces the following results:

dobject True
d_object True
sql_model True
pipeline_run True
dobject_biometa True
camel_case_rules True
index_id True
camel_case True
an_id True
an_id True
anmmi_dnn True
an_id_inn True
the_id_for_us_gov_and_dod True
the_id_ True

repr for registry

Context: https://laminlabs.slack.com/archives/C04MU979KD3/p1712306606861379?thread_ts=1711880502.921519&cid=C04MU979KD3

Adapt lnschema-bionty as well

`filepath_from_dobject` should leverage the storage table and not settings

There is a small performance penalty because a query is needed.

Should we consider using the semantic accessor as the primary key to avoid that penalty? 🤔

Fix typo `dobjects_features`, should be `dobject_features`

Consider tracking modality also on File level in case there are no semantic features (like for images)

🔖 Migrations for next release

Improve Transform.parents: https://github.com/laminlabs/laminhub/issues/35

Why are foreign key constraints sometimes not detected?

Add a unique constraint to `Storage` root

https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py#L672

Release script that ensures the correct migration number is provided

Might be very important unless we find another solution:

d1c3fe3

Track both filename & title in Jupyter notebooks / and add storage location

Currently, we only track the title.

Why don't naming conventions seem to be applied at init?

lnschema-core/lnschema_core/dev/sqlmodel.py

Lines 17 to 23 in ee8919f

sqm.SQLModel.metadata.naming_convention = {
    "ix": "ix_%(column_0_label)s",
    "uq": "uq_%(table_name)s_%(column_0_name)s",
    "ck": "ck_%(table_name)s_`%(constraint_name)s`",
    "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
    "pk": "pk_%(table_name)s",
}

They're clearly active at time of import...

Encoding Feature and ULabel uids

Currently, we are not encoding the uids of Feature and ULabel, which causes FeatureSets in different instances to have different uids (the uid is a hash of the feature uids) even though they contain the same features.
This is part of a broader discussion about how to make uid immutable and name mutable while still tracking the identity of a record. For bionty records, this can be done by hashing ontology_id instead of name.
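
As a rough illustration of deterministic uids, the sketch below derives a uid from a stable identifier and a feature set uid from its member uids. The helper names, the choice of SHA-256, and the uid lengths are all assumptions made for this example and do not reflect the actual lamindb implementation.

```python
import base64
import hashlib


def uid_from_string(value: str, n_char: int = 12) -> str:
    """Derive a deterministic uid by hashing a stable identifier (hypothetical)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # URL-safe base64 keeps the uid short and printable.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:n_char]


def featureset_uid(feature_uids: list[str], n_char: int = 20) -> str:
    """Hash the sorted member uids so that insertion order does not matter."""
    return uid_from_string(",".join(sorted(feature_uids)), n_char=n_char)


# Two instances hashing the same ontology ids arrive at the same uids.
uids_a = [uid_from_string(x) for x in ["CL:0000084", "CL:0000236"]]
uids_b = [uid_from_string(x) for x in ["CL:0000236", "CL:0000084"]]
print(featureset_uid(uids_a) == featureset_uid(uids_b))  # True
```

Because the uid depends only on the ontology id, renaming a record leaves its uid untouched, which is the property discussed above.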

Add unique constraint of DFolder._objectkey and storage

Should dfolder also have a storage column?

How do we prevent users from creating duplicate dfolder records?

Use immutable types consistently in function headers

#138
