
lnschema-core's People

Contributors

bpenteado, falexwolf, fredericenard, koncopd, sunnyosun, zethson

lnschema-core's Issues

How to access the local cache of a `DObject`?

I spent a lot of time looking for a method to get the path to the local cache of a DObject when building a PyTorch workflow with lamindb, only to realize that this is accomplished by load().

I knew that load() returned the path to the local cache, but it seemed unintuitive and wrong (inefficient) to use it just to access the path, since "load" semantics convey the idea that a lot more is happening under the hood.

Should there be a less ambiguous, single-responsibility method to get the path to the local cache of a DObject, say local_path()?
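To make the proposal concrete, here is a minimal sketch of what a single-responsibility accessor could look like. The class `CachedObject`, its constructor arguments, and the method bodies are hypothetical stand-ins and not the lamindb API; only the name `local_path()` comes from the suggestion above.

```python
from pathlib import Path


class CachedObject:
    """Hypothetical stand-in for a DObject; not the actual lamindb class."""

    def __init__(self, key: str, cache_dir: Path):
        self.key = key
        self.cache_dir = Path(cache_dir)

    def local_path(self) -> Path:
        """Return where the cached copy lives, without reading the file."""
        return self.cache_dir / self.key

    def load(self) -> bytes:
        """Read the cached copy into memory (the heavier operation)."""
        return self.local_path().read_bytes()
```

With this split, a PyTorch dataset could call `local_path()` to hand a file path to its own reader, and only call `load()` when it actually wants the content in memory.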

File suffix not flexible enough

https://github.com/laminlabs/lnschema-core/blob/main/lnschema_core/models.py#L1089

I ran into this during the hackathon and am running into it again for the nextflow RNA-seq use case. File names that rely heavily on . for separation exceed the 30-character limit.

IIRC, during our discussions we decided to always use the last suffix and, if there are multiple suffixes, to take the last two.
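The rule discussed above could be sketched as follows. This is an illustration under assumptions, not the code in lnschema-core: the function name `extract_suffix` and the parameter `max_parts` are made up for this example.

```python
from pathlib import PurePosixPath


def extract_suffix(filename: str, max_parts: int = 2) -> str:
    """Keep at most the last `max_parts` dot-separated suffixes of a file name."""
    suffixes = PurePosixPath(filename).suffixes  # e.g. ['.v1', '.2', '.fastq', '.gz']
    return "".join(suffixes[-max_parts:])


for name in ["data.csv", "sample.v1.2.fastq.gz", "README"]:
    print(repr(name), "->", repr(extract_suffix(name)))
```

One caveat of this sketch: a name such as `my.sample.csv` yields `.sample.csv`, so a real implementation would probably need a list of known compound suffixes to decide when the second-to-last part belongs to the suffix.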

Featureset example broken?

import lnschema_bionty as bt
reference = bt.Gene(species="mouse")
features = ln.Features.from_iterable(adata.var["ensemble_id"], Gene.ensembl_gene_id)
features.save()
file = ln.File(adata, name="Mouse Lymph Node scRNA-seq")
file.save()
file.featuresets.add(featureset)

In the last line, file.featuresets.add(featureset), the name featureset is undefined, right? Maybe it's too early in the morning.

Camel case vs. snake case in SQL table vs. corresponding Python class definitions

Snake case is typically preferred over camel case for naming tables in relational DBs because SQL is case insensitive.

Additionally, column/field names can have a direct correspondence with table names. Say dobject_id as a column refers to the id column in the table dobject. If the table were to be called DObject, both dobject and d_object would be legitimate snake case conversions. Indeed, there is a wide variety in how typical snake case conversion functions deal with grouped capital letters.

On the other hand, we prefer camel case for Python class definitions.

One solution we investigated was auto-generating snake case names for tables from camel case class names (see appendix). However, we didn't find an approach that yields visually appealing names in all cases (important when interacting directly with the database).

Given the far-reaching consequences of defining table names, and the importance of having them directly be identified with the corresponding Python classes, we decided to define Python classes that represent SQL tables in snake case.

On the other hand, we still want to provide the Python user a CamelCase experience. Auto-generating camel case from snake case comes with ambiguities at least in cases with adjacent capital letters (sql_model -> SQLModel or SqlModel?). Hence, we decided to re-export snake case class definitions from a submodule at the core schema module level as

from ._core import dobject as DObject
from ._core import dtransform as DTransform
...

This means some manual work to define the snake case to camel case mapping (minimal compared to defining the schema), but we feel that is justified. Also, we feel that making changes to these names (which might be desired because of the aforementioned ambiguity) is less tragic than making changes to the underlying "SQL table ground truth" names, as these would require a migration.
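The ambiguity mentioned above can be illustrated with a short sketch. The helper names `naive_camel_case`, `camel_case`, and `CAMEL_CASE_OVERRIDES` are hypothetical and only serve to show why an explicit mapping (here a dictionary, in the schema module the re-export lines) is needed for some names.

```python
def naive_camel_case(snake_name: str) -> str:
    """Capitalize each underscore-separated part and join the parts."""
    return "".join(part.capitalize() for part in snake_name.split("_"))


# Explicit overrides for names the naive rule gets wrong (illustrative entries).
CAMEL_CASE_OVERRIDES = {
    "dobject": "DObject",
    "dtransform": "DTransform",
    "sql_model": "SQLModel",
}


def camel_case(snake_name: str) -> str:
    """Prefer the explicit mapping, fall back to the naive rule."""
    return CAMEL_CASE_OVERRIDES.get(snake_name, naive_camel_case(snake_name))


for name in ["pipeline_run", "dobject", "sql_model"]:
    print(name, naive_camel_case(name), camel_case(name))
```

The naive rule handles `pipeline_run` but produces `Dobject` and `SqlModel`, because the snake case name carries no information about which letters were grouped capitals.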

Appendix

We investigated many potential auto-generated solutions for generating snake case names from camel case class names starting out with this suggestion.

For every auto-generated snake case solution, we found camel case examples whose conversion didn't appeal to us. The most appealing auto-generated solution we found is the following, but even this one has at least one case that we would feel uncomfortable hard-coding:

# from https://codereview.stackexchange.com/questions/185966/
import re

def _jl_match(match):
    group = match.group()
    prefix = bool(match.start() and not group.startswith('_'))
    return '_' * prefix + group.lower()

REGEX = r'(.(?:[^a-z_]+(?=[A-Z_]|$)|[^A-Z_]+))'

def snake_case(string):
    return re.subn(REGEX, _jl_match, string)[0].lower()

Applied to these test cases,

tests = [
  ("Dobject", "dobject"),    
  ("DObject", "d_object"),    # we'd prefer dobject here!
  ("SQLModel", "sql_model"),  # we'd prefer sqlmodel here!
  ("PipelineRun", "pipeline_run"),
  ("DobjectBiometa", "dobject_biometa"),
  ('CamelCASERules', 'camel_case_rules'),
  ('IndexID', 'index_id'),
  ('CamelCASE', 'camel_case'), 
  ('AnID', 'an_id'),
  ('An_ID', 'an_id'),
  ('AnmmIDnn', 'anmmi_dnn'),  # this seems hard to accept, we'd expect anm_i_dnn
  ('AnIDInn', 'an_id_inn'), 
  ('theIDForUSGovAndDOD', 'the_id_for_us_gov_and_dod'), 
  ('TheID_', 'the_id_'),
]
for test in tests:
    print(snake_case(test[0]), snake_case(test[0]) == test[1])

it produces the following results:

dobject True
d_object True
sql_model True
pipeline_run True
dobject_biometa True
camel_case_rules True
index_id True
camel_case True
an_id True
an_id True
anmmi_dnn True
an_id_inn True
the_id_for_us_gov_and_dod True
the_id_ True

Encoding Feature and ULabel uids

Currently, we are not encoding the uids of Feature and ULabel, which causes a FeatureSet to have a different uid (a hash of the feature uids) in different instances even though it contains the same features.
This is part of a broader discussion about how to make uid immutable and name mutable while still tracking the identity of a record. For bionty records, this can be done by hashing ontology_id instead of name.
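A small sketch of the effect described above, under stated assumptions: the helpers `short_hash` and `feature_set_uid`, the hash function, the uid lengths, and all identifiers are made up for illustration and do not reflect how lnschema-core computes uids.

```python
import hashlib


def short_hash(text: str, length: int = 12) -> str:
    """Deterministic short identifier derived from a string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def feature_set_uid(feature_uids) -> str:
    """Hash the sorted member uids so that order does not matter."""
    return short_hash(",".join(sorted(feature_uids)), length=20)


# Made-up placeholder ontology ids for the same three features.
ontology_ids = ["ONT:0001", "ONT:0002", "ONT:0003"]

# Without encoding: each instance assigned its own arbitrary uids.
instance_a_uids = ["k3f9a1", "p0z2m7", "q8w4e5"]
instance_b_uids = ["x7c1v2", "b6n3m9", "l5k0j8"]
print(feature_set_uid(instance_a_uids) == feature_set_uid(instance_b_uids))  # False

# With encoding: uids are derived from a stable identifier such as ontology_id.
encoded_a = [short_hash(oid) for oid in ontology_ids]
encoded_b = [short_hash(oid) for oid in reversed(ontology_ids)]
print(feature_set_uid(encoded_a) == feature_set_uid(encoded_b))  # True
```

Deriving the uid from a stable identifier rather than from name also means a record can be renamed without changing its uid or the uid of any feature set containing it.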
