CLDF dataset loading pains (beastling, open, 19 comments)

lmaurits commented on September 6, 2024

Comments (19)

xrotwang commented on September 6, 2024

The more robust way is via the ontology, i.e. you'd look up the local column name by its ontology term. This can be done using the local part of the term URI (https://cldf.clld.org/v1.0/terms.rdf#glottocode):

glottocode_col_name = dataset["LanguageTable", "glottocode"].name
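
For example (a minimal sketch, assuming a dataset loaded from its metadata file; the path and variable names are made up, and it assumes csvw-backed tables where iterdicts() yields rows keyed by local column names):

from pycldf import Dataset

dataset = Dataset.from_metadata("Uralex/cldf-metadata.json")  # hypothetical path
# Resolve the local column name via the ontology term, then index rows by it
glottocode_col_name = dataset["LanguageTable", "glottocode"].name
languages_by_glottocode = {
    row[glottocode_col_name]: row
    for row in dataset["LanguageTable"].iterdicts()
    if row[glottocode_col_name]
}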

Anaphory commented on September 6, 2024

> I would actually expect the CLDF format to mandate this?

It does not mandate that they exist, but it specifies how to recognize them when they are there.

> ideally our XML "template" should be robust against people doing very strange things like naming one of their features identically to a language

Yes. It should become that way.

xrotwang commented on September 6, 2024

I'd really hope to get some functionality to "binarize" data into pycldf - or some sort of plugin system - too. So for the uralex data this would mean turning cognate sets into binary features, right? So far we have only done a bit of brainstorming on this - whether to organize such functionality by CLDF module or rather by output format, like a cldf_nexus plugin. We'll probably go with the latter - so maybe there should be a cldf_beastling plugin, too?
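
As a sketch of the idea (all names here are hypothetical; it assumes the input is an iterable of (language, cognateset) pairs extracted from a CognateTable):

from collections import defaultdict

def binarize_cognates(pairs):
    """Turn (language, cognateset) pairs into one binary feature per cognate set."""
    members = defaultdict(set)
    for language, cognateset in pairs:
        members[cognateset].add(language)
    languages = set().union(*members.values()) if members else set()
    # For each cognate set, code every language as 1 (attested) or 0 (absent)
    return {
        cs: {lang: int(lang in langs) for lang in languages}
        for cs, langs in members.items()
    }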

lmaurits commented on September 6, 2024

I've just made a commit to use glottocodes, ISO codes or human-readable names (in that order of preference) for language identifiers, and names for features, when reading CLDF data. It works very nicely for Uralex (and I can now, e.g., specify calibrations using Glottolog names for subfamilies and have everything work), but I don't know how robust it is to variation in CLDF formatting. Is it safe to directly access some fields of the LanguageTable by name (e.g. "Glottocode"), or should I first fetch the appropriate name to use, as is done elsewhere for e.g. language_column?
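
That preference order might look something like this (a hypothetical sketch using the standard CLDF LanguageTable column names; in robust code the names would be resolved via the ontology as above):

def preferred_language_identifier(row):
    """Pick an identifier: Glottocode, then ISO 639-3 code, then the name."""
    return row.get("Glottocode") or row.get("ISO639P3code") or row.get("Name")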

lmaurits commented on September 6, 2024

What would the output of a hypothetical cldf_beastling plugin be, precisely?

The binarising thing is somewhat fraught with peril, at least where BEAST is concerned, due to the BEAST design decision to intermix ascertainment correction settings with "real data".

xrotwang commented on September 6, 2024

@lmaurits hm, ok. My thoughts on this are somewhat half-baked, I guess. cldf_beastling probably stands for "functionality that could either go into BEASTling or into pycldf - but we don't know yet". So at this point, it probably boils down to this: whenever CLDF-specific functionality is implemented in BEASTling, we should keep an eye on if/how it may be generalized to other CLDF data consumers.

Anaphory commented on September 6, 2024

> I've just made a commit to use glottocodes, ISO codes or human-readable names (in that order of preference) for language identifiers and names for features, when reading CLDF data.

Magic...

I have lects that share Glottolog IDs. Human-readable names might contain all sorts of characters (the comma is the first one that comes to mind) that don't play nicely with BEAST IDs.

lmaurits commented on September 6, 2024

Aaah, excellent point with the non-uniqueness of glottocodes. Hmm.

Do we have a single consistent sanitise_for_beast_id utility function anywhere? I'd say that we almost certainly should, and it should be used here (and presumably in many other places).
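
If not, something like this could serve (a minimal sketch; the exact character set BEAST tolerates may be wider, so this is deliberately conservative):

import re

def sanitise_for_beast_id(name):
    """Replace anything outside [A-Za-z0-9_-] so the result is safe as a BEAST XML id."""
    sanitised = re.sub(r"[^A-Za-z0-9_-]", "_", name)
    # Avoid empty IDs and IDs starting with a digit
    if not sanitised or sanitised[0].isdigit():
        sanitised = "_" + sanitised
    return sanitised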

lmaurits commented on September 6, 2024

I've used the ontology to make that recent commit a bit more robust. Sorry if I'm missing something and this is a silly question, but wouldn't a very nice and useful feature of a Python interface to CLDF datasets be to read the metadata.json file and then map all non-standard column names to the standard names, so that people can write neat code which will run on any arbitrary CLDF dataset?

xrotwang commented on September 6, 2024

@lmaurits well, the idea is that the term URIs are the standard column IDs, but maybe one could augment result dicts with local URI names as alternative keys?
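
As a hypothetical illustration of those alternative keys (the helper name is made up, and it assumes csvw-style columns exposing a propertyUrl attribute):

def iter_rows_with_term_keys(dataset, table):
    """Yield rows augmented with ontology term names as additional keys."""
    # Map local column names to the local part of their term URIs
    terms = {
        col.name: str(col.propertyUrl).split("#")[-1]
        for col in dataset[table].tableSchema.columns
        if col.propertyUrl
    }
    for row in dataset[table].iterdicts():
        row.update({term: row[local] for local, term in terms.items()})
        yield row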

xrotwang commented on September 6, 2024

@lmaurits but yes, having to construct a local lookup whenever iterating over CLDF tables seems a bit clumsy.

xrotwang commented on September 6, 2024

see cldf/pycldf#86

lmaurits commented on September 6, 2024

I think that the alternative keys would definitely improve usability; I'm glad that you are open to the idea!

I really hate to seem negative about it, because as you know I am strongly in favour of deprecating Nexus as the lingua franca for linguistic data, and in favour of CLDF in particular, but based on my limited experience pycldf overall does feel, indeed, a bit clumsy. It's possible I haven't properly found my way around yet, but I would have expected that, e.g., a utility script to transform a multi-file CLDF dataset with metadata into a single metadata-free file would be a short and elegant affair (with some obvious loss of information - I'm just talking about the bare-bones language/feature/value kind of files we used to use, which, obvious limitations aside, I think were very handy and eminently greppable). What I've written so far feels only marginally better than just using the standard library's csv module and looping over the individual files myself, using my own understanding of the foreign-key relationships between the tables.

If it's not outside what you consider the scope of the library, I think it would be positively lovely for Datasets to have a method returning a generator which yields (something like) tuples of (language, feature, value) datapoints, where the three tuple components were themselves namedtuples, with local URI members for IDs, names, glottocodes, etc. I realise there are different kinds of Datasets and I'm only vaguely familiar with two of them, and the three-part lang/feat/value conceptualisation of a datapoint may not apply universally. My point is there should be some Dataset-appropriate way to get at the datapoints themselves in a way which mirrors what the data conceptually is, not how it is structured in tables. Basically some kind of built-in sensible JOIN operation on the tables.
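
Concretely, I imagine something like this (a rough sketch for a StructureDataset; the helper names are invented, and the reference columns are resolved via the ontology as discussed above):

from collections import namedtuple

Datapoint = namedtuple("Datapoint", ["language", "feature", "value"])

def iter_datapoints(dataset):
    """Join ValueTable with LanguageTable and ParameterTable, yielding datapoints."""
    lang_id = dataset["LanguageTable", "id"].name
    feat_id = dataset["ParameterTable", "id"].name
    languages = {r[lang_id]: r for r in dataset["LanguageTable"].iterdicts()}
    features = {r[feat_id]: r for r in dataset["ParameterTable"].iterdicts()}
    lang_ref = dataset["ValueTable", "languageReference"].name
    feat_ref = dataset["ValueTable", "parameterReference"].name
    value = dataset["ValueTable", "value"].name
    for row in dataset["ValueTable"].iterdicts():
        yield Datapoint(languages[row[lang_ref]], features[row[feat_ref]], row[value])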

If this doesn't already exist and sounds non-awful to you and other CLDF-insiders, I'm very happy to actually try to write it, instead of just complaining.

lmaurits commented on September 6, 2024

> Magic...

If this was intended as a grumble about BEASTling making silent, invisible and non-overridable decisions about how to interpret input data, then fair enough and point taken! I am open to alternatives, but I do feel very strongly that the current behaviour (of using CLDF table row IDs as the identifiers of languages) can't be left as-is. BEASTling was designed from day one around very tight Glottolog integration, and lots of beautiful things "just work" when language IDs are Glottocodes or ISO codes. If users want to use something else for any reason they should always be free to, but I believe they should never have to do extra work to use Glottocodes or ISO codes when those options are available to BEASTling. When you hand BEASTling a CLDF dataset where every language has a unique Glottocode or ISO code, and do nothing else, you should be surprised, confused and angry when BEASTling chooses to ignore them.

xrotwang commented on September 6, 2024

I think I understand what you mean (and in fact, I haven't had enough experience with pycldf myself to get fully over the clumsiness feeling). From my experience with LingPy, though, I don't want us to repeat the mistake of adding all sorts of short-cuts and aliases right away. So I guess we have to suffer a bit more before the right API crystallizes :)

Regarding the greppable single-file representation of a CLDF dataset: I'm a bit on the fence regarding such functionality. It would certainly be possible to de-normalize a CLDF dataset into a single CSV file - and for most of our datasets this would even be a somewhat manageable file in terms of size and dimensions. OTOH I want to help people realize what normalization means, how it helps, and how to properly deal with normalized data (e.g. using csvjoin and csvgrep instead of just grep :) ).

Anyway, the way I think CLDF will eventually support a standard denormalization is via SQLite - i.e. via a standard conversion of a CLDF dataset to a SQLite db, and a standard VIEW in this db providing the denormalized dataset - which can then be trivially exported to CSV. This would at least overcome the scalability issue with flat-file denormalization.
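
As a hypothetical illustration of that VIEW (the table and column names are made up; the eventual pycldf SQLite schema may well differ):

import sqlite3

con = sqlite3.connect("dataset.sqlite")  # a hypothetical converted CLDF dataset
con.execute("""
    CREATE VIEW IF NOT EXISTS denormalised AS
    SELECT l.Name AS language, p.Name AS feature, v.Value AS value
    FROM ValueTable v
    JOIN LanguageTable l ON v.Language_ID = l.ID
    JOIN ParameterTable p ON v.Parameter_ID = p.ID
""")
# The view can then be exported to CSV, queried instead of grepped, etc.
for language, feature, value in con.execute("SELECT * FROM denormalised"):
    print(language, feature, value)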

xrotwang commented on September 6, 2024

@lmaurits this issue: cldf/pycldf#58 is the key for future pycldf development, I guess. My first attempts at SQLite conversion for CLDF got derailed by trying to add support for multiple datasets in a single SQLite db. While this is useful as well, as shown in pylexibank or pyclics, I think the functionality in pycldf should stick to just one dataset, but provide full round-tripping.

lmaurits commented on September 6, 2024

The SQLite plan sounds marvellous, and in fact very smart because it means with well-written queries it will be possible to get SQLite to do most of the work which would have gone into my native-Python implementation described above.

There's no great rush; as long as a convenient and standard denormalisation strategy is on the drawing board, I'm happy. The normalised format is great for reducing redundancy and ensuring consistency (and other things, I'm sure), but it's not what I want to work with when manipulating the data.

xrotwang commented on September 6, 2024

@lmaurits yes, that's the plan: Allowing people to share "well-written" queries for common data manipulation tasks.

Anaphory commented on September 6, 2024

Yes, that was exactly that grumble.

The thing is, what do we do with the IDs if they are glottocodes?

  • We use Glottolog geo-locations. That specific type of phylogeography is therefore permitted to look up glottocodes (trying, in that order: #glottocode; #languageReference with a valueUrl linking to Glottolog; #iso639P3code; #languageReference; fail - see the sketch below). This is signalled quite well: if I run phylogeography and my location data has gaps that could be filled by Glottocode, they will be filled according to that hierarchy; if I want to override this, I can specify the location explicitly as ?,?.
  • We use language groups, for morphology constraints and MRCA priors. Here, my use case would be served by defining abui1241-takal and abui1241-ulaga, which both have a glottocode abui1241 specified, as members of the language group abui1241, which should (under correct implementation) mean that a constraint that involves abui1241 will constrain Takalelang and Ulaga as intended and expected. This becomes a problem when someone has an additional ‘standard’ variety, which has both ID and Glottocode abui1241; this is not inconceivable, I think we had that at some point in LexiRumah.
  • Specifying ‘use all monophyly constraints from Glottolog’ is a thing; I think it's not a separate issue, but I'm not sure.

What other use cases did I forget for Glottocodes?
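
For reference, the fallback hierarchy from the first bullet might be sketched like this (a hypothetical helper; it assumes csvw-style columns with a valueUrl attribute, and uses a crude string test for "links to Glottolog"):

def glottolog_lookup_key(dataset, table, row):
    """Walk the fallback hierarchy described above; return None on failure."""
    def get(term):
        try:
            return row.get(dataset[table, term].name)
        except KeyError:
            return None

    if get("glottocode"):
        return get("glottocode")
    # A #languageReference whose valueUrl points to Glottolog counts like a glottocode
    try:
        ref_col = dataset[table, "languageReference"]
    except KeyError:
        ref_col = None
    if ref_col is not None and ref_col.valueUrl and "glottolog" in str(ref_col.valueUrl):
        if get("languageReference"):
            return get("languageReference")
    if get("iso639P3code"):
        return get("iso639P3code")
    # Last resort: a bare #languageReference
    return get("languageReference")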
