The more robust way is via the ontology, i.e. you'd look up the local column name by ontology term. This can be done using the local part of the term URI (https://cldf.clld.org/v1.0/terms.rdf#glottocode):
`glottocode_col_name = dataset["LanguageTable", "glottocode"].name`
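To make the mechanism concrete, here is a minimal, self-contained sketch of what that ontology-based lookup does under the hood: it resolves a dataset-local column name from the CLDF term URI recorded as `propertyUrl` in the table's CSVW metadata. The metadata dict below is a hypothetical, hand-written stand-in for the `LanguageTable` section of a real metadata.json file, not the pycldf API.

```python
TERM_BASE = "http://cldf.clld.org/v1.0/terms.rdf#"

def col_name_by_term(table_metadata, term):
    """Return the dataset-local column name mapped to a CLDF ontology term."""
    for col in table_metadata["tableSchema"]["columns"]:
        if col.get("propertyUrl") == TERM_BASE + term:
            return col["name"]
    raise KeyError(term)

# Hypothetical table description, as parsed from metadata.json;
# note the non-standard local name "glottolog_id".
language_table = {
    "tableSchema": {
        "columns": [
            {"name": "ID", "propertyUrl": TERM_BASE + "id"},
            {"name": "glottolog_id", "propertyUrl": TERM_BASE + "glottocode"},
        ]
    }
}

print(col_name_by_term(language_table, "glottocode"))  # -> glottolog_id
```

The point of going through the term URI is exactly this: code written against the ontology term keeps working even when a dataset calls the column `glottolog_id` instead of `Glottocode`.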
from beastling.
I would actually expect the CLDF format to mandate this?
It does not mandate that they exist, but it specifies how to recognize them when they are there.
ideally our XML "template" should be robust against people doing very strange things like naming one of their features identically to a language
Yes. It should become that way.
from beastling.
I'd really hope to get some functionality to "binarize" data into pycldf - or some sort of plugin system - too. So for the uralex data this would mean turning cognate sets into binary features, right? So far we only did a bit of brainstorming on this - whether to organize such functionality by CLDF module or rather by output format, like a `cldf_nexus` plugin. We will probably go with the latter - so maybe there should be a `cldf_beastling` plugin, too?
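For the uralex case, the core of such a binarization could look like the following sketch. The input format (flat `(language, concept, cognate_set)` judgement tuples) is a hypothetical simplification, not the pycldf CognateTable API: each cognate set becomes one binary feature, set to 1 for every language attesting a form in that set.

```python
from collections import defaultdict

def binarize_cognates(judgements):
    """Turn (language, concept, cognate_set) judgements into binary features.

    Each cognate set becomes one feature; a language gets 1 if any of its
    forms for the concept belongs to that set, 0 otherwise.
    """
    languages, features = set(), set()
    attested = defaultdict(set)  # language -> features it attests
    for lang, concept, cogset in judgements:
        feature = "{}-{}".format(concept, cogset)
        languages.add(lang)
        features.add(feature)
        attested[lang].add(feature)
    return {
        lang: {f: int(f in attested[lang]) for f in sorted(features)}
        for lang in sorted(languages)
    }

data = [("fin", "hand", "A"), ("hun", "hand", "A"), ("hun", "hand", "B")]
matrix = binarize_cognates(data)
print(matrix["fin"])  # {'hand-A': 1, 'hand-B': 0}
```

One design caveat this glosses over: a language with no form at all for a concept should arguably get `?` (missing) rather than `0` for that concept's features, and any real implementation also has to face the ascertainment-correction issue raised below.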
from beastling.
I've just made a commit to use glottocodes, ISO codes or human-readable names (in that order of preference) for language identifiers and names for features, when reading CLDF data. It works very nicely for Uralex (and I can now, e.g., specify calibrations using Glottolog names for subfamilies and have everything work), but I don't know how robust it is to variation in CLDF formatting. Is it safe to directly access some fields of the LanguageTable by name (e.g. "Glottocode") or should I be first fetching the appropriate name to use, as is done elsewhere for e.g. `language_column`?
from beastling.
What would the output of a hypothetical `cldf_beastling` plugin be, precisely?
The binarising thing is somewhat fraught with peril, at least where BEAST is concerned, due to the BEAST design decision to intermix ascertainment correction settings with "real data".
from beastling.
@lmaurits hm, ok. My thoughts on this are somewhat half-baked, I guess. `cldf_beastling` probably stands for "functionality that could either go into BEASTling or into pycldf - but we don't know yet". So at this point, it probably boils down to: whenever CLDF-specific functionality is implemented in BEASTling, we should keep an eye on whether/how it may be generalized to other CLDF data consumers.
from beastling.
I've just made a commit to use glottocodes, ISO codes or human-readable names (in that order of preference) for language identifiers and names for features, when reading CLDF data.
Magic...
I have lects that share Glottolog IDs. Human-readable names might contain all sorts of characters (`,` is the first one that comes to mind) that don't play nicely with BEAST IDs.
from beastling.
Aaah, excellent point with the non-uniqueness of glottocodes. Hmm.
Do we have a single consistent `sanitise_for_beast_id` utility function anywhere? I'd say that we almost certainly should, and it should be used here (and presumably in many other places).
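A minimal sketch of what such a utility might do, assuming BEAST IDs should be restricted to ASCII word characters and must not start with a digit (the function name and exact rules are hypothetical, not an existing BEASTling API):

```python
import re
import unicodedata

def sanitise_for_beast_id(name):
    """Turn an arbitrary label into a string safe to use as a BEAST XML ID.

    Strips accents via NFKD decomposition, replaces any run of characters
    outside [A-Za-z0-9_] with '_', and guards against IDs starting with a
    digit or ending up empty.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    safe = re.sub(r"[^A-Za-z0-9_]+", "_", ascii_name).strip("_")
    if not safe or safe[0].isdigit():
        safe = "id_" + safe
    return safe

print(sanitise_for_beast_id("Abui, Takalelang"))  # -> Abui_Takalelang
```

Note that sanitising alone does not guarantee uniqueness (two distinct names can collapse onto the same sanitised ID), so a real implementation would also need a collision-disambiguation step.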
from beastling.
I've used the ontology to make that recent commit a bit more robust. Sorry if I'm missing something and this is a silly question, but wouldn't a very nice and useful feature of a Python interface to CLDF datasets be to read the metadata.json file and then map all non-standard column names to the standard names, so that people can write neat code which will run on any arbitrary CLDF dataset?
from beastling.
@lmaurits well, the idea is that the term URIs are the standard column IDs, but maybe one could augment result `dict`s with local URI names as alternative keys?
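A sketch of that "alternative keys" idea: each row dict, keyed by the dataset's local column names, gets augmented with the local parts of the CLDF term URIs, so consumers can use standard names regardless of how the dataset names its columns. The `name_to_term` mapping would come from the table's CSVW metadata; here it is a hypothetical hand-written stand-in.

```python
def with_term_keys(row, name_to_term):
    """Add ontology-term aliases to a row dict without clobbering local keys."""
    augmented = dict(row)
    for local_name, term in name_to_term.items():
        if local_name in row and term not in augmented:
            augmented[term] = row[local_name]
    return augmented

# Hypothetical row and metadata-derived mapping:
row = {"ID": "abui1241-takal", "glottolog_id": "abui1241"}
name_to_term = {"ID": "id", "glottolog_id": "glottocode"}

print(with_term_keys(row, name_to_term)["glottocode"])  # -> abui1241
```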
from beastling.
@lmaurits but yes, having to construct a local lookup whenever iterating over CLDF tables seems a bit clumsy.
from beastling.
see cldf/pycldf#86
from beastling.
I think that the alternative keys would definitely improve usability, I'm glad that you are open to the idea!
I really hate to seem negative about it, because as you know I am strongly in favour of deprecating Nexus as the lingua franca for linguistic data, and of CLDF in particular, but based on my limited experience pycldf overall feels, indeed, a bit clumsy. It's possible I haven't properly found my way around yet, but I would have expected that, e.g. something like a utility script to transform a multi-file CLDF dataset with metadata into a single metadata-free file (with some obvious loss of information, I'm just talking about the bare-bones language, feature, value kind of files we used to use - obvious limitations aside, I think those files were very handy, and eminently greppable) would be a short and elegant affair. What I've written so far feels only marginally better than just using the standard library's `csv` module and looping over the individual files myself, using my own understanding of the foreign-key relationships between tables.
If it's not outside what you consider the scope of the library, I think it would be positively lovely for `Dataset`s to have a method returning a generator which yields (something like) tuples of language, feature, value datapoints, where the three tuple components were themselves `namedtuple`s, with local URI members for IDs, names, glottocodes, etc. I realise there are different kinds of Datasets and I'm only vaguely familiar with two of them, and the three-part lang/feat/value conceptualisation of a datapoint may not apply universally. My point is there should be some Dataset-appropriate way to get at the datapoints themselves in a way which mirrors what the data conceptually is, not how it is structured in tables. Basically some kind of built-in sensible `JOIN` operation on the tables.
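To make the proposal concrete, here is a sketch of such a generator. Plain lists of dicts stand in for the CLDF LanguageTable, ParameterTable and ValueTable; the column and field names are hypothetical simplifications, not the pycldf schema.

```python
from collections import namedtuple

Language = namedtuple("Language", ["id", "name", "glottocode"])
Parameter = namedtuple("Parameter", ["id", "name"])
Datapoint = namedtuple("Datapoint", ["language", "parameter", "value"])

def iter_datapoints(languages, parameters, values):
    """Yield one denormalized Datapoint per ValueTable row, resolving the
    foreign keys into the LanguageTable and ParameterTable (the "built-in
    sensible JOIN").
    """
    langs = {l["ID"]: Language(l["ID"], l["Name"], l["Glottocode"]) for l in languages}
    params = {p["ID"]: Parameter(p["ID"], p["Name"]) for p in parameters}
    for v in values:
        yield Datapoint(langs[v["Language_ID"]], params[v["Parameter_ID"]], v["Value"])

languages = [{"ID": "l1", "Name": "Abui", "Glottocode": "abui1241"}]
parameters = [{"ID": "p1", "Name": "hand"}]
values = [{"Language_ID": "l1", "Parameter_ID": "p1", "Value": "tang"}]

for dp in iter_datapoints(languages, parameters, values):
    print(dp.language.glottocode, dp.parameter.name, dp.value)
```

The consumer then works with `dp.language.glottocode` and friends, never touching the table layout directly.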
If this doesn't already exist and sounds non-awful to you and other CLDF-insiders, I'm very happy to actually try to write it, instead of just complaining.
from beastling.
Magic...
If this was intended as a grumble about BEASTling making silent and invisible and non-overridable decisions about how to interpret input data, then fair enough and point taken! I am open to alternatives, but I do feel very strongly that the current behaviour (of using CLDF table row IDs as the identifiers of languages) can't be left as-is. BEASTling was designed from day one around very tight Glottolog integration, and lots of beautiful things "just work" when language IDs are Glottocodes or ISO codes. If users want to use something else for any reason they should always be free to, but I believe they should never have to do extra work to use Glottocodes or ISO codes when those options are available to BEASTling. When you hand BEASTling a CLDF dataset where every language contains a unique Glottocode or ISO code, and do nothing else, you should be surprised, confused and angry when BEASTling chooses to ignore them.
from beastling.
I think I understand what you mean (and in fact, I haven't had enough experience with pycldf myself to get fully over the clumsiness feeling). From my experience with LingPy, though, I don't want us to repeat the mistake of adding all sorts of short-cuts and aliases right away. So I guess we have to suffer a bit more before the right API crystallizes :)
Regarding the greppable single-file representation of a CLDF dataset: I'm a bit on the fence regarding such functionality. It would certainly be possible to de-normalize a CLDF dataset into a single CSV file - and for most of our datasets this would even be a somewhat manageable file in terms of size and dimensions. OTOH I want to help people realize what normalization means, how it helps, and how to properly deal with normalized data (e.g. using `csvjoin` and `csvgrep` instead of just `grep` :) ).
Anyway, the way I think CLDF will eventually support a standard denormalization is via SQLite - i.e. via a standard conversion of a CLDF dataset to a SQLite db, and a standard `VIEW` in this db, providing the denormalized dataset - which can then be trivially exported to CSV. This would at least overcome the scalability issue of flat-file denormalization.
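A toy version of that plan, using the standard library's `sqlite3`: load the normalized tables into a db and expose a denormalized `VIEW` joining values to languages and parameters. The table and column names here are hypothetical, not the eventual pycldf SQLite schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE LanguageTable (ID TEXT PRIMARY KEY, Name TEXT, Glottocode TEXT);
CREATE TABLE ParameterTable (ID TEXT PRIMARY KEY, Name TEXT);
CREATE TABLE ValueTable (
    ID TEXT PRIMARY KEY,
    Language_ID TEXT REFERENCES LanguageTable(ID),
    Parameter_ID TEXT REFERENCES ParameterTable(ID),
    Value TEXT);
-- The standard denormalized view: one row per datapoint.
CREATE VIEW denormalized AS
    SELECT l.Name AS language, l.Glottocode AS glottocode,
           p.Name AS parameter, v.Value AS value
    FROM ValueTable v
    JOIN LanguageTable l ON v.Language_ID = l.ID
    JOIN ParameterTable p ON v.Parameter_ID = p.ID;
""")
db.execute("INSERT INTO LanguageTable VALUES ('l1', 'Abui', 'abui1241')")
db.execute("INSERT INTO ParameterTable VALUES ('p1', 'hand')")
db.execute("INSERT INTO ValueTable VALUES ('v1', 'l1', 'p1', 'tang')")

print(db.execute("SELECT * FROM denormalized").fetchall())
```

Exporting the view to CSV is then a one-liner with any SQLite client, and the normalized tables remain the single source of truth.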
from beastling.
@lmaurits this issue: cldf/pycldf#58 is the key for future `pycldf` development, I guess. My first attempts at SQLite conversion for CLDF got derailed by trying to add in support for multiple datasets in a single SQLite db. While this is useful as well, as shown in `pylexibank` or `pyclics`, I think the functionality in `pycldf` should stick to just one dataset, but provide full round-tripping.
from beastling.
The SQLite plan sounds marvellous, and in fact very smart because it means with well-written queries it will be possible to get SQLite to do most of the work which would have gone into my native-Python implementation described above.
There's no great rush, as long as a convenient and standard denormalisation strategy is on the drawing board, I'm happy. The normalised format is great for reducing redundancy and ensuring consistency (and other things, I'm sure), but it's not what I want to work with when manipulating the data.
from beastling.
@lmaurits yes, that's the plan: Allowing people to share "well-written" queries for common data manipulation tasks.
from beastling.
Yes, that was exactly that grumble.
The thing is, what do we do with the IDs if they are glottocodes?
- We use Glottolog geo-locations. That specific type of phylogeography is therefore permitted to look up glottocodes (try, in that order: #glottocode; #languageReference with ValueURL linking to Glottolog; #iso639P3code; #languageReference; fail). This is signalled quite well: if I run phylogeography, and my location data has gaps that could be filled by Glottocode, they will be, according to that hierarchy; if I want to override this, I can specify the location explicitly as `?,?`.
- We use language groups, for monophyly constraints and MRCA priors. Here, my use case would be served by defining `abui1241-takal` and `abui1241-ulaga`, which both have the glottocode `abui1241` specified, as members of the language group `abui1241`, which should (under a correct implementation) mean that a constraint that involves `abui1241` will constrain Takalelang and Ulaga as intended and expected. This becomes a problem when someone has an additional 'standard' variety, which has both ID and Glottocode `abui1241`; this is not inconceivable, I think we had that at some point in LexiRumah.
- Specifying 'Use all monophyly constraints from GL' is a thing; I think it's not a separate issue, but I'm not sure.
What other use cases did I forget for Glottocodes?
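The language-group use case above can be sketched as follows: build implicit groups from shared glottocodes, so that a constraint on `abui1241` covers every lect carrying that glottocode, whatever its dataset-local ID. This is an illustration of the idea, not BEASTling's actual implementation.

```python
from collections import defaultdict

def groups_by_glottocode(languages):
    """Group dataset-local language IDs by their shared Glottocode."""
    groups = defaultdict(set)
    for lang in languages:
        if lang.get("Glottocode"):
            groups[lang["Glottocode"]].add(lang["ID"])
    return dict(groups)

# Two Abui lects sharing one glottocode, as in the LexiRumah-style case:
languages = [
    {"ID": "abui1241-takal", "Glottocode": "abui1241"},
    {"ID": "abui1241-ulaga", "Glottocode": "abui1241"},
]

print(groups_by_glottocode(languages))
# {'abui1241': {'abui1241-takal', 'abui1241-ulaga'}}
```

The clash described above is then visible directly: if a third 'standard' variety has `abui1241` as its own ID, the group name and a lect ID collide and the implementation must decide which one a constraint refers to.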
from beastling.