moli-mandala / data



Raw data files and conversion scripts for CLDF output for Jambu data.

Python 69.59% TeX 17.24% Jupyter Notebook 12.88% Makefile 0.30%

cldf historical-linguistics south-asian-language

data's Introduction

Point Theme

Point is a Jekyll theme for personal websites that are simple and to the point.

Preview: point-theme.netlify.app

The theme is fully responsive, so it looks good and works on devices of all sizes. All pages are written in Markdown for ease of editing and writing.

To use Point, fork this repo and make your own changes. Be sure to customize the _config.yml file, and you can also change colors and fonts in styles/styles.scss. Have fun!

This theme uses the MIT license.

To report a bug or request a feature, please create an issue.

data's People

Contributors

Stargazers

Watchers

Forkers

adamfarris sure-nda

data's Issues

double counted lemmata

In CDIAL, some dialectal lemmata are double counted under their parent lect:

Sindhi: Kutchi
Kashmiri: Dodi, Poguli
Romani: all Romani + Domari lects

It's hard to tell when this signifies "both standard form and this dialect have this lemma" and when it signifies "only this dialect has this lemma". Will need a heuristic solution to eliminate duplication.
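As a starting point for such a heuristic, the sketch below only flags candidates and does not delete anything: it reports forms that appear under both a dialect and its parent lect within the same entry. The `(entry_id, lect, form)` row shape, the `PARENT_OF` mapping, and the placeholder forms are assumptions made for illustration, not the repository's actual schema or data. Romani and Domari lects are left out because the issue does not name them individually.

```python
from collections import defaultdict

# Hypothetical dialect -> parent lect mapping, based on the list in this issue.
PARENT_OF = {
    "Kutchi": "Sindhi",
    "Dodi": "Kashmiri",
    "Poguli": "Kashmiri",
}


def flag_double_counts(reflexes):
    """Return sorted (entry_id, parent, dialect, form) tuples for forms that
    are listed under both a dialect and its parent lect in the same entry."""
    forms_by_key = defaultdict(set)
    for entry_id, lect, form in reflexes:
        forms_by_key[(entry_id, lect)].add(form)

    flagged = []
    for (entry_id, lect), forms in list(forms_by_key.items()):
        parent = PARENT_OF.get(lect)
        if parent is None:
            continue  # not a dialect we know to be double counted
        parent_forms = forms_by_key.get((entry_id, parent), set())
        for form in forms & parent_forms:
            flagged.append((entry_id, parent, lect, form))
    return sorted(flagged)


# Placeholder rows; the forms are dummies, not real CDIAL data.
sample = [
    (1, "Sindhi", "form_a"),
    (1, "Kutchi", "form_a"),
    (1, "Kashmiri", "form_b"),
    (1, "Poguli", "form_c"),
    (2, "Kashmiri", "form_d"),
    (2, "Dodi", "form_d"),
]
candidates = flag_double_counts(sample)
```

Flagged rows would still need a decision on which copy to keep, since identical forms alone do not say whether the standard lect really has the lemma.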

CDIAL parse script is broken

It seems all glosses are missing, probably due to a change in quote formatting on the DSAL website. Best to refactor the parse script entirely, i.e. skip the JSON stage and parse (almost) straight to CLDF.

Strange categorization of Punjabi varieties

I don't know how faithful you intend to be to the sources, but the current language names and categories treat only eastern varieties of Punjabi as part of the language and place the rest in nebulous categories. For example, there are separate "Pahari-Pothohari" and "Lahnda: Rawalpindi" categories, even though Pothohari is the name of the dialect of Rawalpindi, and Lehndi (not "Lahnda") dialects like Pothohari are Punjabi dialects.

Turner's dictionary divisions into "L." and "P." are not very useful in practice, as they come from a misunderstanding of his sources. For example, everything from Bhai Maya Singh's dictionary was labeled "P.", even though that dictionary includes quite a lot of dialectal western Punjabi words. Few English-speaking writers seem to be aware that quite a lot of Gurmukhi literature published in India has been written by speakers of Pothohari rather than of eastern dialects (such as Duni Chandar's grammar).

It is also not clear how the geographic information is meant to be interpreted for these varieties. Due to the canal colony migrations in the 19th century, for example, the dialect of Jalandhar became the dialect of Lyallpur/Faisalabad, and what is now spoken in Jalandhar is closer to the dialect of Sialkot, etc.

errors found by Suresh

https://neojambu.herokuapp.com/reflexes/1-23267 missing prefix (parse mistake--check if this happens in other instances)

IPA

CDIAL 14190 onwards are mis-parsed

The reflexes are not being parsed due to a lack of paragraph break after the headword line.

Feature requests from Rob

Hi! Old Jambu being down for maintenance reminded me I wanted to review neojambu for you! I've actually used it very little because some of the features I need for work aren't implemented (yet?). (1) The main thing I'm missing is the ability to sort a list alphabetically (by either "entry" or "reflex" form), because I spend a lot of time going through whole lists rather than doing targeted searches. (2) Also, when I'm looking at a language, for instance Khowar, the entries aren't really alphabetically sorted, but kind of random, starting at ma- and then leaping to a- then other letters. If I look for "entries", these are alphabetically sorted, but "reflexes" are also kind of random. (3) I do like being able to search the glosses of both "entries" and "reflexes".
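For requests (1) and (2), one possible approach is a sort key that ignores diacritics and leading reconstruction or affix markers, so that forms group by their base letters. This is only a sketch: `sort_key` is a hypothetical helper, not part of neojambu, and plain Latin alphabetical order is an assumption (a scholarly Indic alphabetical order would need a custom collation).

```python
import unicodedata


def sort_key(form):
    """Sort key that strips leading '*' and '-' markers, removes combining
    diacritics, and ignores case; the original form breaks ties."""
    stripped = form.lstrip("*-")
    decomposed = unicodedata.normalize("NFD", stripped)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), form)


# Placeholder forms, for illustration only.
forms = ["māl", "ak", "*bar", "-an"]
ordered = sorted(forms, key=sort_key)
```

The same key could be precomputed as a database column so that list views can be ordered server-side.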

DEDR parse script

There are a lot of inconsistencies in the way Suresh parsed DEDR compared to the data format in Jambu (e.g., each lemma should have one row of its own). For maintainability it's probably better to write our own parse script for DEDR.

Transcription errors

<at>l</at> outside of italics in words for /ɫ/
single/double letters randomly showing up (e.g. /kk/ in Assamese)

Palula transcription standardisation

add Vaagri Boli

https://www.tamildigitallibrary.in/book-detail?id=jZY9lup2kZl6TuXGlZQdjZt2lJU1&tag=Vaagri%20boli#book1/

Compounds

How to handle compounds in lexical entries? CDIAL double lists them. There is definitely a way to do this in CLDF, check how Dictionaria handles it for the Palula lexicon. (Maybe handle this in the preprocessing step of generating the database? Double list compounds in the database then?)

Lexical entry metadata

Add a column for grammatical metadata in the lexicon (e.g. gender, part of speech)
Parse this information from CDIAL while keeping other comments in the Description column
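A minimal sketch of the second step, assuming the grammatical markers appear as dotted abbreviations (such as "m.", "f.", "n.", "adj.") at the start of the comment string. The tag list, the `split_grammar` name, and the input format are assumptions for illustration; the real CDIAL comments may need a richer pattern.

```python
import re

# Hypothetical set of abbreviations to treat as grammatical metadata.
GRAMMAR_TAGS = ("adj", "adv", "m", "f", "n")
_TAG_RE = re.compile(r"\s*\b(" + "|".join(GRAMMAR_TAGS) + r")\.")


def split_grammar(description):
    """Peel leading grammatical abbreviations off a comment string.

    Returns (tags, rest), where rest is what would stay in Description."""
    tags = []
    pos = 0
    while True:
        match = _TAG_RE.match(description, pos)
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    return tags, description[pos:].strip()


# Placeholder input, not a real CDIAL entry.
tags, rest = split_grammar("m. f. 'example gloss'")
```

Comments without a leading marker come back unchanged with an empty tag list, so the function can be applied to every row.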

add refs given in CDIAL

e.g. Buddruss