
moli-mandala / data


Raw data files and conversion scripts for CLDF output for Jambu data.

Python 69.59% TeX 17.24% Jupyter Notebook 12.88% Makefile 0.30%
cldf historical-linguistics south-asian-language

data's Introduction

Point Theme

Point is a Jekyll theme for personal websites that are simple and to the point.

Preview: point-theme.netlify.app

The theme is fully responsive, so it looks good and works on devices of all sizes. All pages are written in Markdown for ease of editing and writing.

To use Point, fork this repo and make your own changes. Be sure to customize the _config.yml file, and you can also change colors and fonts in styles/styles.scss. Have fun!

This theme uses the MIT license.

To report a bug or request a feature, please create an issue.


data's People

Contributors

aryamanarora


data's Issues

double counted lemmata

In CDIAL, some dialectal lemmata are double counted under their parent lect:

  • Sindhi: Kutchi
  • Kashmiri: Dodi, Poguli
  • Romani: all Romani + Domari lects

It's hard to tell when this signifies "both standard form and this dialect have this lemma" and when it signifies "only this dialect has this lemma". Will need a heuristic solution to eliminate duplication.
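As a starting point for such a heuristic, the sketch below only flags candidate duplicates: forms that appear under both a dialect and its parent lect within the same entry. The row layout `(entry_id, lect, form)`, the `PARENT` mapping (which covers only some of the cases listed above), and the sample forms are all assumptions made for illustration, not the actual Jambu schema or data.

```python
from collections import defaultdict

# Hypothetical dialect -> parent lect mapping, taken from the cases listed above.
# The Romani and Domari lects would need to be added in the same way.
PARENT = {"Kutchi": "Sindhi", "Dodi": "Kashmiri", "Poguli": "Kashmiri"}


def flag_double_counts(rows):
    """Return (entry_id, form, parent, dialect) tuples for forms that are
    listed under both a dialect and its parent lect in the same entry."""
    lects_by_form = defaultdict(set)  # (entry_id, form) -> set of lect names
    for entry_id, lect, form in rows:
        lects_by_form[(entry_id, form)].add(lect)

    flagged = []
    for (entry_id, form), lects in sorted(lects_by_form.items()):
        for dialect, parent in sorted(PARENT.items()):
            if dialect in lects and parent in lects:
                flagged.append((entry_id, form, parent, dialect))
    return flagged


# Made-up placeholder rows, not real CDIAL data.
rows = [
    ("e1", "Sindhi", "aaa"),
    ("e1", "Kutchi", "aaa"),
    ("e1", "Kutchi", "bbb"),
    ("e2", "Kashmiri", "ccc"),
]
print(flag_double_counts(rows))
```

This does not decide whether the standard form also has the lemma; it only narrows down the rows that need a decision.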

CDIAL parse script is broken

It seems all glosses are missing, probably due to a change in quote formatting on the DSAL website. It would be best to refactor the parse script entirely, i.e. skip the JSON stage and parse (almost) straight to CLDF.
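If the cause really is a change in quote characters, one way to make gloss extraction less brittle is to accept both straight and curly single quotes. The sketch below assumes glosses are wrapped in single quotes of either kind; that is a guess about the DSAL markup, and `extract_glosses` is a hypothetical helper, not part of the existing script.

```python
import re

# Assumed delimiters: straight apostrophe plus the curly opening/closing quotes.
OPEN = "'\u2018"
CLOSE = "'\u2019"
GLOSS_RE = re.compile("[" + OPEN + "]([^" + OPEN + CLOSE + "]+)[" + CLOSE + "]")


def extract_glosses(text):
    """Return every quoted span in text, whichever quote style is used."""
    return [gloss.strip() for gloss in GLOSS_RE.findall(text)]


# Made-up sample line, not real DSAL output.
print(extract_glosses("x \u2018water\u2019, y 'fire'"))
```

A gloss that itself contains an apostrophe would be cut short by this pattern, so it would need refinement before use on the full dictionary.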

Strange categorization of Punjabi varieties

I don't know how faithful you intend to be to the sources, but the language names and categories as they stand treat only the eastern varieties of Punjabi as part of the language and place the rest in nebulous categories. For example, there are separate "Pahari-Pothohari" and "Lahnda: Rawalpindi" categories even though Pothohari is the name of the dialect of Rawalpindi, and Lehndi (not "Lahnda") dialects like Pothohari are Punjabi dialects.

Turner's dictionary divisions into "L." and "P." are not very useful in practice, as they come from a misunderstanding of his sources. For example, everything from Bhai Maya Singh's dictionary was labeled "P." even though that dictionary includes quite a lot of dialectal western Punjabi words. Few English-speaking writers seem to be aware that quite a lot of Gurmukhi literature published in India (such as Duni Chandar's grammar) has been written by speakers of Pothohari and not of eastern dialects.

It is also not clear how the geographic information presented is meant to be interpreted with respect to these varieties. For example, due to the canal colony migrations in the 19th century, the dialect of Jalandhar became the dialect of Lyallpur/Faisalabad, and what is now spoken in Jalandhar is closer to the dialect of Sialkot, etc.

Feature requests from Rob

Hi! Old Jambu being down for maintenance reminded me I wanted to review neojambu for you! I've actually used it very little because some of the features I need for work aren't implemented (yet?). (1) The main thing I'm missing is the ability to sort a list alphabetically (by either "entry" or "reflex" form), because I spend a lot of time going through whole lists rather than doing targeted searches. (2) Also, when I'm looking at a language, for instance Khowar, the entries aren't really alphabetically sorted, but kind of random, starting at ma- and then leaping to a- then other letters. If I look for "entries", these are alphabetically sorted, but "reflexes" are also kind of random. (3) I do like being able to search the glosses of both "entries" and "reflexes".
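For requests (1) and (2), a diacritic-insensitive sort key might be enough to get a predictable alphabetical order. The sketch below strips combining marks before comparing, so that a form with a macron or underdot sorts next to its base letter. The function name and sample forms are made up, and the resulting order is plain Unicode order on base letters, not any particular Indological collation.

```python
import unicodedata


def sort_key(form):
    """Sort on the base letters, ignoring case and combining diacritics.
    The original form is kept as a tie-breaker so the order is stable."""
    decomposed = unicodedata.normalize("NFD", form)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), form)


# Made-up placeholder forms, not real lexical data.
forms = ["ma", "\u0101p", "a", "ba"]
print(sorted(forms, key=sort_key))
```

The same key could be applied to either the "entry" or the "reflex" column, depending on which list is being displayed.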

DEDR parse script

There are a lot of inconsistencies in the way Suresh parsed DEDR compared to the data format in Jambu (e.g., each lemma should have one row of its own). For maintainability it's probably better to write our own parse script for DEDR.
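To illustrate the "one row per lemma" target, here is a minimal sketch that splits a record holding several forms into separate rows. The field names (`entry`, `language`, `forms`) and the comma separator are assumptions for illustration; the actual DEDR source format would dictate the real splitting rules.

```python
def explode_lemmata(records):
    """Turn each record with several comma-separated forms into one row per form."""
    rows = []
    for record in records:
        for form in record["forms"].split(","):
            form = form.strip()
            if form:  # skip empty pieces left by stray commas
                rows.append(
                    {
                        "entry": record["entry"],
                        "language": record["language"],
                        "form": form,
                    }
                )
    return rows


# Made-up placeholder record, not real DEDR data.
records = [{"entry": "1", "language": "X", "forms": "aaa, bbb,"}]
print(explode_lemmata(records))
```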

Transcription errors

  • <at>l</at> outside of italics in words for /ɫ/
  • single/double letters randomly showing up (e.g. /kk/ in Assamese)

Compounds

How to handle compounds in lexical entries? CDIAL double-lists them. There is definitely a way to do this in CLDF; check how Dictionaria handles it for the Palula lexicon. (Maybe handle this in the preprocessing step of generating the database? Double-list compounds in the database then?)

Lexical entry metadata

  • Add a column for grammatical metadata in the lexicon (e.g. gender, part of speech)
  • Parse this information from CDIAL while keeping other comments in the Description column
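The two points above could be sketched as a single pass that pulls grammatical abbreviations out of a free-text description and leaves the remaining comments intact. The abbreviation list in `GRAMMAR_RE` is an assumption and would need to be checked against the abbreviations CDIAL actually uses; `split_grammar` is a hypothetical helper.

```python
import re

# Assumed set of grammatical abbreviations; extend after checking the source.
GRAMMAR_RE = re.compile(r"\b(m|f|n|adj|adv|vb)\.(?=\s|$)")


def split_grammar(description):
    """Return (tags, rest): the grammatical abbreviations found, and the
    description with those abbreviations removed and whitespace tidied."""
    tags = GRAMMAR_RE.findall(description)
    rest = " ".join(GRAMMAR_RE.sub("", description).split())
    return tags, rest


# Made-up sample description, not a real CDIAL comment.
print(split_grammar("m. 'water', dial. form"))
```

The `tags` list would feed the new grammatical metadata column, while `rest` stays in the Description column.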
