langdeath's People

Contributors

pajkossy, zseder, juditacs, makrai, katerpaul, takdavid, kornai

langdeath's Issues

macro-individual language features

Classification errors on the seed data mainly concern macro languages (or languages contained in a macro language).
This may be because Wikipedia features are not added consistently to the macro language and to the largest individual language in its group.
Maybe we could use extra features such as 'wp_macro...' and 'wp_individual...'.

Macro languages

@kornai How to deal with languages that are macro-languages?
There is a mapping here:
http://www-01.sil.org/iso639-3/macrolanguages.asp
However, even though son is supposed to be a collective/macro language:
http://www-01.sil.org/iso639-3/documentation.asp?id=son
it is not on this list.

  1. What should we do with macro-languages that are not in this mapping?
  2. Given a proper mapping, what should we do with crúbadán (or other source) data that is assigned to a macro-language?
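
For question 1, one option is to flag codes that the mapping does not cover so they can be reviewed by hand. A minimal sketch, assuming the mapping has already been read into (macro code, individual code) pairs; the function names are hypothetical and not part of langdeath:

```python
from collections import defaultdict


def build_macro_index(pairs):
    """Build lookups from (macro_code, individual_code) pairs."""
    macro_of = {}
    members = defaultdict(set)
    for macro, individual in pairs:
        macro_of[individual] = macro
        members[macro].add(individual)
    return macro_of, dict(members)


def classify(code, macro_of, members):
    """Return 'macro', 'member', or 'unmapped' for a code."""
    if code in members:
        return "macro"
    if code in macro_of:
        return "member"
    # Not covered by the mapping at all: needs a manual decision.
    return "unmapped"


# Illustrative pairs only; the real input would be the full SIL mapping.
macro_of, members = build_macro_index([("nor", "nob"), ("nor", "nno")])
print(classify("nor", macro_of, members))
print(classify("son", macro_of, members))
```

With such an index, a code like son would come out as 'unmapped' instead of being silently treated as an individual language.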

Champion management

  • Languages that have champions or are champions shouldn't be moribund.
  • When languages have enough data, we can discard champions from training; when they don't, maybe we should merge the data.

@pajkossy @kornai @zseder

crúbadán gave new sils, that weren't in the database before

set([u'son', u'tzs', u'art', u'tzt', u'hva', u'tzz', u'phi', u'cbm', u'gsc', u'tgg', u'prv', u'tze', u'dmn', u'nhs', u'btb', u'tzu', u'apa', u'pob', u'lnc', u'dlc', u'ixi', u'gmo', u'blu', u'nah', u'nai', u'cai', u'tot', u'bnt', u'mvc', u'znd', u'auv', u'cnm', u'sio', u'wen', u'cke', u'daf', u'noo', u'hsf', u'qut', u'bih', u'oto', u'raj ', u'lms', u'mms', u'omq', u'pqw', u'quj', u'suh', u'acc', u'que ', u'ccx', u'jpx', u'eml', u'azc', u'cki', u'tzc', u'ckk', u'fiz', u'ckw', u'ccn', u'sum', u'mvj', u'mly', u'cti', u'agp'])
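
Two entries in the set above, u'raj ' and u'que ', carry a trailing space, which may be why they show up as new. A minimal sketch of normalizing codes before comparing them, assuming codes are plain strings (both function names are hypothetical):

```python
def normalize_sil(code):
    """Strip surrounding whitespace and lowercase a SIL code."""
    return code.strip().lower()


def unknown_sils(parsed_codes, known_codes):
    """Return normalized parsed codes that are missing from known_codes."""
    known = {normalize_sil(code) for code in known_codes}
    return {normalize_sil(code) for code in parsed_codes} - known
```

After normalization, only codes that are really absent from the database remain in the result.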

Alternate name matches

There are some heuristics that will find new languages

  • South - southern
  • Arabic (Egyptian) - Egyptian arabic
  • The name to be looked up contains an extra word (e.g. southern), but the db only has the name without it
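
The heuristics above could be sketched as a function that generates candidate spellings to try against the db. This is only an illustration: the table of direction words and the function name are assumptions, and real names will need more cases.

```python
import re

# Hypothetical table of direction words and their adjectival forms.
DIRECTION_VARIANTS = {
    "south": "southern",
    "north": "northern",
    "east": "eastern",
    "west": "western",
}


def name_variants(name):
    """Return a set of lowercased candidate spellings for a language name."""
    base = " ".join(name.lower().split())
    variants = {base}

    # "Arabic (Egyptian)" -> "egyptian arabic"
    match = re.fullmatch(r"(.+?)\s*\((.+)\)", base)
    if match:
        variants.add(f"{match.group(2)} {match.group(1)}")

    # "south x" <-> "southern x"
    for variant in list(variants):
        words = variant.split()
        for short, long_form in DIRECTION_VARIANTS.items():
            if short in words:
                variants.add(" ".join(long_form if w == short else w for w in words))
            if long_form in words:
                variants.add(" ".join(short if w == long_form else w for w in words))

    # The db may only have the name without the direction word.
    qualifiers = set(DIRECTION_VARIANTS) | set(DIRECTION_VARIANTS.values())
    for variant in list(variants):
        words = variant.split()
        rest = [w for w in words if w not in qualifiers]
        if rest and len(rest) < len(words):
            variants.add(" ".join(rest))
    return variants
```

A lookup would then try each variant in turn and accept a match only if it is unambiguous.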

dbpedia champions

There are a few cases where the champion value (from the dbpedia infobox dump) cannot be added because the champion's SIL code is retired (typically in favor of the language for which it is the champion):
bhk gio jar kdv kzh mgx mzf stc sum unp nln

(In the majority of cases the champion is a macrolanguage (45 langs) or an individual language (35 langs) and has a SIL code.)

altname keys

Some parsers yield dicts where the key for alternate names is something other than 'alt_names',
so these names don't get into the database
(e.g. ethnologueparser).
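
One way to handle this is to normalize parser output before it reaches the database. A minimal sketch; apart from 'alt_names', the key spellings listed here are guesses and would have to be replaced by the keys the parsers actually use:

```python
# Hypothetical spellings that different parsers might use for the same field.
ALT_NAME_KEYS = ("alt_names", "altname", "alt_name", "alternate_names", "other_names")


def normalize_alt_names(record):
    """Return a copy of record with all alternate names under 'alt_names'."""
    result = {k: v for k, v in record.items() if k not in ALT_NAME_KEYS}
    names = []
    for key in ALT_NAME_KEYS:
        value = record.get(key)
        if value is None:
            continue
        # Accept either a single string or an iterable of strings.
        values = [value] if isinstance(value, str) else list(value)
        for name in values:
            if name not in names:
                names.append(name)
    if names:
        result["alt_names"] = names
    return result
```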

windows supported languages

here:
https://blogs.windows.com/windowsexperience/2014/02/05/over-7000-languages-just-1-windows/
says:
"So by including the main 50 scripts, we can support 7,000 languages, which is enough to support text input for about 98% of people in the world for at least one of the languages they speak. ... Not all of the 7,000 supported languages appear in the list of languages when you’re looking to add one in PC settings or Control Panel. But you can find any supported language in Windows by searching for the name of the language or its IETF tag (Open the Search charm, search for “Add a language,” and then click Add a language)."
So it looks like there is no way to export this list.

Getting the list of language packs is easier:
http://windows.microsoft.com/en-us/windows/language-packs#lptabs=win10

@recski

ethnologue doesn't have data with sils

set([u'osx', u'nno', u'dum', u'ojp', u'owl', u'osp', u'uga', u'tkm', u'nwx', u'ldn', u'tkv', u'pkc', u'pka', u'xep', u'nwa', u'nwc', u'ptw', u'osc', u'och', u'xmn', u'nrp', u'xtr', u'xtq', u'xmk', u'xme', u'yyr', u'gmg', u'xtz', u'auo', u'xtg', u'pgn', u'gmy', u'xto', u'hmk', u'mod', u'pgl', u'xmr', u'xlo', u'qwm', u'lre', u'zkz', u'lcq', u'apv', u'xdm', u'sjk', u'tjm', u'xwo', u'nob', u'ihw', u'lzh', u'ndf', u'nwy', u'mxi', u'nov', u'qwt', u'okm', u'okl', u'oko', u'psu', u'crr', u'xss', u'cmm', u'xsv', u'xli', u'xln', u'ang', u'cmg', u'grz', u'xle', u'xhd', u'xlc', u'umc', u'elx', u'xpc', u'crb', u'emm', u'ule', u'xsa', u'xsd', u'xlu', u'cms', u'xso', u'gev', u'xlp', u'xgl', u'xld', u'xcu', u'aru', u'xgf', u'xlg', u'hlu', u'lfn', u'odt', u'gft', u'xga', u'arc', u'olt', u'cnx', u'imy', u'lng', u'rrt', u'xgr', u'tta', u'oui', u'htx', u'yug', u'zko', u'zkh', u'goh', u'zkk', u'ddr', u'xvs', u'zkg', u'fat', u'zkb', u'onw', u'bll', u'peo', u'zra', u'xvn', u'zkt', u'zku', u'zkv', u'neu', u'xve', u'sux', u'omc', u'zbl', u'sbv', u'xly', u'plq', u'esm', u'omk', u'got', u'xht', u'yms', u'omn', u'omr', u'omp', u'que', u'ecr', u'omx', u'puq', u'ecy', u'ymt', u'xdc', u'xls', u'bue', u'obm', u'xup', u'atc', u'ett', u'xpg', u'emy', u'xpi', u'gml', u'xwc', u'akk', u'etc', u'sjn', u'wlm', u'xng', u'igs', u'krb', u'xum', u'obt', u'tlh', u'obr', u'xrn', u'xch', u'xxb', u'spx', u'twi', u'sog', u'txh', u'txg', u'xil', u'sgs', u'txc', u'txb', u'iml', u'gmh', u'mkq', u'non', u'ynn', u'caj', u'sga', u'rmv', u'xyl', u'txr', u'zsk', u'xlb', u'pie', u'twc', u'xib', u'xqa', u'xpo', u'yrm', u'xpm', u'xap', u'xaq', u'spn', u'fro', u'ims', u'frm', u'xbm', u'jpa', u'xaj', u'xis', u'xiv', u'xsc', u'xps', u'xaa', u'xpp', u'xad', u'xae', u'xpu', u'xag', u'nrk', u'aqt', u'otk', u'nrn', u'xxt', u'nrc', u'xcn', u'otb', u'xzp', u'xhr', u'nkp', u'nci', u'chb', u'pnl', u'oos', u'xpr', u'oge', u'nom', u'xcm', u'xxm', u'nrt', u'xhc', u'xha', u'xur', u'xfa', u'inm', u'ysc', u'gnc', u'avk', u'xhu', u'mtm', 
u'ygs', u'dlm', u'sty', u'xvo', u'xcb', u'xcc', u'xgb', u'sxc', u'xcg', u'xce', u'vol', u'pro', u'sxk', u'sxl', u'xco', u'xcl', u'sxo', u'xcr', u'czk', u'xcv', u'xcw', u'xct', u'kho', u'oht', u'ohu', u'scx', u'xcy', u'oco', u'tzl', u'aqp', u'hit', u'xna', u'yol', u'mga', u'ido', u'pmh', u'xeb', u'enl', u'enm', u'ina', u'oav', u'oar', u'ota', u'enx', u'xrt', u'pmk', u'pal', u'xrr', u'pyx', u'xrm', u'pox', u'dws', u'phn', u'aes', u'ile', u'sqr', u'xpy', u'orv', u'qyp', u'lab', u'ltc', u'jbo', u'mzg', u'kaw', u'sqn', u'xzm', u'afh', u'qya', u'xzh', u'egy', u'bzt', u'oty', u'tpn', u'nei', u'ofo', u'tpw', u'ofs', u'xbo', u'xbn', u'xno', u'svx', u'ghc', u'nxm', u'mre', u'xbb', u'xqt', u'mjy', u'xbc', u'axm'])

n/a handling

Sometimes the average and sometimes 0 is the right replacement for a missing value; this has to be chosen column by column when exporting to tsv.
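
A minimal sketch of such a per-column choice, assuming rows are dicts with None for missing values; the column names in the usage below are made up for illustration:

```python
from statistics import mean


def fill_missing(rows, strategies):
    """Fill None values column by column.

    rows: list of dicts; strategies: column name -> "avg" or "zero".
    Columns without a strategy are left untouched.
    """
    fill_values = {}
    for column, strategy in strategies.items():
        if strategy == "zero":
            fill_values[column] = 0
        elif strategy == "avg":
            present = [row[column] for row in rows if row.get(column) is not None]
            # Fall back to 0 when the column has no values at all.
            fill_values[column] = mean(present) if present else 0
        else:
            raise ValueError("unknown strategy: %r" % strategy)

    filled = []
    for row in rows:
        new_row = dict(row)
        for column, value in fill_values.items():
            if new_row.get(column) is None:
                new_row[column] = value
        filled.append(new_row)
    return filled


rows = [
    {"speakers": 10, "wp_size": None},
    {"speakers": None, "wp_size": 4},
    {"speakers": 20, "wp_size": 6},
]
filled = fill_missing(rows, {"speakers": "avg", "wp_size": "zero"})
```

Keeping the strategy table in one place makes the column-by-column decisions explicit and easy to review.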

hunspell tsv

The number of columns in extern/ld/res/hunspell.tsv is arbitrary.
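
If the problem is that rows differ in length, a quick check like the following could locate the odd ones. This is a sketch under that assumption; it takes the most common column count as the expected one, and the function name is hypothetical.

```python
import csv
from collections import Counter


def column_count_report(lines, delimiter="\t"):
    """Return (most common column count, line numbers that deviate from it)."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    counts = Counter()
    widths = []
    for lineno, row in enumerate(reader, start=1):
        counts[len(row)] += 1
        widths.append((lineno, len(row)))
    if not counts:
        return None, []
    expected = counts.most_common(1)[0][0]
    odd = [lineno for lineno, width in widths if width != expected]
    return expected, odd
```

It accepts any iterable of lines, so it can be called with an open file handle for the tsv.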

save results

We should add abstract methods to OfflineParser and/or OnlineParser that support running the parser and saving the results with pickle, so that results can be loaded quickly when nothing is likely to have changed.
The dbpedia parsers already have a solution for this: parse_and_save(), read_results(), and parse(). Maybe these are okay, but a base class should implement them.
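
A rough sketch of what such a base class might look like. The method names parse(), parse_and_save() and read_results() come from the dbpedia parsers mentioned above, but their bodies, the class name, the cache_path argument and parse_or_load() are assumptions, not the existing langdeath code:

```python
import logging
import os
import pickle


class CachingParser:
    """Hypothetical base class; subclasses only implement parse()."""

    def __init__(self, cache_path):
        self.cache_path = cache_path

    def parse(self):
        """Yield parsed records; must be overridden."""
        raise NotImplementedError

    def parse_and_save(self):
        """Run the parser and pickle the results."""
        results = list(self.parse())
        with open(self.cache_path, "wb") as handle:
            pickle.dump(results, handle)
        return results

    def read_results(self):
        """Load previously pickled results."""
        with open(self.cache_path, "rb") as handle:
            return pickle.load(handle)

    def parse_or_load(self):
        """Use the pickle if it exists, otherwise parse and save."""
        if os.path.exists(self.cache_path):
            # Log explicitly so cached results are never used silently.
            logging.info("loading cached results from %s", self.cache_path)
            return self.read_results()
        return self.parse_and_save()
```

This sketch never invalidates the cache; deciding when the source has changed (for example by file modification time) is left open.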

@kornai @pajkossy @juditacs @makrai @zseder

LanguageArchive alternative names

there is a part of language description starting with
Other known names and dialect names:
we should parse this into a list of alternative names
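
A minimal sketch of that parsing step. It assumes the names follow the marker on the same line and are comma-separated, which has to be checked against the real LanguageArchive descriptions; the function name is hypothetical.

```python
MARKER = "Other known names and dialect names:"


def parse_other_names(description, separator=","):
    """Return the list of names following MARKER, or [] if it is absent."""
    start = description.find(MARKER)
    if start == -1:
        return []
    rest = description[start + len(MARKER):].strip()
    if not rest:
        return []
    # Assumption: the list ends at the first line break.
    first_line = rest.splitlines()[0]
    return [name.strip() for name in first_line.split(separator) if name.strip()]
```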

missing rows from Ethnologue tabular

These are the missing rows of individual, living languages.
(Most of the extinct and historic languages are also missing, as well as the macro languages.)
As a consequence now we have a 0 speaker count for all macro languages!
esy
fat
jat
jog
mzg
nno
nob
nrk
ptq
sgs
twi
yhs
yrm

endangered level info

The different sources give multiple endangerment levels; currently only the last one gets saved.
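
One way to avoid overwriting is to keep every (source, level) pair per language. A minimal sketch, assuming the parser can emit (sil, source, level) tuples; the function name is hypothetical:

```python
from collections import defaultdict


def collect_levels(records):
    """Group (sil, source, level) tuples into sil -> [(source, level), ...]."""
    levels = defaultdict(list)
    for sil, source, level in records:
        levels[sil].append((source, level))
    return dict(levels)
```

How to combine the collected levels into a single feature (for example the most severe one) is a separate decision.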

L2 speaker count

We don't have any up-to-date information; the tabular data from Ethnologue does not contain it either.

Crúbadán code mismatch

There are some rows in crúbadán where the code in the first column is "sxb" but the code in the sil column is "suh". "suh" is not a valid SIL code, whereas "sxb" is. Check whether this causes any problems.

Suspicious codes:
dhg
raj
rgn
sxb
tku
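
A check along these lines could list all such rows automatically. It is a sketch that assumes the rows are available as (first-column code, sil-column code) pairs and that a set of valid SIL codes is at hand; the function name is hypothetical.

```python
def find_code_mismatches(rows, valid_sils):
    """Return (code, sil, code_is_valid, sil_is_valid) for rows that disagree."""
    mismatches = []
    for code, sil in rows:
        code, sil = code.strip(), sil.strip()
        if code != sil:
            mismatches.append((code, sil, code in valid_sils, sil in valid_sils))
    return mismatches
```

Rows where only the first-column code is valid, as with sxb and suh, are candidates for using the first column instead of the sil column.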

endangered parser

info from subsections of 'Language information by source' to be parsed

pickle loading

baseparser silently loads pickles; it should log when it uses them, because the silent loading is confusing.

further language variants to add

wals
Arapesh (Abu) Abu' Arapesh
Tommo So Tommo So Dogon

indi
ᏣᎳᎩ Cherokee
Mich keteer Limburgish

wikipedia list of languages
Belarusian (Taraškievica) Belarusian
Norwegian (Bokmål)
Norwegian (Nynorsk)
Palatinate German Pfaelzisch
Simple English

wikipedia incubators
Fakauvea Wallisian
Ghomala Ghomálá
Guadeloupean Creole Guadeloupean Creole French
Jamaican Jamaican Creole English
Khinalugh language Khinalug
Kildin Sami language Kildin Sami
Réunion Creole Réunion Creole French
Rinconada Rinconada Bikol
Saint Lucian Creole Saint Lucian Creole French
Solomon Islands Pijin Pijin
Standard Moroccan Amazigh Standard Moroccan Tamazigh
Unami-Lenape Unami
Wastec Huastec
Yawi Pattani Malay
Ɓasaá Basaa
Western Baluchi Western Balochi
Southern Baluchi Southern Baluchi

software support language names
...
