The langdeath from kornai

macro-individual language features

Classification errors on the seed data mainly concern macro languages (or languages contained in a macro language).
This may be because the Wikipedia features are not assigned consistently between the macro language and the largest individual language in its group.
Maybe we could use separate 'wp_macro...' and 'wp_individual...' features.

Macro languages

@kornai How to deal with languages that are macro-languages?
There is a mapping here:
http://www-01.sil.org/iso639-3/macrolanguages.asp
but even though son is supposed to be a collective/macro language:
http://www-01.sil.org/iso639-3/documentation.asp?id=son
it is not on this list.

What to do with macro-languages that are not on this mapping?
With a proper mapping, what should we do with crúbadán (or other source) data that is assigned to a macro-language?


Champion management

Languages that have champions or are champions shouldn't be moribund.
When languages have enough data, we can discard champions from training; when they don't, maybe we should merge the data.

@pajkossy @kornai @zseder

WikipediaListOfLanguagesParser should not be trusted

unless it returns dicts with sil key

MS office languages

@makrai res/office_if_pack is the parsed version of http://office.microsoft.com/en-001/downloads/office-language-interface-pack-lip-downloads-HA001113350.aspx. It lists some languages but not, for example, Hungarian, although we are sure that Office is available in Hungarian. Could you check this?

ubuntu language support

https://translations.launchpad.net/ubuntu/vivid/+translations

click on 'View all languages'
It is not clear how this relates to the language packs, e.g.:

https://launchpad.net/ubuntu/wily/+source/language-pack-hi
(Downloading all of them is not trivial, since not all codes are of length 2.)

@recski

DBPedia parser should not be trusted

pip install

wiki/Usage:

pip install -U .
pip install -e .
...
Is this necessary?

@juditacs

maxent classification should be implemented in sklearn

instead of the maxent module

Wals.info parser gave new sils that we didn't integrate

Probably there is nothing to do with them

bvs
nlr
yuu
wiw
wit
sgl
nbf
poa
izi
jar
mwd
gio
yiy
leg
jai
dap
kzh
mof
unp
ckf
nbx
gbc
stc
baz
tzb
mhv

Someone, please check them

Notify: @kornai @juditacs @pajkossy @makrai

crúbadán gave new sils, that weren't in the database before

set([u'son', u'tzs', u'art', u'tzt', u'hva', u'tzz', u'phi', u'cbm', u'gsc', u'tgg', u'prv', u'tze', u'dmn', u'nhs', u'btb', u'tzu', u'apa', u'pob', u'lnc', u'dlc', u'ixi', u'gmo', u'blu', u'nah', u'nai', u'cai', u'tot', u'bnt', u'mvc', u'znd', u'auv', u'cnm', u'sio', u'wen', u'cke', u'daf', u'noo', u'hsf', u'qut', u'bih', u'oto', u'raj ', u'lms', u'mms', u'omq', u'pqw', u'quj', u'suh', u'acc', u'que ', u'ccx', u'jpx', u'eml', u'azc', u'cki', u'tzc', u'ckk', u'fiz', u'ckw', u'ccn', u'sum', u'mvj', u'mly', u'cti', u'agp'])

Alternate name matches

There are some heuristics that will find new languages

South - southern
Arabic (Egyptian) - Egyptian arabic
A third case: the name being looked up contains an extra word (e.g. 'southern'), but the db doesn't have that form, only the name by itself.
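A minimal sketch of the first two heuristics (South/Southern and the parenthesised qualifier), just to make the idea concrete. name_variants() and its helpers are hypothetical names, not existing langdeath functions; the candidates it returns would still have to be looked up in the db, presumably case-insensitively.

```python
import re

# Hypothetical helper, not part of the langdeath code base.
_PAREN = re.compile(r"^(?P<head>[^()]+?)\s*\((?P<qual>[^()]+)\)$")
_DIRECTIONS = {"south": "southern", "north": "northern",
               "east": "eastern", "west": "western"}


def _swap_direction(word):
    """Turn 'South' into 'Southern', keeping the capitalisation."""
    replacement = _DIRECTIONS.get(word.lower())
    if replacement is None:
        return word
    return replacement.capitalize() if word[0].isupper() else replacement


def name_variants(name):
    """Return candidate spellings of a language name, original first."""
    name = name.strip()
    variants = [name]
    match = _PAREN.match(name)
    if match:
        # "Arabic (Egyptian)" -> "Egyptian Arabic"
        variants.append("{} {}".format(match.group("qual").strip(),
                                       match.group("head")))
    for variant in list(variants):
        # "South X" -> "Southern X"
        swapped = " ".join(_swap_direction(w) for w in variant.split())
        if swapped not in variants:
            variants.append(swapped)
    return variants
```

The original name always comes first, so an exact match still wins over a heuristic one.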

dbpedia champions

There are a few cases where the champion value (from the dbpedia infobox dump) cannot be added because the champion's SIL code is retired (typically in favour of the languages for which it is the champion):
bhk gio jar kdv kzh mgx mzf stc sum unp nln

(In the majority of cases the champion is a macrolanguage (45 langs) or an individual language (35 langs) and has a SIL code.)

altname keys

Some parsers yield dicts where the key for alternate names is something other than 'alt_names',
so these names don't get into the database
(e.g. ethnologueparser).
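One possible fix is to normalise the key in a single place before the dicts reach the database. A rough sketch: only 'alt_names' comes from the issue above; the other aliases in ALT_NAME_KEYS are guesses and would have to be replaced by the keys the parsers really emit.

```python
# Hypothetical list of aliases; only "alt_names" is the key the db expects.
ALT_NAME_KEYS = ("alt_names", "altnames", "alternative_names", "other_names")


def normalize_alt_names(record):
    """Return a copy of a parser dict with all alternate names under 'alt_names'."""
    normalized = {k: v for k, v in record.items() if k not in ALT_NAME_KEYS}
    names = []
    for key in ALT_NAME_KEYS:
        value = record.get(key)
        if value is None:
            continue
        if isinstance(value, str):
            # a single name given as a plain string
            value = [value]
        for name in value:
            if name not in names:
                names.append(name)
    if names:
        normalized["alt_names"] = names
    return normalized
```

For example, normalize_alt_names({"sil": "alt", "other_names": ["Altai"]}) gives {"sil": "alt", "alt_names": ["Altai"]}.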

run autocorpus on new wikipedia dumps + on incubator dumps

Parsers to create

http://glottolog.org/glottolog/language (the walsinfo parser will surely work well for it)
http://dobes.mpi.nl/
http://www.verbix.com/maps/where-do-they-speak/

windows supported languages

This blog post:
https://blogs.windows.com/windowsexperience/2014/02/05/over-7000-languages-just-1-windows/
says:
"So by including the main 50 scripts, we can support 7,000 languages, which is enough to support text input for about 98% of people in the world for at least one of the languages they speak. ... Not all of the 7,000 supported languages appear in the list of languages when you’re looking to add one in PC settings or Control Panel. But you can find any supported language in Windows by searching for the name of the language or its IETF tag (Open the Search charm, search for “Add a language,” and then click Add a language)."
So it looks like there is no way to export this list.

getting the language pack is easier:
http://windows.microsoft.com/en-us/windows/language-packs#lptabs=win10

@recski

GoogleTranslateParser

parsing error; apparently the Wikipedia site has changed

Software/OS language errors

I have renamed the shared doc from 'omni errors' to 'langdeath errors'; the tabs are for the different sources:
https://docs.google.com/spreadsheets/d/1nrHvRF1l0yyXsqbOvcoCjqCLxwb1b07Kd4hUrkzbwxo/edit#gid=2079801224
Languages coming from OS sources have 64 errors which we should fix.
Please find solutions that are easily automated.

@juditacs @kornai @makrai @pajkossy

LanguageArchive integration

to have alternative names

ethnologue doesn't have data with sils

set([u'osx', u'nno', u'dum', u'ojp', u'owl', u'osp', u'uga', u'tkm', u'nwx', u'ldn', u'tkv', u'pkc', u'pka', u'xep', u'nwa', u'nwc', u'ptw', u'osc', u'och', u'xmn', u'nrp', u'xtr', u'xtq', u'xmk', u'xme', u'yyr', u'gmg', u'xtz', u'auo', u'xtg', u'pgn', u'gmy', u'xto', u'hmk', u'mod', u'pgl', u'xmr', u'xlo', u'qwm', u'lre', u'zkz', u'lcq', u'apv', u'xdm', u'sjk', u'tjm', u'xwo', u'nob', u'ihw', u'lzh', u'ndf', u'nwy', u'mxi', u'nov', u'qwt', u'okm', u'okl', u'oko', u'psu', u'crr', u'xss', u'cmm', u'xsv', u'xli', u'xln', u'ang', u'cmg', u'grz', u'xle', u'xhd', u'xlc', u'umc', u'elx', u'xpc', u'crb', u'emm', u'ule', u'xsa', u'xsd', u'xlu', u'cms', u'xso', u'gev', u'xlp', u'xgl', u'xld', u'xcu', u'aru', u'xgf', u'xlg', u'hlu', u'lfn', u'odt', u'gft', u'xga', u'arc', u'olt', u'cnx', u'imy', u'lng', u'rrt', u'xgr', u'tta', u'oui', u'htx', u'yug', u'zko', u'zkh', u'goh', u'zkk', u'ddr', u'xvs', u'zkg', u'fat', u'zkb', u'onw', u'bll', u'peo', u'zra', u'xvn', u'zkt', u'zku', u'zkv', u'neu', u'xve', u'sux', u'omc', u'zbl', u'sbv', u'xly', u'plq', u'esm', u'omk', u'got', u'xht', u'yms', u'omn', u'omr', u'omp', u'que', u'ecr', u'omx', u'puq', u'ecy', u'ymt', u'xdc', u'xls', u'bue', u'obm', u'xup', u'atc', u'ett', u'xpg', u'emy', u'xpi', u'gml', u'xwc', u'akk', u'etc', u'sjn', u'wlm', u'xng', u'igs', u'krb', u'xum', u'obt', u'tlh', u'obr', u'xrn', u'xch', u'xxb', u'spx', u'twi', u'sog', u'txh', u'txg', u'xil', u'sgs', u'txc', u'txb', u'iml', u'gmh', u'mkq', u'non', u'ynn', u'caj', u'sga', u'rmv', u'xyl', u'txr', u'zsk', u'xlb', u'pie', u'twc', u'xib', u'xqa', u'xpo', u'yrm', u'xpm', u'xap', u'xaq', u'spn', u'fro', u'ims', u'frm', u'xbm', u'jpa', u'xaj', u'xis', u'xiv', u'xsc', u'xps', u'xaa', u'xpp', u'xad', u'xae', u'xpu', u'xag', u'nrk', u'aqt', u'otk', u'nrn', u'xxt', u'nrc', u'xcn', u'otb', u'xzp', u'xhr', u'nkp', u'nci', u'chb', u'pnl', u'oos', u'xpr', u'oge', u'nom', u'xcm', u'xxm', u'nrt', u'xhc', u'xha', u'xur', u'xfa', u'inm', u'ysc', u'gnc', u'avk', u'xhu', u'mtm', 
u'ygs', u'dlm', u'sty', u'xvo', u'xcb', u'xcc', u'xgb', u'sxc', u'xcg', u'xce', u'vol', u'pro', u'sxk', u'sxl', u'xco', u'xcl', u'sxo', u'xcr', u'czk', u'xcv', u'xcw', u'xct', u'kho', u'oht', u'ohu', u'scx', u'xcy', u'oco', u'tzl', u'aqp', u'hit', u'xna', u'yol', u'mga', u'ido', u'pmh', u'xeb', u'enl', u'enm', u'ina', u'oav', u'oar', u'ota', u'enx', u'xrt', u'pmk', u'pal', u'xrr', u'pyx', u'xrm', u'pox', u'dws', u'phn', u'aes', u'ile', u'sqr', u'xpy', u'orv', u'qyp', u'lab', u'ltc', u'jbo', u'mzg', u'kaw', u'sqn', u'xzm', u'afh', u'qya', u'xzh', u'egy', u'bzt', u'oty', u'tpn', u'nei', u'ofo', u'tpw', u'ofs', u'xbo', u'xbn', u'xno', u'svx', u'ghc', u'nxm', u'mre', u'xbb', u'xqt', u'mjy', u'xbc', u'axm'])

n/a handling

Missing values should sometimes be replaced with the column average and sometimes with 0; this has to be chosen column by column when exporting to tsv.
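A small sketch of what a per-column choice could look like at export time. The column names in the example and the "n/a" marker are assumptions for illustration; fill_missing() is a hypothetical helper, not existing code.

```python
def fill_missing(rows, strategy, missing="n/a"):
    """Fill missing values column by column.

    rows: list of dicts (one per language), values are strings.
    strategy: dict mapping column name to "mean" or "zero".
    """
    filled = [dict(row) for row in rows]
    for column, how in strategy.items():
        known = [float(row[column]) for row in rows if row[column] != missing]
        if how == "mean":
            # fall back to 0 if the whole column is missing
            default = sum(known) / len(known) if known else 0.0
        elif how == "zero":
            default = 0.0
        else:
            raise ValueError("unknown strategy: {!r}".format(how))
        for row in filled:
            row[column] = default if row[column] == missing else float(row[column])
    return filled
```

With strategy={"speakers": "mean", "wp_articles": "zero"}, a missing speaker count becomes the average of the known ones, while a missing article count becomes 0.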

hunspell tsv

There is an arbitrary number of columns per row in extern/ld/res/hunspell.tsv.
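Until the file itself is fixed, the reader could force every row to a fixed width. This is only a sketch: read_ragged_tsv() is a hypothetical helper, and the right width depends on what the columns of hunspell.tsv are supposed to mean, which this issue does not say.

```python
import csv


def read_ragged_tsv(lines, width, fill=""):
    """Read tab-separated lines, padding or truncating each row to `width` fields.

    Note that truncation silently drops the extra fields.
    """
    rows = []
    for fields in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            # skip empty lines
            continue
        fields = fields[:width] + [fill] * (width - len(fields))
        rows.append(fields)
    return rows
```

It accepts any iterable of lines, so an open file object works as well as a list of strings.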

WP/DBPedia parser

There is a lot of info in *_language pages that can be parsed. DBPedia also parsed some of it:
http://en.wikipedia.org/wiki/Altai_language
http://dbpedia.org/page/Altai_language

And dbpedia can also be downloaded as dump.
If I understand it correctly, the field named "rdfs:comment" holds the first X sentences of the page, which have to be parsed with patterns to get more alternative names.

Parsers to integrate

extern/ld/parsers/unesco_atlas_parser.py

crúbadán gave new countries that weren't in the database before

Database that we used:
http://download.geonames.org/export/dump/countryInfo.txt

Comoros Islands
Dem. Rep. of Congo
Democratic Republic of Congo
Indonesia (Irian Jaya)
Indonesia (Java and Bali)
Indonesia (Kalimantan)
Indonesia (Maluku)
Indonesia (Nusa Tenggara)
Indonesia (Sulawesi)
Indonesia (Sumatra)
Korea
Kurdistan
Malaysia (Peninsular)
Malaysia (Sabah)
Malaysia (Sarawak)
Scotland
USA
Wales
Western Samoa

save results

We should add abstract methods to OfflineParser and/or OnlineParser that support running a parser and saving its results with pickle, so that the results can be loaded quickly when nothing is likely to have changed.
The dbpedia parsers already have a solution for this: parse_and_save(), read_results(), and parse(). Maybe these are okay, but a base class should implement them.
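Roughly what such a base class could look like. This is a sketch under assumptions: the method names follow the ones mentioned for the dbpedia parsers, but their real signatures are not known here, and CachingParser, pickle_fn and parse_or_load() are made-up names.

```python
import os
import pickle


class CachingParser(object):
    """Hypothetical base class: subclasses only implement parse()."""

    def __init__(self, pickle_fn):
        # path of the pickle file holding the saved results
        self.pickle_fn = pickle_fn

    def parse(self):
        """Yield one dict per language; implemented by subclasses."""
        raise NotImplementedError

    def parse_and_save(self):
        """Run the parser and store the results on disk."""
        results = list(self.parse())
        with open(self.pickle_fn, "wb") as f:
            pickle.dump(results, f)
        return results

    def read_results(self):
        """Load previously saved results."""
        with open(self.pickle_fn, "rb") as f:
            return pickle.load(f)

    def parse_or_load(self):
        """Use the saved results if they exist, otherwise parse and save."""
        if os.path.exists(self.pickle_fn):
            return self.read_results()
        return self.parse_and_save()
```

A parser would then subclass it and define only parse(); deleting the pickle file forces a fresh run.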

@kornai @pajkossy @juditacs @makrai @zseder

Duplicated alternative names

unique_together

omni unknown languages

Please fill
https://docs.google.com/spreadsheets/d/1nrHvRF1l0yyXsqbOvcoCjqCLxwb1b07Kd4hUrkzbwxo/edit?usp=sharing

More flexible location representation

Since some data is more detailed and gives province info with country as well, we can handle it later.
See #37 for details

many unknown wikt-inc codes returned by WPIncubatorAdjustedSizeCounter

600 out of 900, probably a bug, to be examined

LanguageArchive alternative names

there is a part of language description starting with
Other known names and dialect names:
we should parse this into a list of alternative names

many names not found by SoftwareSupportParser()

Maybe matching in lang_db.get_closest() could also strip '(...)' parts,
e.g. Spanish (Mexico) --> Spanish
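The stripping itself is a one-line regex. A sketch with hypothetical helper names; how lang_db.get_closest() would consume the candidates is left open, since its signature is not shown here.

```python
import re

_PAREN_PART = re.compile(r"\s*\([^()]*\)")


def strip_parenthesised(name):
    """Remove '(...)' parts: 'Spanish (Mexico)' -> 'Spanish'."""
    return _PAREN_PART.sub("", name).strip()


def lookup_candidates(name):
    """Return the names to try, the full name first and the stripped one second."""
    stripped = strip_parenthesised(name)
    if not stripped or stripped == name:
        return [name]
    return [name, stripped]
```

Trying the full name first matters: the 'Kirghiz' vs 'Kirghiz (latin)' issue in this list is a case where stripping too early would merge two entries that should stay apart.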

missing rows from Ethnologue tabular

These are the missing rows of individual, living languages.
(Most of the extinct and historic languages are also missing, as well as the macro languages.)
As a consequence now we have a 0 speaker count for all macro languages!
esy
fat
jat
jog
mzg
nno
nob
nrk
ptq
sgs
twi
yhs
yrm

disambiguate languages by script

Kirghiz vs Kirghiz (latin)

@kornai @juditacs @pajkossy @makrai

endangered level info

The different sources give multiple levels for a language; currently only the last one gets saved.
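Instead of overwriting, the levels could be collected per language together with their source. A sketch only: collect_levels() is a hypothetical helper, and the (sil, source, level) triple is an assumed input shape, not what the endangered parser returns today.

```python
from collections import defaultdict


def collect_levels(records):
    """Group endangerment levels by SIL code, keeping every (source, level) pair.

    records: iterable of (sil, source, level) triples.
    """
    levels = defaultdict(list)
    for sil, source, level in records:
        pair = (source, level)
        if pair not in levels[sil]:
            # keep the first occurrence, skip exact repeats
            levels[sil].append(pair)
    return dict(levels)
```

The order of the pairs follows the order of the input, so "the last one" is still recoverable if something depends on it.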

Incubator languages

There are two data sources, http://incubator.wikimedia.org/wiki/Incubator:Wikis and the incubator dump.
The former contains language name, incubator code pairs, the latter contains articles.
The former is much shorter, so a lot of data from the latter doesn't get into the database right now.

L2 speaker count

We don't have any up-to-date information; the table with data from Ethnologue does not contain it either.

crubadan parser to rewrite

(new web site, format)

softwaresupport parser

mishandles spaces and tabs

Crúbadán code mismatch

In some crúbadán rows the code in the first column is "sxb" while the sil column has "suh". "suh" is not a valid SIL code, but "sxb" is. Check whether this causes any problems.

Suspicious codes:
dhg
raj
rgn
sxb
tku
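A quick way to list all such rows, sketched under the assumption that each crúbadán row can be reduced to a (first-column code, sil-column code) pair and that a set of valid SIL codes is available. find_code_mismatches() is a hypothetical helper.

```python
def find_code_mismatches(rows, known_sils):
    """Return the rows whose two codes differ.

    rows: iterable of (code, sil) pairs.
    known_sils: set of valid SIL codes.
    Each result is (code, sil, code_is_known, sil_is_known).
    """
    mismatches = []
    for code, sil in rows:
        code, sil = code.strip(), sil.strip()
        if code != sil:
            mismatches.append((code, sil, code in known_sils, sil in known_sils))
    return mismatches
```

The two booleans show at a glance which of the two codes can be trusted for a given row.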

duplicated rows in db

many of the new models should have a unique_together flag to avoid duplicates

hunmisc liblinear instead of maxent

Easier interface, maxent sometimes fails to compile

endangered parser

info from subsections of 'Language information by source' to be parsed

Paranan appears twice with different SILs

Paranan language appears with two different SIL codes (prf and agp). The alternative names are the same so it should be the same language.
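Other cases of this kind could be found by grouping SIL codes by their set of alternative names. A sketch, assuming the alternative names can be dumped as a dict from SIL code to a list of names; same_altnames_different_sil() is a hypothetical helper.

```python
from collections import defaultdict


def same_altnames_different_sil(languages):
    """Find groups of SIL codes that share exactly the same alternative names.

    languages: dict mapping SIL code to an iterable of alternative names.
    Returns a sorted list of sorted SIL code lists.
    """
    by_names = defaultdict(list)
    for sil, names in languages.items():
        key = frozenset(name.strip().lower() for name in names)
        if key:
            # languages without alternative names are not compared
            by_names[key].append(sil)
    return sorted(sorted(sils) for sils in by_names.values() if len(sils) > 1)
```

Comparison is case-insensitive and ignores the order of the names; partial overlaps are not reported, only identical sets.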

pickle loading

baseparser silently loads pickles; it should log when it uses pickles, because the silent behaviour is confusing

further language variants to add

wals
Arapesh (Abu) Abu' Arapesh
Tommo So Tommo So Dogon

indi
ᏣᎳᎩ Cherokee
Mich keteer Limburgish

wikipedia list of languages
Belarusian (Taraškievica) Belarusian
Norwegian (Bokmål)
Norwegian (Nynorsk)
Palatinate German Pfaelzisch
Simple English

wikipedia incubators
Fakauvea Wallisian
Ghomala Ghomálá
Guadeloupean Creole Guadeloupean Creole French
Jamaican Jamaican Creole English
Khinalugh language Khinalug
Kildin Sami language Kildin Sami
Réunion Creole Réunion Creole French
Rinconada Rinconada Bikol
Saint Lucian Creole Saint Lucian Creole French
Solomon Islands Pijin Pijin
Standard Moroccan Amazigh Standard Moroccan Tamazigh
Unami-Lenape Unami
Wastec Huastec
Yawi Pattani Malay
Ɓasaá Basaa
Western Baluchi Western Balochi
Southern Baluchi Southern Baluchi

software support language names
...

fix pip install -e

currently it is not working
@juditacs

native language bug in crúbadán

The native name of English is currently, somehow, "English (South African)". This may be a bug.
