kornai / langdeath
Language death
Classification errors on the seed data mainly concern macro languages (or languages contained in a macro language).
This may be caused by Wikipedia features not being added consistently to the macro language and the 'largest individual in group'.
Maybe we could use extra features such as 'wp_macro...' and 'wp_individual...'.
@kornai How to deal with languages that are macro-languages?
There is a mapping here:
http://www-01.sil.org/iso639-3/macrolanguages.asp
but even though son is supposed to be a collective/macro language:
http://www-01.sil.org/iso639-3/documentation.asp?id=son
it is not on this list.
unless it returns dicts with sil key
@makrai res/office_if_pack is the parsed version of http://office.microsoft.com/en-001/downloads/office-language-interface-pack-lip-downloads-HA001113350.aspx. It lists some languages but not, for example, Hungarian, even though we are sure that Office is available in Hungarian. Could you check this?
https://translations.launchpad.net/ubuntu/vivid/+translations
https://launchpad.net/ubuntu/wily/+source/language-pack-hi
(it is not trivial to download all of them, since not all codes are of length 2)
instead of the maxent module
set([u'son', u'tzs', u'art', u'tzt', u'hva', u'tzz', u'phi', u'cbm', u'gsc', u'tgg', u'prv', u'tze', u'dmn', u'nhs', u'btb', u'tzu', u'apa', u'pob', u'lnc', u'dlc', u'ixi', u'gmo', u'blu', u'nah', u'nai', u'cai', u'tot', u'bnt', u'mvc', u'znd', u'auv', u'cnm', u'sio', u'wen', u'cke', u'daf', u'noo', u'hsf', u'qut', u'bih', u'oto', u'raj ', u'lms', u'mms', u'omq', u'pqw', u'quj', u'suh', u'acc', u'que ', u'ccx', u'jpx', u'eml', u'azc', u'cki', u'tzc', u'ckk', u'fiz', u'ckw', u'ccn', u'sum', u'mvj', u'mly', u'cti', u'agp'])
There are some heuristics that will find new languages
There are a few cases where the value of the champion (from the DBpedia infobox dump) cannot be added because the champion's SIL code is retired (typically for those languages for which it is the champion):
bhk gio jar kdv kzh mgx mzf stc sum unp nln
(in the majority of cases the champion is a macrolanguage (45 langs) or an individual language (35 langs) and has a SIL code)
Some parsers yield dicts where the key for alternate names is something other than 'alt_names',
so these names don't get into the database
(e.g. ethnologueparser)
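One way to handle this would be to normalize the key before records are saved. The following is only a sketch in Python 3: the variant key names other than 'alt_names' and the function name are hypothetical, not taken from the actual parsers.

```python
# Hypothetical variants; the real parsers may use different keys.
ALT_NAME_KEYS = ("alt_names", "alternative_names", "other_names", "altnames")


def normalize_alt_names(record):
    """Return a copy of a parser dict with all alternate names under 'alt_names'."""
    normalized = {k: v for k, v in record.items() if k not in ALT_NAME_KEYS}
    names = []
    for key in ALT_NAME_KEYS:
        value = record.get(key)
        if value is None:
            continue
        # Accept either a single string or an iterable of strings.
        if isinstance(value, str):
            value = [value]
        for name in value:
            if name not in names:
                names.append(name)
    if names:
        normalized["alt_names"] = names
    return normalized
```

Calling this on every dict a parser yields would leave records that already use 'alt_names' unchanged.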
http://glottolog.org/glottolog/language (the walsinfo parser will surely work for this)
http://dobes.mpi.nl/
http://www.verbix.com/maps/where-do-they-speak/
This post:
https://blogs.windows.com/windowsexperience/2014/02/05/over-7000-languages-just-1-windows/
says:
"So by including the main 50 scripts, we can support 7,000 languages, which is enough to support text input for about 98% of people in the world for at least one of the languages they speak. ... Not all of the 7,000 supported languages appear in the list of languages when you’re looking to add one in PC settings or Control Panel. But you can find any supported language in Windows by searching for the name of the language or its IETF tag (Open the Search charm, search for “Add a language,” and then click Add a language)."
so it looks like there is no way to export this list,
getting the language pack is easier:
http://windows.microsoft.com/en-us/windows/language-packs#lptabs=win10
parsing error; apparently the Wikipedia site has changed
I have renamed the shared doc from "omni errors" to "langdeath errors"; each tab is for a different source
https://docs.google.com/spreadsheets/d/1nrHvRF1l0yyXsqbOvcoCjqCLxwb1b07Kd4hUrkzbwxo/edit#gid=2079801224
Languages coming from OS sources have 64 errors which we should fix.
Please find solutions that are easily automated.
to have alternative names
set([u'osx', u'nno', u'dum', u'ojp', u'owl', u'osp', u'uga', u'tkm', u'nwx', u'ldn', u'tkv', u'pkc', u'pka', u'xep', u'nwa', u'nwc', u'ptw', u'osc', u'och', u'xmn', u'nrp', u'xtr', u'xtq', u'xmk', u'xme', u'yyr', u'gmg', u'xtz', u'auo', u'xtg', u'pgn', u'gmy', u'xto', u'hmk', u'mod', u'pgl', u'xmr', u'xlo', u'qwm', u'lre', u'zkz', u'lcq', u'apv', u'xdm', u'sjk', u'tjm', u'xwo', u'nob', u'ihw', u'lzh', u'ndf', u'nwy', u'mxi', u'nov', u'qwt', u'okm', u'okl', u'oko', u'psu', u'crr', u'xss', u'cmm', u'xsv', u'xli', u'xln', u'ang', u'cmg', u'grz', u'xle', u'xhd', u'xlc', u'umc', u'elx', u'xpc', u'crb', u'emm', u'ule', u'xsa', u'xsd', u'xlu', u'cms', u'xso', u'gev', u'xlp', u'xgl', u'xld', u'xcu', u'aru', u'xgf', u'xlg', u'hlu', u'lfn', u'odt', u'gft', u'xga', u'arc', u'olt', u'cnx', u'imy', u'lng', u'rrt', u'xgr', u'tta', u'oui', u'htx', u'yug', u'zko', u'zkh', u'goh', u'zkk', u'ddr', u'xvs', u'zkg', u'fat', u'zkb', u'onw', u'bll', u'peo', u'zra', u'xvn', u'zkt', u'zku', u'zkv', u'neu', u'xve', u'sux', u'omc', u'zbl', u'sbv', u'xly', u'plq', u'esm', u'omk', u'got', u'xht', u'yms', u'omn', u'omr', u'omp', u'que', u'ecr', u'omx', u'puq', u'ecy', u'ymt', u'xdc', u'xls', u'bue', u'obm', u'xup', u'atc', u'ett', u'xpg', u'emy', u'xpi', u'gml', u'xwc', u'akk', u'etc', u'sjn', u'wlm', u'xng', u'igs', u'krb', u'xum', u'obt', u'tlh', u'obr', u'xrn', u'xch', u'xxb', u'spx', u'twi', u'sog', u'txh', u'txg', u'xil', u'sgs', u'txc', u'txb', u'iml', u'gmh', u'mkq', u'non', u'ynn', u'caj', u'sga', u'rmv', u'xyl', u'txr', u'zsk', u'xlb', u'pie', u'twc', u'xib', u'xqa', u'xpo', u'yrm', u'xpm', u'xap', u'xaq', u'spn', u'fro', u'ims', u'frm', u'xbm', u'jpa', u'xaj', u'xis', u'xiv', u'xsc', u'xps', u'xaa', u'xpp', u'xad', u'xae', u'xpu', u'xag', u'nrk', u'aqt', u'otk', u'nrn', u'xxt', u'nrc', u'xcn', u'otb', u'xzp', u'xhr', u'nkp', u'nci', u'chb', u'pnl', u'oos', u'xpr', u'oge', u'nom', u'xcm', u'xxm', u'nrt', u'xhc', u'xha', u'xur', u'xfa', u'inm', u'ysc', u'gnc', u'avk', u'xhu', u'mtm', 
u'ygs', u'dlm', u'sty', u'xvo', u'xcb', u'xcc', u'xgb', u'sxc', u'xcg', u'xce', u'vol', u'pro', u'sxk', u'sxl', u'xco', u'xcl', u'sxo', u'xcr', u'czk', u'xcv', u'xcw', u'xct', u'kho', u'oht', u'ohu', u'scx', u'xcy', u'oco', u'tzl', u'aqp', u'hit', u'xna', u'yol', u'mga', u'ido', u'pmh', u'xeb', u'enl', u'enm', u'ina', u'oav', u'oar', u'ota', u'enx', u'xrt', u'pmk', u'pal', u'xrr', u'pyx', u'xrm', u'pox', u'dws', u'phn', u'aes', u'ile', u'sqr', u'xpy', u'orv', u'qyp', u'lab', u'ltc', u'jbo', u'mzg', u'kaw', u'sqn', u'xzm', u'afh', u'qya', u'xzh', u'egy', u'bzt', u'oty', u'tpn', u'nei', u'ofo', u'tpw', u'ofs', u'xbo', u'xbn', u'xno', u'svx', u'ghc', u'nxm', u'mre', u'xbb', u'xqt', u'mjy', u'xbc', u'axm'])
Sometimes the average, sometimes 0; this has to be chosen column by column at the stage of exporting to TSV
there is an arbitrary number of columns per row in extern/ld/res/hunspell.tsv
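If the varying row length is the problem, one option is to pad short rows while reading. This is a sketch under that assumption; the function name is hypothetical and the meaning of the columns in hunspell.tsv is not known here.

```python
import csv
import io


def read_ragged_tsv(stream, fill=""):
    """Read tab-separated rows and pad short rows so all have equal length."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip empty lines
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# Usage with an in-memory example instead of the real file.
example = io.StringIO("hu\ta\tb\nen\tc\n")
padded = read_ragged_tsv(example)
```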
There is a lot of info in *_language pages that can be parsed. DBPedia also parsed some of it:
http://en.wikipedia.org/wiki/Altai_language
http://dbpedia.org/page/Altai_language
And dbpedia can also be downloaded as dump.
If I understand it correctly, the field named "rdfs:comment" holds the first X sentences of the page, which have to be parsed with patterns to get more alternative names.
Comoros Islands
Database that we used:
http://download.geonames.org/export/dump/countryInfo.txt
Dem. Rep. of Congo
Democratic Republic of Congo
Indonesia (Irian Jaya)
Indonesia (Java and Bali)
Indonesia (Kalimantan)
Indonesia (Maluku)
Indonesia (Nusa Tenggara)
Indonesia (Sulawesi)
Indonesia (Sumatra)
Korea
Kurdistan
Malaysia (Peninsular)
Malaysia (Sabah)
Malaysia (Sarawak)
Scotland
USA
Wales
Western Samoa
We should add abstract methods to OfflineParser and/or OnlineParser that support running the parser and saving the results with pickle, so that the results can be loaded quickly when nothing is likely to have changed.
There is a solution for this in the dbpedia parsers: parse_and_save(), read_results(), and parse(). Maybe these are okay, but a base class should implement this.
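A base class along these lines might look like the sketch below. The class name and constructor argument are hypothetical; only the three method names come from the dbpedia parsers, and their real signatures may differ.

```python
import logging
import os
import pickle


class PickleCachingParser(object):
    """Hypothetical base class; subclasses only implement parse()."""

    def __init__(self, pickle_path):
        self.pickle_path = pickle_path

    def parse(self):
        """Yield parsed records; must be overridden by subclasses."""
        raise NotImplementedError

    def parse_and_save(self):
        """Run the parser and store the results with pickle."""
        results = list(self.parse())
        with open(self.pickle_path, "wb") as handle:
            pickle.dump(results, handle)
        return results

    def read_results(self):
        """Load pickled results if present, otherwise parse and save them."""
        if os.path.exists(self.pickle_path):
            # Log the cache hit so that loading from pickle is never silent.
            logging.info("loading cached results from %s", self.pickle_path)
            with open(self.pickle_path, "rb") as handle:
                return pickle.load(handle)
        return self.parse_and_save()
```

Deleting the pickle file forces a fresh parse, which keeps cache invalidation manual but explicit.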
unique_together
Since some data is more detailed and gives province info with country as well, we can handle it later.
See #37 for details
600 out of 900, probably a bug, to be examined
there is a part of language description starting with
Other known names and dialect names:
we should parse this into a list of alternative names
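A rough sketch of such a parser follows. It assumes the names come as a comma- or semicolon-separated list on the same line as the marker, which has not been checked against the real descriptions; the function name is hypothetical.

```python
import re

MARKER = "Other known names and dialect names:"


def parse_other_names(description):
    """Extract the names following MARKER, assuming a comma-separated list."""
    _, found, rest = description.partition(MARKER)
    if not found:
        return []
    # Assumption: the list ends at the first line break.
    first_line = rest.strip().split("\n", 1)[0]
    names = [name.strip(" .") for name in re.split(r"[,;]", first_line)]
    return [name for name in names if name]
```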
maybe matching in lang_db.get_closest() could also contain stripping '(...)' parts
like Spanish (Mexico) --> Spanish
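The stripping step itself could be sketched as below; how it would be wired into lang_db.get_closest() is left open, and the helper name is hypothetical.

```python
import re

# Matches one non-nested '(...)' group plus any whitespace before it.
_PAREN = re.compile(r"\s*\([^()]*\)")


def strip_parenthesized(name):
    """Remove '(...)' parts and surplus whitespace from a language name."""
    return " ".join(_PAREN.sub(" ", name).split())
```

Since names such as "Norwegian (Bokmål)" and "Norwegian (Nynorsk)" would both collapse to "Norwegian", this is probably safer as a fallback when the full name has no match.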
These are the missing rows of individual, living languages.
(Most of the extinct and historic languages are also missing, as well as the macro languages.)
As a consequence now we have a 0 speaker count for all macro languages!
esy
fat
jat
jog
mzg
nno
nob
nrk
ptq
sgs
twi
yhs
yrm
There are multiple levels of the different sources, currently only the last one gets saved
There are two data sources, http://incubator.wikimedia.org/wiki/Incubator:Wikis and the incubator dump.
The former contains language name, incubator code pairs, the latter contains articles.
The former is much shorter, so a lot of data from the latter doesn't get into the database right now.
We don't have any up-to-date information; the table with data from Ethnologue does not contain it either
(new web site, format)
mishandles spaces and tabs
There are some rows in crúbadán where the code in the first column is "sxb" but the sil column contains "suh", which is not a valid SIL code, while "sxb" is. Check whether this causes any problems.
Suspicious codes:
dhg
raj
rgn
sxb
tku
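A check for rows where the first-column code and the sil column disagree could be sketched as follows. It assumes the rows are available as (code, sil) pairs and that a set of valid SIL codes exists; both are assumptions about data layout, and the function name is hypothetical.

```python
def find_code_mismatches(rows, valid_sils):
    """Return rows whose first-column code and sil column disagree.

    Each result is (code, sil, code_is_valid, sil_is_valid).
    """
    mismatches = []
    for code, sil in rows:
        if code != sil:
            mismatches.append((code, sil, code in valid_sils, sil in valid_sils))
    return mismatches
```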
many of the new models should have a unique_together flag to avoid duplicates
Easier interface, maxent sometimes fails to compile
info from subsections of 'Language information by source' to be parsed
Paranan language appears with two different SIL codes (prf and agp). The alternative names are the same so it should be the same language.
baseparser silently loads pickles; it should log when it uses them, because the silent loading is confusing
wals
Arapesh (Abu) Abu' Arapesh
Tommo So Tommo So Dogon
indi
ᏣᎳᎩ Cherokee
Mich keteer Limburgish
wikipedia list of languages
Belarusian (Taraškievica) Belarusian
Norwegian (Bokmål)
Norwegian (Nynorsk)
Palatinate German Pfaelzisch
Simple English
wikipedia incubators
Fakauvea Wallisian
Ghomala Ghomálá
Guadeloupean Creole Guadeloupean Creole French
Jamaican Jamaican Creole English
Khinalugh language Khinalug
Kildin Sami language Kildin Sami
Réunion Creole Réunion Creole French
Rinconada Rinconada Bikol
Saint Lucian Creole Saint Lucian Creole French
Solomon Islands Pijin Pijin
Standard Moroccan Amazigh Standard Moroccan Tamazigh
Unami-Lenape Unami
Wastec Huastec
Yawi Pattani Malay
Ɓasaá Basaa
Western Baluchi Western Balochi
Southern Baluchi Southern Baluchi
software support language names
...
currently it is not working
@juditacs
Somehow the native name of English is now "English (South African)". That may be a bug.