indecol / country_converter Goto Github PK

The country converter (coco) - a Python package for converting country names between different classification schemes.

License: GNU General Public License v3.0

Python 89.54% TeX 9.96% Shell 0.50%

country_converter's Introduction

country converter

The country converter (coco) is a Python package to convert and match country names between different classifications and between different naming versions. Internally it uses regular expressions to match country names. Coco can also be used to build aggregation concordance matrices between different classification schemes.

Motivation

To date, there is no single standard of how to name or specify individual countries in a (meta) data description. While some data sources follow ISO 3166, this standard defines a two and a three letter code in addition to a numerical classification. To further complicate the matter, instead of using one of the existing standards, many databases use unstandardised country names to classify countries.

The country converter (coco) automates the conversion between different standards and versions of country names. Internally, coco is based on a table specifying the different ISO and UN standards per country together with the official name and a regular expression which aims to match all English versions of a specific country name. In addition, coco includes classifications based on UN-, EU-, OECD-membership, UN regions specifications, continents and various MRIO and IAM databases (see Classification schemes below).
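
The table-plus-regular-expression idea described above can be sketched in a few lines. This is only an illustration: the three rows, the simplified patterns and the helper name `lookup` are assumptions made for this example and are not taken from coco's actual data file or code.

```python
import re

# Hypothetical three-row excerpt. coco's real table has many more rows and
# columns, and its actual regular expressions differ from these simplified ones.
COUNTRY_TABLE = [
    {"ISO3": "DEU", "name_short": "Germany",
     "regex": r"germany|deutschland"},
    {"ISO3": "KOR", "name_short": "South Korea",
     "regex": r"^(?!.*\bdem)(?!.*\bnorth).*\bkorea"},
    {"ISO3": "PRK", "name_short": "North Korea",
     "regex": r"\bkorea.*\b(dem|north)|\b(dem|north).*\bkorea|dprk"},
]


def lookup(name, to="ISO3"):
    """Return the `to` column of every row whose regex matches `name`."""
    return [row[to] for row in COUNTRY_TABLE
            if re.search(row["regex"], name, flags=re.IGNORECASE)]


print(lookup("Korea, Republic of"))                        # ['KOR']
print(lookup("Dem. People's Rep. of Korea"))               # ['PRK']
print(lookup("Federal Republic of Germany", to="name_short"))  # ['Germany']
```

Returning a list makes ambiguous inputs visible: a name that hits more than one pattern yields several entries instead of silently picking one.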

Installation

Country_converter is registered at PyPI. From the command line:

pip install country_converter --upgrade

The country converter is also available from conda-forge and can be installed with conda (if the conda-forge channel is not part of your conda config, add "-c conda-forge"; see the install instructions here):

conda install country_converter

Alternatively, the source code is available on GitHub.

The package depends on Pandas; for testing pytest is required. For further information on running the tests see CONTRIBUTING.md.

Usage

Basic usage

Use within Python

Convert various country names to some standard names:

import country_converter as coco
some_names = ['United Rep. of Tanzania', 'DE', 'Cape Verde', '788', 'Burma', 'COG',
              'Iran (Islamic Republic of)', 'Korea, Republic of',
              "Dem. People's Rep. of Korea"]
standard_names = coco.convert(names=some_names, to='name_short')
print(standard_names)

Which results in ['Tanzania', 'Germany', 'Cabo Verde', 'Tunisia', 'Myanmar', 'Congo Republic', 'Iran', 'South Korea', 'North Korea']. The input format is determined automatically, based on ISO two letter, ISO three letter, ISO numeric or regular expression matching. In case of any ambiguity, the source format can be specified with the parameter 'src'.
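
To illustrate why the `src` parameter can matter, here is a minimal sketch of how an input format could be guessed from the shape of a string. The function `guess_source` is hypothetical and is not coco's implementation; only the format labels (ISO2, ISO3, ISOnumeric, regex) are taken from this README.

```python
def guess_source(name):
    """Guess which classification a single input string most likely uses."""
    text = str(name).strip()
    if text.isdigit():
        return "ISOnumeric"
    if text.isalpha() and len(text) == 2:
        return "ISO2"
    if text.isalpha() and len(text) == 3:
        return "ISO3"
    # Anything else is treated as a free-text name for regex matching.
    return "regex"


some_names = ["DE", "788", "COG", "Burma", "Korea, Republic of"]
print([guess_source(n) for n in some_names])
# ['ISO2', 'ISOnumeric', 'ISO3', 'regex', 'regex']
```

A purely shape-based guess like this cannot tell a two-letter abbreviation that is not an ISO2 code from a real one, which is the kind of ambiguity an explicit `src` resolves.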

When running multiple conversions, better performance can be achieved by instantiating a single CountryConverter object and reusing it for all conversions:

import country_converter as coco
cc = coco.CountryConverter()

some_names = ['United Rep. of Tanzania', 'Cape Verde', 'Burma',
              'Iran (Islamic Republic of)', 'Korea, Republic of',
              "Dem. People's Rep. of Korea"]

standard_names = cc.convert(names = some_names, to = 'name_short')
UNmembership = cc.convert(names = some_names, to = 'UNmember')
print(standard_names)
print(UNmembership)

In order to more efficiently convert Pandas Series, the pandas_convert() method can be used. The performance gain is especially significant for large Series. For a series containing 1 million rows a 4000x speedup can be achieved, compared to convert().

import country_converter as coco
import pandas as pd
cc = coco.CountryConverter()

some_countries = pd.Series(['Australia', 'Belgium', 'Brazil', 'Bulgaria', 'Cyprus', 'Czech Republic',
                  'Guatemala', 'Mexico', 'Honduras', 'Costa Rica', 'Colombia', 'Greece', 'Hungary',
                  'India', 'Indonesia', 'Ireland', 'Italy', 'Japan', 'Latvia', 'Lithuania',
                  'Luxembourg', 'Malta', 'Jamaica', 'Ireland', 'Turkey', 'United Kingdom',
                  'United States'], name='country')
 
iso3_codes = cc.pandas_convert(series=some_countries, to='ISO3')
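
This README does not state how pandas_convert() achieves its speedup. One plausible approach for data with many repeated names, shown here purely as an assumption, is to convert each distinct value once and map the results back onto all rows. The helpers `convert_one` and `convert_via_unique` and the small lookup table are hypothetical.

```python
def convert_one(name):
    """Stand-in for an expensive per-name conversion (hypothetical helper)."""
    convert_one.calls += 1
    table = {"ireland": "IRL", "italy": "ITA", "japan": "JPN"}
    return table.get(name.lower(), "not found")


convert_one.calls = 0  # counts how often the expensive step actually runs


def convert_via_unique(names):
    """Convert each distinct name once, then map the results onto all rows."""
    mapping = {name: convert_one(name) for name in set(names)}
    return [mapping[name] for name in names]


rows = ["Ireland", "Italy", "Japan", "Ireland"] * 1000
codes = convert_via_unique(rows)
print(codes[:4], convert_one.calls)  # ['IRL', 'ITA', 'JPN', 'IRL'] 3
```

With 4000 rows but only three distinct names, the expensive step runs three times, which is why the gain grows with the size of the Series.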

Convert between classification schemes:

iso3_codes = ['USA', 'VUT', 'TKL', 'AUT', 'XXX' ]
iso2_codes = coco.convert(names=iso3_codes, to='ISO2')
print(iso2_codes)

Which results in ['US', 'VU', 'TK', 'AT', 'not found']

The not-found indication can be customised (e.g. not_found = 'not there'). If None is passed for 'not_found', the original entry is passed through:

iso2_codes = coco.convert(names=iso3_codes, to='ISO2', not_found=None)
print(iso2_codes)

results in ['US', 'VU', 'TK', 'AT', 'XXX']
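
The three behaviours of not_found (default marker, custom marker, pass-through with None) can be mimicked with a plain dictionary lookup. This is a sketch of the semantics described above, not coco's code; `ISO3_TO_ISO2` and `to_iso2` are names invented for the example.

```python
ISO3_TO_ISO2 = {"USA": "US", "VUT": "VU", "TKL": "TK", "AUT": "AT"}  # illustrative excerpt


def to_iso2(codes, not_found="not found"):
    """Map ISO3 to ISO2; unmatched entries get `not_found`, or pass through if it is None."""
    return [ISO3_TO_ISO2.get(code, code if not_found is None else not_found)
            for code in codes]


iso3_codes = ["USA", "VUT", "TKL", "AUT", "XXX"]
print(to_iso2(iso3_codes))                         # [..., 'not found']
print(to_iso2(iso3_codes, not_found="not there"))  # [..., 'not there']
print(to_iso2(iso3_codes, not_found=None))         # [..., 'XXX']
```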

Internally the data is stored in a Pandas DataFrame, which can be accessed directly. For example, this can be used to filter countries for membership organisations (per year). Note: for this, an instance of CountryConverter is required.

import country_converter as coco
cc = coco.CountryConverter()

some_countries = ['Australia', 'Belgium', 'Brazil', 'Bulgaria', 'Cyprus', 'Czech Republic',
                  'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary',
                  'India', 'Indonesia', 'Ireland', 'Italy', 'Japan', 'Latvia', 'Lithuania',
                  'Luxembourg', 'Malta', 'Romania', 'Russia', 'Turkey', 'United Kingdom',
                  'United States']

oecd_since_1995 = cc.data[(cc.data.OECD >= 1995) & cc.data.name_short.isin(some_countries)].name_short
eu_until_1980 = cc.data[(cc.data.EU <= 1980) & cc.data.name_short.isin(some_countries)].name_short
print(oecd_since_1995)
print(eu_until_1980)

All classifications can be directly accessed by:

cc.EU28
cc.OECD

cc.EU27as('ISO3')

and the classification schemes available:

cc.valid_class

There is also a method for only getting country classifications (thus omitting any grouping of countries):

cc.valid_country_classifications

If you rather need a dictionary describing the classification/membership use:

import country_converter as coco
cc = coco.CountryConverter()
cc.get_correspondence_dict('EXIO3', 'ISO3')

to also include countries not assigned within a specific classification use:

cc.get_correspondence_dict('EU27', 'ISO2', replace_nan='NonEU')

The regular expressions can also be used to match any list of countries to any other. For example:

match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too',
               'Peoples Republic of China', 'Republic of China' ]

matching_dict = coco.match(match_these, master_list)

By default, country converter emits a warning through Python's logging module if no match is found. The following example demonstrates how to configure the coco logging behaviour.

import logging
import country_converter as coco
logging.basicConfig(level=logging.INFO)
coco.convert("asdf")
# WARNING:country_converter.country_converter:asdf not found in regex
# Out: 'not found'

coco_logger = coco.logging.getLogger()
coco_logger.setLevel(logging.CRITICAL)
coco.convert("asdf")
# Out: 'not found'
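
To quieten only coco while leaving other loggers untouched, the level can be set on a named logger instead of the root logger. The sketch below uses only the standard library and assumes the logger name shown in the warning line above (country_converter.country_converter); `my_app` is a hypothetical application logger.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Logger names taken from the warning line above; assumed to be stable.
package_logger = logging.getLogger("country_converter")
module_logger = logging.getLogger("country_converter.country_converter")
other_logger = logging.getLogger("my_app")  # hypothetical application logger

package_logger.setLevel(logging.CRITICAL)

# The module logger has no level of its own, so it inherits CRITICAL
# from its parent "country_converter"; other loggers are unaffected.
print(module_logger.isEnabledFor(logging.WARNING))  # False
print(other_logger.isEnabledFor(logging.WARNING))   # True
```

This relies on the standard logger hierarchy: a dotted name is a child of its prefix and takes the parent's effective level unless it sets its own.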

See the IPython Notebook (country_converter_examples.ipynb) for more information.

Command line usage

The country converter package also provides a command line interface called coco.

Minimal example:

coco Cyprus DE Denmark Estonia 4 'United Kingdom' AUT

This converts the given names to ISO3 codes by matching the input against ISO2, ISO3, ISOnumeric or the regular expressions. The names must be separated by spaces; country names consisting of multiple words must be put in quotes ('').

The input classification can be specified with '--src' or '-s' (or will be determined automatically), the target classification with '--to' or '-t'.

The default output is a space-separated list. This can be changed by passing a separator with '--output_sep' or '-o' (e.g. -o '|').

Thus, to convert from ISO3 to UN number codes and receive the output as comma separated list use:

coco AUT DEU VAT AUS -s ISO3 -t UNcode -o ', '

The command line tool also allows specifying the output for entries that are not found, including passing them through to the output by passing None:

coco CAN Peru US Mexico Venezuela UK Arendelle --not_found=None

and to specify an additional data file which will overwrite the existing country matching:

coco Congo --additional_data path/to/datafile.csv

See https://github.com/IndEcol/country_converter/tree/master/tests/custom_data_example.txt for an example of an additional datafile.

The flags --UNmember_only (-u) and --include_obsolete (-i) restrict the search to UN member states only or extend it to also include currently obsolete countries. For example, the Netherlands Antilles were dissolved in 2010.

Thus:

coco "Netherlands Antilles"

results in "not found". The search, however, can be extended to recently dissolved countries by:

coco "Netherlands Antilles" -i

which results in 'ANT'.

In addition to the countries, the coco command line tool also accepts various country classifications (EXIO1, EXIO2, EXIO3, WIOD, Eora, MESSAGE, OECD, EU27, EU28, UN, obsolete, Cecilia2050, BRIC, APEC, BASIC, CIS, G7, G20). One of these can be passed by

coco G20

which lists all countries in that classification.

For the classifications covering almost all countries (MRIO and IAM classifications)

coco EXIO3

lists the unique classification names. When passing a --to parameter, a simplified correspondence of the chosen classification is printed:

coco EXIO3 --to ISO3

For further information call the help by

coco -h

Use in Matlab

Newer versions of Matlab (tested in 2016a) can call Python functions and libraries directly. This requires a Python version >= 3.4 installed in the system path (e.g. through Anaconda).

To test, try this in Matlab:

py.print(py.sys.version)

If this works, you can also use coco after installing it through pip (at the Windows command line; see the installation instructions above):

pip install country_converter --upgrade

And in Matlab:

coco = py.country_converter.CountryConverter()
countries = {'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Republic of China'};
ISO2_pythontype = coco.convert(countries, pyargs('to', 'ISO2'));
ISO2_cellarray = cellfun(@char,cell(ISO2_pythontype),'UniformOutput',false);

Alternatively, as a long oneliner:

short_names = cellfun(@char, cell(py.country_converter.convert({56, 276}, pyargs('src', 'UNcode', 'to', 'name_short'))), 'UniformOutput',false);

All properties of coco as explained above are also available in Matlab:

coco = py.country_converter.CountryConverter();
coco.EU27
EU27ISO3 = coco.EU27as('ISO3');

These functions return a Pandas DataFrame. The underlying values can be accessed with .values, e.g.:

EU27ISO3.values

I leave it to professional Matlab users to figure out how to further process them.

See also IPython Notebook (country_converter_examples.ipynb) for more information - all functions available in Python (for example passing additional data files, specifying the output in case of missing data) work also in Matlab by passing arguments through the pyargs function.

Building concordances for country aggregation

Coco provides a function for building concordance vectors, matrices and dictionaries between different classifications. This can be used in Python as well as in Matlab. For further information see country_converter_aggregation_helper.ipynb.
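
As a rough idea of what a concordance matrix is, the sketch below builds one from a plain mapping and uses it to aggregate values. It does not use coco's own helper function; `build_concordance`, the region labels and the numbers are all invented for illustration.

```python
def build_concordance(mapping, sources, targets):
    """Return a len(sources) x len(targets) 0/1 matrix.

    Entry [i][j] is 1 if source i is assigned to target j, else 0.
    """
    return [[1 if mapping.get(s) == t else 0 for t in targets] for s in sources]


# Hypothetical aggregation of four ISO3 codes into two illustrative regions.
iso3_to_region = {"AUT": "Europe", "DEU": "Europe", "USA": "America", "CAN": "America"}
countries = ["AUT", "DEU", "USA", "CAN"]
regions = ["Europe", "America"]

concordance = build_concordance(iso3_to_region, countries, regions)
for country, row in zip(countries, concordance):
    print(country, row)

# Aggregating country-level values is then a vector-matrix product.
values = [10, 20, 30, 40]
aggregated = [sum(v * row[j] for v, row in zip(values, concordance))
              for j in range(len(regions))]
print(dict(zip(regions, aggregated)))  # {'Europe': 30, 'America': 70}
```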

Classification schemes

Currently the following classification schemes are available (see also Data sources below for further information):

ISO2 (ISO 3166-1 alpha-2) - including UK/EL for Britain/Greece (but always convert to GB/GR)
ISO3 (ISO 3166-1 alpha-3)
ISO - numeric (ISO 3166-1 numeric)
UN numeric code (M.49 - follows to a large extent ISO-numeric)
A standard or short name
The "official" name
Continent: 6 continent classification with Africa, Antarctica, Asia, Europe, America, Oceania
Continent_7 classification - 7 continent classification splitting North/South America
UN region
EXIOBASE 1 classification (2 and 3 letters)
EXIOBASE 2 classification (2 and 3 letters)
EXIOBASE 3 classification (2 and 3 letters)
WIOD classification
Eora
OECD membership (per year)
MESSAGE 11-region classification
IMAGE
REMIND
UN membership (per year)
EU membership (including EU12, EU15, EU25, EU27, EU27_2007, EU28)
EEA membership
Schengen region
Cecilia 2050 classification
APEC
BRIC
BASIC
CIS (as by 2019, excl. Turkmenistan)
G7
G20 (listing all EU member states as individual members)
FAOcode (numeric)
GBDcode (numeric - Global Burden of Disease country codes)
IEA (World Energy Balances 2021)
DACcode (numeric - OECD Development Assistance Committee)
ccTLD - country code top-level domains
GWcode - Gleditsch & Ward numerical codes as published in https://www.andybeger.com/states/articles/statelists.html
CC41 - common classification for MRIOs (list of countries found in all public MRIOs)
IOC - International Olympic Committee (IOC) country codes

Coco contains officially recognised codes as well as non-standard codes for disputed or dissolved countries. To restrict the set to officially recognised UN members only, or to include obsolete countries, pass the corresponding argument:

import country_converter as coco
cc = coco.CountryConverter()
cc_UN = coco.CountryConverter(only_UNmember=True)
cc_all = coco.CountryConverter(include_obsolete=True)

cc.convert(['PSE', 'XKX', 'EAZ', 'FRA'], to='name_short')
cc_UN.convert(['PSE', 'XKX', 'EAZ', 'FRA'], to='name_short')
cc_all.convert(['PSE', 'XKX', 'EAZ', 'FRA'], to='name_short')

cc results in ['Palestine', 'Kosovo', 'not found', 'France'], whereas cc_UN converts to ['not found', 'not found', 'not found', 'France'] and cc_all converts to ['Palestine', 'Kosovo', 'Zanzibar', 'France']. Note that the underlying dataframe is available at the attribute .data (e.g. cc_all.data).

Data sources and further reading

Most of the underlying data can be found in Wikipedia; the page describing ISO 3166-1 is a good starting point. The page on the ISO2 codes includes a section "Imperfect Implementations" explaining the GB/UK and EL/GR issue. UN regions/codes are given on the United Nations Statistics Division (unstats) webpage. The differences between the ISO numeric and UN (M.49) codes are also explained on Wikipedia. The EXIOBASE, WIOD and Eora classifications were extracted from the respective databases. For Eora, the names are based on the 'Country names' csv file provided on the webpage, but updated for different names used in the Eora26 database. The MESSAGE classification follows the 11-region aggregation given in the MESSAGE model regions description. The IMAGE classification is based on the "region classification map"; for REMIND we received a country mapping from the model developers.

The membership of OECD and UN can be found at the membership organisations' webpages, information about obsolete country codes on the Statoids webpage.

The situation for the EU got complicated due to the Brexit process. For the naming, coco follows the Eurostat glossary, thus EU27 refers to the EU without the UK, whereas EU27_2007 refers to the EU without Croatia (the status after the 2007 enlargement). The shortcut EU always links to the most recent classification. The EEA agreements for the UK ended by 2021-01-01 (which also affects Guernsey, Isle of Man, Jersey and Gibraltar). Switzerland is not part of the EEA but is a member of the single market.

The Global Burden of Disease country codes were extracted from the GBD code book available here.

Communication, issues, bugs and enhancements

Please use the issue tracker for documenting bugs, proposing enhancements and all other communication related to coco.

You can follow me on Mastodon and Twitter to get the latest news about all my open-source and research projects (and occasionally some random retweets/toots).

Contributing

Want to contribute? Great! Please check CONTRIBUTING.md if you want to help improve coco and for some pointers on how to add classifications.

Related software

The package pycountry provides access to the official ISO databases for historic countries, country subdivisions, languages and currencies. In case you need to convert non-English country names, countrynames includes an extensive database of country names in different languages and functions to convert them to the different ISO 3166 standards. Python-iso3166 focuses on conversion between the two-letter, three-letter and three-digit codes defined in the ISO 3166 standard.

If you are using R, you should have a look at countrycode.

Citing the country converter

Version 0.5 of the country converter was published in the Journal of Open Source Software. To cite the country converter in publications please use:

Stadler, K. (2017). The country converter coco - a Python package for converting country names between different classification schemes. The Journal of Open Source Software. doi: 10.21105/joss.00332

For the full bibtex key see CITATION

Acknowledgements

This package was inspired by (and the regular expressions are mostly based on) the R-package countrycode by Vincent Arel-Bundock and his (defunct) port to Python (pycountrycode). Many thanks to Robert Gieseke for the review of the source code and paper for the publication in the Journal of Open Source Software.


country_converter's Issues

OECD Countries Update

Latvia joined the OECD in 2016,
could you please update your data?
Should I write a pull request?

I write my telegram bot using this module and it can't find some countries

Some of them work off the code.

e.g if I input US or USA to my bot it prints not found but if I input United States of America it works (the same issue with the UK)
The error in my console is: WARNING:root: not found not found in regex
Also if I print South Korea it gives me the error that says: TypeError: can only concatenate str (not "list") to str
But it perfectly works with North Korea even when I call it DPRK (which is regarding the previous problem with the USA and the UK)
Furthermore, it doesn't work with Ireland. I am completely confused about this one since it's not an abbreviation and not 2 words long like South Korea.
The error is completely the same as the first one: WARNING:root:not found not found in regex

I can provide you with my code if you want me to (just tell me about it)

It would be really nice if you could help me and solve all these issues or some of them

Thanks

Statement of need

A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?

Could you elaborate a bit more on this in the Readme? Mentioning ISO codes, numbers as examples might be useful. I don't think it needs to be as detailed as in the paper but some example usages might be helpful.

Add BRICS

Ignore IPython checkpoints

Maybe you could add .ipynb_checkpoints to your .gitignore? I think this is just storing the manually saved versions.

Spellcheck

For both doc notebooks!!!

License question

As part of my review (openjournals/joss-reviews#332) a question on licensing:

More than anything, the choice of license is also cultural (https://twitter.com/hadleywickham/status/873554179792355328) and it seems that permissive licenses are more popular for Python projects. Given that country_converter is "infrastructure" and not a scientific model, maybe it could make sense to release under BSD or MIT? This could improve its adoption, I have noted that some Python projects are reluctant to include GPL-ed code. There are of course good reasons to choose GPL and I have myself used it in scientific projects.

If you chose GPL because it in part builds on https://github.com/vincentarelbundock/countrycode, then maybe its authors should be mentioned in the LICENSE file?

DOI for “The World Input- Output Database (WIOD).” Working Paper

It seems the Working Paper editor added the paper to ResearchGate and assigned a doi:

https://www.researchgate.net/publication/287208987_The_World_Input-Output_Database_WIOD_Contents_Sources_and_Methods

Not sure if it's better to use this doi, adding it to the BibTeX file seems to not display the wiod.org url anymore when running pandoc.

It's the first time I notice a RG doi and there seem to have been some questions around them:

http://blog.impactstory.org/researchgate-doi/

So it's probably ok to leave the references as is, just thought I mention it.

(openjournals/joss-reviews#332)

Non-standard codes

After #24 I wanted to compare country_converter (which I use a lot as coco) - with pycountry which covers only ISO3 codes:

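import country_converter
import pandas as pd
import pycountry

# Load coco's country table and list every ISO3 code unknown to pycountry.
data = pd.read_csv(country_converter.COUNTRY_DATA_FILE, sep='\t', encoding='utf-8')
for _, (code, name) in data[['ISO3', 'name_short']].iterrows():
    try:
        pycountry.countries.get(alpha_3=code).name
    except (KeyError, AttributeError):  # some pycountry versions return None instead of raising
        print(code, name)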

This gives these (non-standard) codes:

BA1 British Antarctic Territories
CHI Channel Islands
KSV Kosovo
ANT Netherlands Antilles
EAT Tanganjika
EAZ Zanzibar

Not sure whether these are partly former ones. For Kosovo XK, and XKK seem to be used as placeholders:
https://geonames.wordpress.com/2010/03/08/xk-country-code-for-kosovo/

Maybe it's worth making it explicit in the docs that codes are amended.

Pandas deprecation warning

country_converter.py:412: FutureWarning: read_table is deprecated, use read_csv instead

missing continents for some countries using UNregion

I am not sure if this is UNRegion issue or country-converter but running:
countries = ['AQ', 'GS', 'IO', 'CX']
country_converter.convert(names=countries,src='ISO2' , to='UNregion')
I get NaN

AQ - Antarctica, GS - SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS,
CX - COCOS (KEELING) ISLANDS, IO - British Indian Ocean Territory

Wrong matching

If "excluding" appears before a country name, the string is still matched to that country,
e.g. "Asia excluding China" matches China.

Kosovo XKX not KSV

Thanks for your project!

I think Kosovo's 3 letter code should be XKX not KSV.

Coerce inputs to convert to list

Hi, thanks for the library.

I got bitten by unexpected behaviour upon passing a pandas series to convert. I would have expected the code either to raise a TypeError or to handle the input correctly. Instead, the series gets converted to a single string, which is matched against the country regexes. In many cases this actually gives the correct result in the end (although with Warning: More than one regular expression match for [list of all countries in series] being printed a million times). In a few cases, the formatting of the series string somehow prevents a match, and so a country or two are not included in the result.

These lines are the root of the problem:

names = list(names) if (
   isinstance(names, tuple) or
    isinstance(names, set)) else names

names = names if isinstance(names, list) else [names]

names = [str(n) for n in names]

If names is a pandas series, after these lines names will be a one-element list containing a single string representing the entire series (the result of calling str on the series).
The same thing will happen if you input a numpy array, or in fact anything which implements __repr__.

I suggest changing the code so it tries to coerce the input to a list, and raises a TypeError upon failure. For instance we could change the above lines to this:

if not isinstance(names, str):
    try:
        names = list(names)
    except TypeError as e:
        raise TypeError("names must be coercible to list") from e
    names = [str(n) for n in names]
else:
    names = [names]
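
To check how the proposed coercion behaves, it can be wrapped in a small standalone function. This is a sketch based on the snippet above, not the code that was merged into coco; the function name `coerce_names` is an assumption.

```python
def coerce_names(names):
    """Return `names` as a list of strings; a single string becomes a one-element list."""
    if isinstance(names, str):
        return [names]
    try:
        return [str(n) for n in list(names)]
    except TypeError as e:
        raise TypeError("names must be coercible to list") from e


print(coerce_names("France"))                    # ['France']
print(coerce_names(("France", 276)))             # ['France', '276']
print(coerce_names(n for n in ["AT", "DE"]))     # ['AT', 'DE']
```

One consequence worth noting: a lone non-iterable such as the integer 276 now raises a TypeError instead of being wrapped into a one-element list, so callers passing single numeric codes would need to pass them as strings or inside a list.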

Happy to make a PR if you agree.

Thanks again for the library, saved me tonnes of work.

Community guidelines

With regards to the "Community guidelines" question in openjournals/joss-reviews#332

Community guidelines: Are there clear guidelines for third parties wishing to

Contribute to the software

Report issues or problems with the software

Seek support

Maybe adding a CONTRIBUTING.md file would be useful? Also mentioning the issue tracker in the README.md could be helpful as people might come across the project e.g. from its PyPI page.

I would also be interested to read what kind of additional groupings you would consider for inclusion.

Remove or mark obsolete Zanzibar, Tanganjika and Netherlands Antilles

"In April 1964, the republic (Zanzibar) merged with mainland Tanganyika. This United Republic of Tanganyika and Zanzibar was soon renamed, blending the two names, as the United Republic of Tanzania,"
https://en.wikipedia.org/wiki/Zanzibar

Congo Dem. Rep.

Great job on this library. Easy to use and doing what it says on the can.
I had an issue with Congo vs DRC though. See the reproducible snippet below:

import country_converter as coco
some_names = ['Congo, Dem. Rep.',  'Congo, Rep.']
standard_names = coco.convert(names=some_names, to='name_short')
In [96]: standard_names
Out[96]: [['Congo Republic', 'DR Congo'], 'Congo Republic']

expected output is ['DR Congo', 'Congo Republic']

"Republic of Ireland"

First of all thanks for building this. I've been reconciling some geographic data by hand the tedious way so far and this looks like it's going to be a huge time saver. The one country string from my data source that country_converter didn't recognize was "Republic of Ireland", which is one of the official names of Ireland. See: [ https://en.wikipedia.org/wiki/Republic_of_Ireland ]. For myself, I'm just going to put in an if statement to catch it, but I thought you might want to know.

Thanks again!

Help message if no parameters are given

Currently, when no parameters are given, coco is not printing anything.
Instead, this should give a help message.

Git version tag

Version: Does the release version given match the GitHub release (v0.4)?

Could you tag the version published on PyPI in Git as well? This is also helpful for the Zenodo archiving.

(openjournals/joss-reviews#332)

Wrong ISO3 code for Palestine

According to
https://unstats.un.org/unsd/tradekb/Knowledgebase/Country-Code
and
https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3
PSE is the valid ISO3 code for Palestine (instead of PAL).
Needs to change.

Clarify Python version in setup.py

Maybe add

Programming Language :: Python :: 3

or similar to setup.py

(openjournals/joss-reviews#332)

Aggregation wiod example broken

In the doc notebook the example with a separate germany appears to be broken.

Country: Yugoslavia is not in the .tsv file

WARNING:root: Yugoslavia not found in regex

correcting names within a dataframe

Thank you for this package to @konstantinstadler
I have a quick question to whoever can answer and am wondering if we can do this using this package? I have a dataframe (in pandas format) and have a list of countries on one column. There could be multiple duplicates. Example:
df = pd.DataFrame({"Country":["Afghanistan ","Afghanistan ", " Afghanistan", "Åland Island", "Åland Islan"], "Values":[23,45,46,787,875]})

As you can see, some of them have leading or trailing whitespace, some names are incomplete, etc. I know we can fix this issue using regex, but I am wondering if you have already done this for us? Let's assume I want my country names fixed according to UN member. I want an extra column in which all errors are fixed, i.e. columns = Country, Values, Country_NameFixed
Goal : I want to input a df and want a pd.DF at the end.

Links to related software

Maybe add links to related software like

pycountry, iso-3166 or countrynames (for non-English names)

This could help potential users to decide which one they need and see advantages and features of country_converter

(openjournals/joss-reviews#332)

Heard and Mc Donald Islands

I think "Heard and Mc Donald Islands" should not have a space after the "Mc".

How to run tests

Could you maybe add some instructions on how to run the tests? This would be helpful for people not familiar with py.test.

I used (after cloning the repo):

python3 -m venv venv
./venv/bin/pip install -e .
./venv/bin/pip install pytest
./venv/bin/pytest --verbose

(openjournals/joss-reviews#332)

String "Netherlands Antilles" not matched to country

Huge fan of this library, but it's not currently matching the Netherlands Antilles:

import country_converter as cc
>> cc.convert(names=['Netherlands Antilles'])
WARNING:root:Netherlands Antilles not found in regex
'not found'

The issue seems to be the regex, although it's not entirely clear to me why since regex101 thinks it should work:

>>> import re
>>> re.match('^(?=.*\bant).*(neth.*|dutch)', 'netherlands antilles')
>>>

A simpler regex without the lookahead seems to work fine:

>>> re.match('^(neth.*|dutch).*ant', 'netherlands antilles')
<re.Match object; span=(0, 15), match='netherlands ant'>
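
A likely explanation for the empty result of the first re.match call (though not necessarily for the original "not found" warning) is Python string escaping: in a normal string literal, `\b` is the backspace character, so the word-boundary assertion never reaches the regex engine. Writing the pattern as a raw string makes the two calls comparable:

```python
import re

name = "netherlands antilles"

# Non-raw string: "\b" is the backspace character (\x08), which does not
# occur in the name, so the lookahead can never succeed.
plain = re.match('^(?=.*\bant).*(neth.*|dutch)', name)
print(plain)  # None

# Raw string: "\b" is passed through as a word boundary and the pattern matches.
raw = re.match(r'^(?=.*\bant).*(neth.*|dutch)', name)
print(raw is not None)  # True
```

Tools such as regex101 receive the pattern text directly, without Python's string-literal processing, which would explain why the pattern appeared to work there.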

Replace Channel Islands with Jersey and Guernsey

UNStats https://unstats.un.org/unsd/methodology/m49/ has Jersey and Guernsey. Remove or mark obsolete Channel Islands and add Jersey and Guernsey.

Namibia issue with ISO2

NA is Namibia, but when I run:
country_converter.convert(names='NA', src='ISO2', to='UNregion')
I get not found

in country-converter (0.5.2)

Issue with Romania

From Evert:

I believe the ISO3 code for Romania is ROU, not ROM. (see: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3)

Automatically detect ISO2, ISO3,UNnumeric and name input

multiple matching

If country names are stated multiple times, multiple country codes show.

e.g. China (P.R. of China and Hong Kong, China) becomes '[''CHN'', ''HKG'']' in ISO3 codes

Use a specific logger

Instead of using logging.warning and so on, could you get a specific logger and use that instead?
Otherwise, I think it's impossible to change the log level only for this package.
I.e.
This code does not print anything:

import logging
from country_converter import CountryConverter

logging.basicConfig(level=logging.DEBUG)
logging.getLogger().setLevel(logging.CRITICAL)
my_logger = logging.getLogger('mylogger')
my_logger.info('test')
logging.info('test')
logging.getLogger().info('test')
CountryConverter().convert(['france', 'sadflkj'], src='regex')

While this code prints my log message, but also a warning from CountryCoverter:

import logging
from country_converter import CountryConverter

logging.basicConfig(level=logging.DEBUG)
my_logger = logging.getLogger('mylogger')
my_logger.info('test')
CountryConverter().convert(['france', 'sadflkj'], src='regex')

Can coco.match warnings be suppressed?

I'm not sure if this is best done outside or inside of the match function, but usually when I run the function I'm aware that some matches won't be found and will look for and deal with multiple matches. Meanwhile the long list of warnings pushes useful information further away in my terminal. Can the warnings be turned off or piped somewhere out of the way?

Add EU classifications

EU15
EU21
EU27x (temp for brexit)

Add G8 and G20

Disable warning logging

By default, this package subscribes to root logging and uses logging.warning()

How to disable it without affecting root logger of the rest of the app?

Add third option to show all countries but not obsolete

It's great you have UN recognised as an option, but I think you need one more which is all countries except those that have become obsolete. One reason is that if you list countries in a region using the all option eg. Europe, you could get duplicates eg. getting Channel Islands as well as Jersey and Guernsey. UN recognised can be too strict eg. excluding Taiwan.

Does not work for "uk" and short name issue for Macau

Issue #1

cc.convert(names='UK',to='name_short')

gives you the following result

WARNING:root:UK not found in ISO2
'not found'

the code looks only for ISO2 for 2 letter input, even though the regex check would have captured this correctly

Issue #2
cc.convert(names='macau',to='name_short') gives macao

Ideal name is https://en.wikipedia.org/wiki/Macau
Macau also works with another python package "CountryInfo"

MESSAGE regions

Add regions from MESSAGE:
http://www.iiasa.ac.at/web/home/research/researchPrograms/Energy/MESSAGE-model-regions.en.html

New CLI method for regions

Provide a possibility to do something like
coco OECD
and getting all OECD countries

coco EXIO1 should give a unique list of EXIO1 countries ...

Receiving regex error when trying on a list of values

I receive the following error when trying to convert a list of countries:

  File "sample_scripts.py", line 21, in <module>
    names=list(data_frame["location"]), to="name_short"
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/country_converter/country_converter.py", line 319, in convert
    return coco.convert(*args, **kargs)
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/country_converter/country_converter.py", line 540, in convert
    na=False)][to].values]
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/pandas/core/strings.py", line 1954, in wrapper
    return func(self, *args, **kwargs)
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/pandas/core/strings.py", line 2763, in contains
    self._parent, pat, case=case, flags=flags, na=na, regex=regex
  File "/Users/simon/Academic/dev/env/lib/python3.7/site-packages/pandas/core/strings.py", line 441, in str_contains
    regex = re.compile(pat, flags=flags)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 234, in compile
    return _compile(pattern, flags)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 924, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 420, in _parse_sub
    not nested and not items))
  File "/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 645, in _parse
    source.tell() - here + len(this))

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html

EU28as?

But this certainly depends on familiarity and convention ...

(openjournals/joss-reviews#332)
