worldbank / gld

This is the repository for the Global Labor Database (GLD). It aims to contain all necessary information to understand what the GLD is and how it functions. It does not, however, contain any microdata. For any questions please contact the Focal Point ([email protected]).

Home Page: https://worldbank.github.io/gld/

License: MIT License

Languages: Stata 99.83%, R 0.17%, TeX 0.01%
Topics: survey, harmonization, labor-market, global

gld's Introduction

Welcome to the Global Labor Database Repository

This is the repository and homepage of the Jobs Group's Global Labor Database (GLD) at the World Bank. Here you will find all the information needed to understand, recreate, and edit the GLD.

The GLD aims to cover all labor force surveys worldwide, with a focus on lower-income countries, though on occasion the GLD team may cover other household surveys with a sufficiently detailed labor module.

Around the harmonization effort, the GLD team aims to create an open and collaborative ecosystem of contextual information and tools that helps users understand and leverage the data more quickly. When working with labor microdata, you should not start at square one and you should not have to travel alone: the GLD team is there to assist.

We hope, in turn, that you will also contribute to the GLD, from pointing out issues with our code or concepts to proposing new harmonizations. We aim to establish a new standard of transparency and collaboration. Please feel free to navigate the website as you please, though we recommend that first-timers start with the introduction to the GLD.



[Image: The Fighting Temeraire - Joseph Mallord William Turner]

gld's People

Contributors

alexandraqn, buscandoaverroes, david-berm, elbeejay, g4brielvs, giofsantos11, gld-focal-point, gronert-m, junying-t


gld's Issues

Check number of observations against documented cases

The PSA explicitly documents the number of observations for each round on its website, and potentially also in its metadata files (this needs to be investigated). Add and store this information in each year's do file where available; incorporate it into automated checks later (a possible form for such a check is sketched below).
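A minimal sketch of what such a check could look like (the documented count below is a hypothetical placeholder, to be typed in from the PSA website):

	* Hypothetical check: compare observations against the documented PSA count
	local documented_n 12345      // placeholder: value copied from the PSA website
	quietly count
	assert r(N) == `documented_n'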

Vocational training variable in INDIA EUE

This issue concerns the GLD variable vocational, labelled "ever received vocational training." To me, a straightforward equivalent in the India survey is Vocational_Training, defined as "whether receiving or received any vocational training." However, there are a few years when this variable is not available.

There is also a variable in the EUE called Technical_Education, which seems more or less like a proxy for vocational training. In the year 1999, for instance, @gronert-m used this variable for vocational, and that is what I did for the years when the Vocational_Training variable is not present.

However, in years when both variables are present, I noticed that they differ somewhat, as can be seen from the example below (from the 2005 survey). For instance, 2,889 people who reported receiving formal vocational training reported not receiving any technical education.

My question is: should I still use Technical_Education when the Vocational_Training variable is not available, considering that the two seem quite different? Or should I leave the GLD variable vocational missing when only Technical_Education is available? Or should I just use Technical_Education consistently for all years, since it is consistently available? I am not sure I stated the problem clearly, so please feel free to ask for clarifications.

| Education | Receive formal vocational training | Received vocational training: formal | Non-formal: hereditary | Non-formal: others | Did not receive any vocational training | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Non-technical education | 1,243 | 1,646 | 2,720 | 3,046 | 102,053 | 110,708 |
| Technical degree in agri/engineering, etc. | 21 | 37 | 1 | 4 | 195 | 258 |
| Diploma in agriculture | 6 | 10 | 1 | 0 | 24 | 41 |
| Engineering/technology | 128 | 495 | 10 | 12 | 661 | 1,306 |
| Medicine | 15 | 68 | 1 | 4 | 103 | 191 |
| Crafts | 12 | 58 | 6 | 2 | 41 | 119 |
| Other subjects | 105 | 334 | 8 | 38 | 480 | 965 |
| Diploma or certificate in agriculture | 1 | 10 | 0 | 0 | 6 | 17 |
| Engineering/technology | 23 | 75 | 1 | 2 | 179 | 280 |
| Medicine | 5 | 16 | 0 | 2 | 54 | 77 |
| Crafts | 6 | 8 | 1 | 0 | 21 | 36 |
| Total | 1,595 | 2,838 | 2,753 | 3,120 | 104,009 | 114,315 |

Review Industry Codes

An error was spotted in the industry coding: industry should not be generated like

	gen byte industry=floor(c18_pkb/10)
	replace industry=1 if c18_pkb >= 1 & c18_pkb <= 9
	replace industry=2 if c18_pkb==10 | c18_pkb==11

But instead like

	gen byte industry=.
	replace industry=1 if (c18_pkb>=1& c18_pkb<=4)		// to Agriculture
	replace industry=2 if (c18_pkb>=5 & c18_pkb<=9)		// to Mining

Review all industry and industry_2 coding to verify that the correct logic is being used.

I2D2 various small variable coding harmonization checks

Double-check the following across all years, once a draft code is complete:

  • labor module age is correct for each year
  • education module age is correct for each year
  • household size logic is correct for each year (remember: include all non-family members, but not boarders/workers)
  • consistency in restrictions on industry codes by number of jobs (njobs) across survey years
  • the [weight var]/(10000 * 4) logic for weights
  • edit the month variable to be manually generated instead of taken from the survey month (involves adding to iecodebook)? #47

Duplicate records in 1987 India Employment - Unemployment Survey

The individual-level datasets of the India 1987 EUE survey have 88 duplicate records covering 44 distinct person IDs. However, these are duplicates only in terms of the person ID variable, not with regard to other characteristics such as age and educational attainment. Dropping one of each pair would make sense if it constitutes a small fraction of the total sample and does not affect the survey's representativeness. Hence, it is important to look at the distribution of the duplicates across the states.

The table below shows that most of the duplicate persons are in West Bengal (0.2% of the total sample). If we were to keep one record, then that would only remove 0.1% of the total sample. Since that is a tiny fraction of the sample, dropping one of the duplicates would not be very problematic.

[Image: distribution of duplicate individual records by state]

But how can we select which duplicates to keep? I propose the following set of rules, which are admittedly not perfect but prioritize keeping the records with more useful information (a Stata sketch follows this list):
(1) Keep the duplicate with non-missing age, because age is an important filter variable for many labor market indicators.
(2) Keep the record with more labor market information (i.e., with fewer variables containing missing responses). As much as possible, we want to keep the record that has more to offer.
(3) If rules 1 and 2 do not settle it, keep the first instance.
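A minimal Stata sketch of these rules (the variable names personid, age, and lab_* are hypothetical placeholders; adapt to the actual EUE files):

	* Rule 3 tiebreaker: remember the original record order
	gen long order  = _n
	* Rule 1: records with non-missing age should sort first
	gen byte no_age = missing(age)
	* Rule 2: records with fewer missing labor variables should sort first
	egen n_miss = rowmiss(lab_*)

	sort personid no_age n_miss order
	by personid: keep if _n == 1      // keep one record per person ID
	drop order no_age n_miss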

The problem with this approach arises when merging two individual-level datasets: the kept record in one dataset may not be the exact match for the other. But again, since this is a small fraction of the dataset, it should not be very problematic. As agreed with Mario, it is important to note this caveat in the do file.

The same issue is found in the household-level dataset as can be seen below. Again, most of them are in West Bengal as well. Rules (2) and (3) above are applied in selecting the record to keep.

[Image: distribution of duplicate household records by state]

Verify number of observations per round

Following up on #47, we note that not all observations in some survey rounds are classified under the month to which the round corresponds. For example, suppose the observations in the April 2010 round were tabulated by the month variable:

tab [month_variable], missing

We see something like this:

| Month | Percent |
| --- | --- |
| January | 0.1% |
| April | 98.9% |
| July | 1.0% |

Thus, we have to determine whether the non-corresponding observations are reallocated to other rounds or whether they are nevertheless included in the survey round in which they appear. (For example, do the "July" and "January" observations get counted in the April round?)

Check that the number of observations per round indicated by the appropriate survey month reasonably corresponds with the total the PSA publishes.

migrated_years variable in the GLD

I want to seek everyone's thoughts on the variable migrated_years in the GLD codebook. Should the response for the number of years since the last migration be limited to people who report having migrated during the reference period? If that is the case, then all the responses we get for migrated_years will be equal to migrated_ref_time, which is the "reference time applied to migration questions." The following lines of code are from the do file template:

*<_migrated_ref_time_>
	gen migrated_ref_time = 1
	label var migrated_ref_time "Reference time applied to migration questions"
*</_migrated_ref_time_>

*<_migrated_binary_>
	destring B3_q8, gen(migrated_binary)
	recode migrated_binary (2 = 0)
	label de lblmigrated_binary 0 "No" 1 "Yes"
	label values migrated_binary lblmigrated_binary
	label var migrated_binary "Individual has migrated"
*</_migrated_binary_>


*<_migrated_years_>
	gen migrated_years = .
	destring B31_c16, gen(helper_var)
	replace migrated_years = helper_var if migrated_binary == 1
	label var migrated_years "Years since latest migration"
	drop helper_var
*</_migrated_years_>


recode variables in I2D2

While making the GLD I realized that the top value 99999 for "not reported" needs to be recoded as missing (.), e.g.:

	replace 		wage_no_compen = . if 	wage_no_compen == 99999
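For reference, Stata's mvdecode does the same in one line (a sketch; extend the varlist to other affected variables as needed):

	* Convert the 99999 "not reported" sentinel to system missing
	mvdecode wage_no_compen, mv(99999)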

Consider `njobs` variable

Look into whether the data for this variable should be included in the dataset at all. Consider:

  • coverage in rounds and years
  • correlation with industry, occupation, lstatus, and other labor variables

reg0x naming system

The reg01-reg04 variables in the survey refer to geographic divisions. reg02, reg03, and reg04 refer to the standard admin1, admin2, and admin3 levels; reg02 and reg01 can at times refer to the same variable.

The naming convention is as follows:
reg01: the preferred geographic area of analysis. This could be a broad conceptual area (such as the "northeast") or something as granular as the urban/rural divide within a third-level disaggregation.
reg02: the first level of geographic administrative disaggregation
reg03: the second level of geographic administrative disaggregation
reg04: the third level of geographic administrative disaggregation

Incorporate historical GAUL data

The goal is to:

  • Match any (or all) given GLD country/economy's administrative region to GAUL data, and
  • Do this across all survey years, applicable and valid for that year.

We need to figure out if this is possible and how one would go about doing this.

Useful links and ideas may include:

  • the countrycodes R package (which may interact with a database containing GAUL data, though this is unconfirmed)
  • the UN Food and Agriculture Organization's page on the GAUL data, which discusses how to access the historical data and licensing

Create labels for Province (Admin 2)

It's time to properly ensure the labels for the Second Administrative Division are in good shape across all years. As with Region (Admin 1), make sure there is ample documentation in md format where necessary. Ensure:

  • label consistency
  • documentation of any administrative changes
  • tracking of any changes in values

fix Rmd rendering on edu md

Tables created by RStudio's visual editor end up looking weird. Consider finding a solution or submitting an issue on GitHub.

I2D2/GLD Harmonization Across all surveys?

A two-part question/idea,

@alexandraqn @giofsantos11 @gronert-m @Junying-T

  1. Should we remove the value labels that were defined in the incoming/raw data and keep only the value labels that we use? For example, an "urban" variable could arrive labeled as "0 Rural 1 Urban", but we change it to the label specified in the I2D2 or GLD dictionary:

label define urban 1 "Urban" 2 "Rural"

Shouldn't the original urban value label be removed explicitly? Otherwise the resulting .dta file will have two urban value labels: one used in the dataset and one that is irrelevant. This could confuse the user.

This made me think of another idea:

  2. Since all of our final value labels for I2D2 or GLD should be the same, why not have one central place to pull these value labels from, avoiding (a) retyping them every time and (b) potentially committing errors as we type them? This way we know that the I2D2 or GLD value labels are the same across all datasets. This could be implemented as a central .do or .ado file that simply defines all the value labels by hand, or reads them from the codebook xlsx files -- kept next to our code folders and called in our code (or even just copy-pasted). The script could apply the labels as well: since you always know that the "urban" label will be applied to the urb I2D2 variable, you would just run it at the end of your code after all variables are made. (A sketch of what this could look like is below.)
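A minimal sketch of such a central script (the file name define_labels.do and the two labels shown are hypothetical; the real file would cover every harmonized variable):

	* define_labels.do -- central value label definitions, run once per harmonization
	label define lblurban 1 "Urban" 0 "Rural", replace
	label define lblmale  1 "Male"  0 "Female", replace
	* ... one label define per harmonized variable ...

	* In each country do file, after the variables are created:
	* label values urban lblurban
	* label values male  lblmale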

Thoughts?

GLD questions

  • ICLS-## 13, if no distinguishing between end goal of productivity type
  • urbanity as secondary stratification? no
  • still no labels for second administrative level? yes, try to harmonize similar to region. seek help from carto
  • how important is the GAUL data? It may be able to be imported and merged via wb poly data. would be nice. long term project. make separate issue
  • use constructed hhsize variable or given, considering the construction of hhids. yes
  • #82
  • #80
  • are we restricting these (ie edu) explicitly by age as in i2d2? yes.
  • edu isced codes? yes, try to map

Copy md guides

The I2D2 guides that are relevant to GLD should be copied and updated for GLD in the Support Folder next to ENOE

Create Reprex for Harmonizing Data Labels

Create a reproducible example that addresses #17, with data and code that can be easily shared outside the project, so that external collaborators can get a sense of the problem and potentially contribute. I would love for this code to one day work its way into iecodebook, because it naturally solves a complementary problem.

Comments on MEX 2005 GLD

@alexandraqn I wanted to comment on the version you have created for MEX 2005 GLD that now incorporates the ISIC version after merging in the information from the NAICS-ISIC correspondence. I comment here because I can link to specific lines, which makes it visually clearer.

The first thing I want to point to is the repetition of the local path_in. As you can see below, path_in is defined on line 60 but then again on lines 70 and 72. These later instances are not needed if the do file is run in one go; they only matter if you run the code piecemeal, since all previously created locals will have been wiped by the next time you run a chunk.

local path_in "C:\Users\wb582018\OneDrive - WBG\Surveys\MEX\MEX_2005_LFS\MEX_2005_LFS_v01_M\Data\Original"
local path_output "C:\Users\wb582018\OneDrive - WBG\Surveys\MEX\MEX_2005_LFS\MEX_2005_LFS_v01_M_v01_A_GLD\Data\Harmonized"
*----------1.3: Database assembly------------------------------*
* All steps necessary to merge datasets (if several) to have all elements needed to produce
* harmonized output in a single file
use "`path_in'\VIVT105.dta",clear
drop p1-p3
destring loc mun est ageb t_loc cd_a upm d_sem n_pro_viv ent con v_sel n_ent per, replace
local path_in "C:\Users\wb582018\OneDrive - WBG\Surveys\MEX\MEX_2005_LFS\MEX_2005_LFS_v01_M\Data\Original"
merge 1:m ent con v_sel using "`path_in'\HOGT105.dta", nogen
local path_in "C:\Users\wb582018\OneDrive - WBG\Surveys\MEX\MEX_2005_LFS\MEX_2005_LFS_v01_M\Data\Original"
merge 1:m ent con v_sel n_hog using "`path_in'\SDEMT105.dta"

The second issue refers to the location of the file. As seen below, lines 92 and 98 define where the NAICS-ISIC file is read from. This should be ...\MEX_2005_ENOE\MEX_2005_ENOE_v01_M\Data\Stata. Recall what we discussed yesterday: the INEGI Excel file is to be under Doc, the R code that creates the .dta file under Programs, and the output of the R code under Data\Stata. Furthermore, on the GLD-Harmonization system, please ensure that Stata files are under Data\Stata and not Data\Original - I think you have done this for all years but 2005.

*ISIC
***first job
gen scian_1=p4a
tostring scian_1, replace
merge m:1 scian_1 using "C:\Users\wb582018\OneDrive - WBG\Documents\Industry Classification\SCIAN_02_ISIC_3.1\SCIAN_02_ISIC_3.1.dta", keep(master match)
rename scian_1 scian_11
drop _merge
***second job
gen scian_1=p7c
tostring scian_1, replace
merge m:1 scian_1 using "C:\Users\wb582018\OneDrive - WBG\Documents\Industry Classification\SCIAN_02_ISIC_3.1\SCIAN_02_ISIC_3.1.dta", keep(master match)
rename scian_1 scian_2
drop _merge

The third issue concerns the same lines 92 and 98 in more detail. The lines are:

merge m:1 scian_1 using "C:\Users\wb582018\OneDrive - WBG\Documents\Industry Classification\SCIAN_02_ISIC_3.1\SCIAN_02_ISIC_3.1.dta", keep(master match)

merge m:1 scian_1 using "C:\Users\wb582018\OneDrive - WBG\Documents\Industry Classification\SCIAN_02_ISIC_3.1\SCIAN_02_ISIC_3.1.dta", keep(master match)

Previously you created scian_1 by first doing gen scian_1=p4a and then tostring scian_1, replace. (You can, by the way, do this in one command: tostring p4a, gen(scian_1).) So you now have the variables scian and scian_1 on your master file, while the NAICS-ISIC file contains the variable scian.

Your merge command merge m:1 scian_1 using "file path", keep(match master) does not work for me. Have you changed the R file so that SCIAN_02_ISIC_3.1.dta has a variable scian_1 instead of scian?

I propose the following, which did work for me:

*ISIC	
***first job
	rename scian scian_orig // get rid of the scian that was already there in the raw data
	tostring p4a, gen(scian)
	merge m:1 scian using "C:\Users\wb529026\OneDrive - WBG\Documents\Country Work\MEX\Industry Classification\SCIAN_02_ISIC_3.1.dta", keep(master match) nogen
	rename scian scian_1

***second job
	tostring p7c, gen(scian)
	merge m:1 scian using "C:\Users\wb529026\OneDrive - WBG\Documents\Country Work\MEX\Industry Classification\SCIAN_02_ISIC_3.1.dta", keep(master match) nogen
	rename scian scian_2

Note that my using "file path" in the merge commands above is also incorrect, as the SCIAN_02_ISIC_3.1.dta file is not under Data\Stata.

While I am here, a final point on variable abbreviation. The definition of male is based on the original variable sexo, which in your code is just sex. This works because Stata allows variable abbreviation by default.

*<_male_>
gen male = sex
recode male 2=0
label var male "Sex - Ind is male"
la de lblmale 1 "Male" 0 "Female"
label values male lblmale
*</_male_>

I think it is best to always reference the full variable name. I can even imagine you meant to type sexo but it came out as sex because you were typing quickly; this goes unnoticed because Stata silently fills in what it thinks you meant. You should be in full control of this process.
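One optional guard against this (a suggestion, not something currently in the template) is to switch abbreviation off at the top of the do file:

	* Disable variable-name abbreviation so a shortened or misspelled
	* name raises an error instead of silently matching a variable
	set varabbrev off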

Let me know of any questions you may have and thanks a bunch for this!

Check/Fix error in Industry Codes

At least in the post-2009 classification schema, for years starting in 2012, the code should read

	replace industry=3 if (`raw'>=10 & `raw'<=33)	// to Manufacturing
	replace industry=4 if (`raw'>=35 & `raw'<=39)	// to Public utility

Check that this is the case, and make sure the pre-2009 coding follows the cognate logic.

Household and Individual IDs

Since I'm not provided with the number of households (the PSA and ILO publish counts of individual cases, but not of households), I can't confirm how many households there are in each survey year. My own variable coding has resulted in some years with drastically fewer households than others, which means I should review the coding and make sure I'm confident in the number of households it produces.
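A quick sketch of how the household counts could be tabulated for review (hhid as the constructed household ID and year as the survey year are assumed names):

	* Tag one observation per household-year, then count households by year
	egen byte hh_tag = tag(year hhid)
	tab year if hh_tag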

Bibliography/keeping track of links and public references

A random thought/idea from my travels through various links and documentation sites: does it make sense to have one centralized place to collect all of our key public/internet references and links, and then publish it as a "Reference" document? Is this common practice, or even a good idea?

I'm not sure how one would do this; my only idea is to keep a BibTeX file -- I guess on GitHub -- add to it along the way like any other code file, and render an output file when we need to. I've never done this, though. Thoughts? @gronert-m @alexandraqn @Junying-T @giofsantos11

Inconsistent weight variable for some years

Some years (listed below) do not have a consistently constructed weight variable across all rounds. This means that, when appended, some rounds' observations are disproportionately scaled. There also doesn't seem to be an obvious way to scale or manipulate the weight variable so that it can be harmonized. Since some rounds do have multiple versions of the weight variable, we can test whether a newly scaled scaled_weight_var equals an existing, "correct" final_weight_var:

gen scaled_weight_var = wrongly_scaled_weight_var*10000
assert scaled_weight_var == final_weight_var

However, the assertion fails on ~30,000 observations, and a visual check shows that the translation between the different weight variables is clearly not simply linear.

This issue applies to years: 2005, 2007

Furthermore, for 2008 there is only one internal weight variable in each round -- so no opportunity to compare -- but the January weight variable has a much higher mean. Should I just scale January down to the other rounds without any way to confirm?

Furthermore, for 2009-2011 the problem is similar to 2008, except the only consistent weight variable is simply called weight. There is indeed an fwgt or "Final weight" variable, which is inconsistently coded across years, but the documentation is clear that this is the variable to use -- it says to use the variable indicated as "final weight". However, in 2012-13 the only weight variable available is not fwgt but weight, which leads me to believe that the latter may actually be the final weight variable even though weight is not labeled as "final weight". Further complicating matters, from 2014 onward the variable labelled "Final weight" is in fact the correct, lower-scaled, harmonized weight variable. In summary, everything is inconsistent.

So if all the rounds do not have the compatible weight variables, what should I do?

Urban / Rural distinction in ENOE data

The current coding for urban and rural (from what I saw, in all years) seems to be incorrect. The data checks already alert to this: for 2005, for example, the survey urbanization rate is 62% while the World Bank figure is 76.3%.

The current code is as follows:

*<_urban_>
gen byte urban=.
replace urban=1 if t_loc==1
replace urban=1 if t_loc==2
replace urban=0 if t_loc==3
replace urban=0 if t_loc==4
label var urban "Location is urban"
la de lblurban 1 "Urban" 0 "Rural"
label values urban lblurban
*</_urban_>

That is, locations with at least 15,000 inhabitants (t_loc is 1 or 2) are deemed urban, and those with fewer are said to be rural. The official definition, however, is that locations with fewer than 2,500 inhabitants are rural (the images below are from the ENOE glossary and the ENOE teaching-resources websites).

[Image: ENOE glossary definition of rural localities]

[Image: ENOE teaching-resources definition of rural localities]

If I run the corrected coding (shown at the end of this issue), I get an urbanization rate of 76.88% for 2014 and 76.6% for 2005.

[Image: urbanization rates under the corrected coding]

This is a smaller increase than reported in the World Bank data (where the rate goes from 76.3% to 78.9%, see image below), but it is much closer to the reported numbers - and it uses the official definition.

[Image: World Bank urbanization series for Mexico]

Lastly, just to avoid confusion, note that - at least in 2005 - the "VIV" data set has a variable ur labelled "Urbano/Rural", but as can be seen below, this cuts across all locality sizes and would lead to even lower urbanization rates.

[Image: tabulation of ur against locality size]

In closing, the definition of urban (and indeed urb in I2D2, too) needs to change to:

*<_urban_>
	gen byte urban=.
	replace urban=1 if inrange(t_loc,1,3)
	replace urban=0 if t_loc==4
	label var urban "Location is urban"
	la de lblurban 1 "Urban" 0 "Rural"
	label values urban lblurban
*</_urban_>

GLD questions

  • vocational training variable - meant to capture pre-professional training?
  • annualisation variables: when can we create manually and when should we not (ie t_hours_total vs t_hours_annual) in section 8.10
  • can we assume the following about wages:
t_wage_nocompen_total_year <= t_wage_total_year == linc_nc == laborincome == total_wage 
// if there's no information on extra compensation, etc.?
  • add labor status to min age at end?
  • Why does PSIC 1994 give the ISIC 3.1 conversion if ISIC 3.1 was released in 2002?
  • isic/isco codes
  • Dplyr filtering across all and replace with multiple functions, see here
  • Assumption that we want the appropriate key in each year's /Work directory
  • For isco/isic, only match data given in 4-5 digits?

industrycat_10 assignment of Transport and Communication

I wanted to highlight the assignment of category 7 (Transport and Communication) and category 8 (Financial and Business Services) for versions based on ISIC4 (that is, ENOE surveys from 2008 on).

You currently have:

gen industrycat10= substr(industrycat_isic, 1,2)
gen industrycat10_helper=.
replace industrycat10_helper=1 if industrycat10=="01" | industrycat10=="02" | industrycat10=="03"
replace industrycat10_helper=2 if industrycat10=="05" | industrycat10=="06" | industrycat10=="07" | industrycat10=="08" | industrycat10=="09"
replace industrycat10_helper=3 if industrycat10=="10" | industrycat10=="11" | industrycat10=="12" |industrycat10=="13" | industrycat10=="14" | industrycat10=="15" | industrycat10=="16" | industrycat10=="17" | industrycat10=="18" | industrycat10=="19" | industrycat10=="20" | industrycat10=="21" | industrycat10=="22" | industrycat10=="23" | industrycat10=="24" | industrycat10=="25" | industrycat10=="26" | industrycat10=="27" | industrycat10=="28" | industrycat10=="29" | industrycat10=="30" | industrycat10=="31" | industrycat10=="32" | industrycat10=="33"
replace industrycat10_helper=4 if industrycat10=="35" | industrycat10=="36" | industrycat10=="37" | industrycat10=="38" | industrycat10=="39"
replace industrycat10_helper=5 if industrycat10=="41" | industrycat10=="42" | industrycat10=="43"
replace industrycat10_helper=6 if industrycat10=="45" | industrycat10=="46" | industrycat10=="47" | industrycat10=="55" | industrycat10=="56"
replace industrycat10_helper=7 if industrycat10=="49" | industrycat10=="50" | industrycat10=="51" | industrycat10=="52" | industrycat10=="53"
replace industrycat10_helper=8 if industrycat10=="64" | industrycat10=="65" | industrycat10=="66" |industrycat10=="77" | industrycat10=="78" | industrycat10=="79" | industrycat10=="80" | industrycat10=="81" | industrycat10=="82"
replace industrycat10_helper=9 if industrycat10=="84"
replace industrycat10_helper=10 if industrycat10=="58" | industrycat10=="59" | industrycat10=="60" |industrycat10=="61" | industrycat10=="62" | industrycat10=="63" | industrycat10=="68" | industrycat10=="69"| industrycat10=="70" | industrycat10=="71" | industrycat10=="72" | industrycat10=="73" | industrycat10=="74" | industrycat10=="75" | industrycat10=="85" | industrycat10=="86" | industrycat10=="87" | industrycat10=="88" | industrycat10=="90" | industrycat10=="91" | industrycat10=="92" | industrycat10=="93" | industrycat10=="94" | industrycat10=="95" | industrycat10=="96" | industrycat10=="97" | industrycat10=="98" | industrycat10=="99"
replace industrycat10_helper=. if lstatus!=1
drop industrycat10
rename industrycat10_helper industrycat10

That is, category 7 only covers codes 49 to 53, which corresponds to Transportation and storage. The codes for Information and communication (58 to 63) are assigned to category 10 (Other services, unspecified). Similarly, category 8 should include real estate as well as professional, scientific, and technical activities (codes 68 to 75), which you currently also have under Other. This needs updating in all surveys.

Additionally, as a recommendation, I would suggest converting the first digits to integer (industrycat_isic needs to be a string of length four, but your helpers can be whatever serves you best). I suggest the following, which I find easier to read (this is personal preference, so feel free to ignore the suggestion).

gen helper = substr(industrycat_isic, 1,2)
destring helper, replace
gen industrycat10 = .
[... other lines ...]
replace industrycat10 = 6 if inrange(helper, 45, 47) | inrange(helper, 55, 56) 
replace industrycat10 = 7 if inrange(helper, 49, 53) | inrange(helper, 58, 63) 
replace industrycat10 = 8 if inrange(helper, 64, 82)
replace industrycat10 = 9 if helper == 84
replace industrycat10 = 10 if inrange(helper, 85, 99)
[... other lines ...]
drop helper

Use given month variable or create own?

I've been using the month variable given in the survey data for consistency, but sometimes the data don't really make sense. For example, April 2017's lmonth shows data from January and October. It just so happens that, for coding reasons, I need the variable to say only "April" for this year. But the larger question is: how should we generate "round" or "wave" variable data? Should it come from the dataset itself -- potential quirks and all -- or should we generate the variable manually?

read_pdf() group differences when line y values slightly differ

The function depends on pivoting across groups, which are determined by sharing the same y value, i.e. the same vertical position on the page. In a few cases, values in the same row actually vary by one unit, so the function won't detect them as the same group. This will have to be fixed by some sort of automatic grouping process; otherwise these values will be lost.

GLD international schemas to match

An overall list of international/UN schemas to match:

  • ISCED: for educat_isced
  • ISIC: (Industry) for industrycat_isic industrycat_isic_2 industrycat_isic_year industrycat_isic_2_year
  • ISCO: (Occupation) for occup_isco occup_isco_2 occup_isco_year occup_isco_2_year

Variable Factor Level differences within year

An initial investigation of the first ~13 years (half of the data) suggests that some years have vast discrepancies in how variables are named, labelled, and value-labelled, which obviously creates serious problems for appending rounds within the same year.

  • Years 1997-2002 have near-perfect matches; in fact, most of these years only have one data file.
  • Years 2003-2007 have almost no matches across rounds/files.
  • Years 2008-2010 have about 85% matches across rounds/files. Most of these misalignments are superfluous variables that can be dropped.

iecodebook append can fix most of these situations via its match option, which finds variables of the same name and aligns them. But iecodebook alone can't determine which round-to-round variables are actually the same but named differently (even a name off by one or two characters throws off iecodebook), so a lot of manual checking is required. Even so, this doesn't fix the following:

  • The manual checking will take time. We have to be 100% sure that the append and recode are right, because once the data are changed there is no way to verify this in the dataset post facto, other than going through the code history in the repo.
  • To prove my point about a need to harmonize data labels, these are a sample of the data labels for region, confirmed on the PSA's website:
    1, 2, 3, 4, 5, (...omitted for brevity...), 16
    and then for the second half:
    1, 2, 3, 5, (...omitted for brevity...), 16, 41, 42

Improving read_pdf with header text

read_pdf() does work well as a somewhat generalized function, but it assumes that there is a header on each page. This is not the case with the 1994 ISCO codes, where ~98% of pages have no header but a few do.

We need a way for the function to read the info on pages where there is no header, because otherwise that info just gets cut off automatically. My initial thoughts are:

  1. Create a boolean toggle that tells the function whether or not there is a permanent header on each page.
  2. If toggle == FALSE, the function reads data across the entire y span of the page, then filters on the following conditions:
  • the data are alphabetic (this removes words)
  • the data match numeric parameters supplied by the user in the original function call and lie above a certain y bound, also supplied by the user. Examples would be a numeric year appearing in a variable name, or "version1.0".

It's very unlikely, but not impossible, that under these conditions the function could filter out real data: the numbers that usually appear in header titles would have to be exactly the same numbers that appear in the data, and in the top ~5% of the page or so.

Length of variable labels

Hi @gronert-m @alexandraqn @Junying-T, my editor highlighted in red that some variable label strings are longer than the specified maximum length (80 characters). This comes from the newest version of the GLD template. I'm not sure exactly what happens, but I think Stata just keeps the first 80 characters. In any case, shouldn't we shorten them to 80 characters or fewer? What should they be? I can then change them in the template.

industrycat4                "1 digit industry classification (Broad Economic Activities), primary job 7 day recall"      // 85 chr
industrycat4_2              "1 digit industry classification (Broad Economic Activities), secondary job 7 day recall"    // 87 chr
t_wage_nocompen_others      "Annualized wage in all but primary & secondary jobs excl. bonuses, etc. 7 day recall"       // 84 chr
industrycat4_year           "1 digit industry classification (Broad Economic Activities), primary job 12 month recall"   // 88 chr
industrycat4_2_year         "1 digit industry classification (Broad Economic Activities), secondary job 12 month recall" // 90 chr
t_wage_nocompen_others_year "Annualized wage in all but primary & secondary jobs excl. bonuses, etc. 12 month recall)"   // 88 chr
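A small sketch to flag possibly truncated labels programmatically (assumes the harmonized file is in memory; since Stata silently cuts labels to 80 characters at assignment, a label of exactly 80 characters is suspicious):

	* Labels at exactly 80 characters were likely truncated at assignment
	foreach v of varlist _all {
		local lab : variable label `v'
		if strlen(`"`lab'"') == 80 di as error "`v' may have a truncated label"
	}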

PHL Bulletins for the latest years

The Philippine Statistics Authority (PSA) has a website listing LFS Bulletins, which function as Basic Information Documents containing all relevant details. These are key to documenting the surveys for users.

The records are really good up to 2015, but for the 20 surveys covering the years 2016 to 2020 there are only three bulletins. I have been in touch with the PSA (see the email exchange at the end of this issue) but kindly ask you, @buscandoaverroes, to also email the Income and Employment Statistics Division, putting me in copy and asking for the release of all the remaining bulletins.

In addition, please explain this issue to Arianna. She may know more about where to find the missing bulletins or - if she did not know about the bulletins - it is a great source we should make her aware of.

Region factor levels/coding

I noticed that the original code in PHL 2015 recodes some values of reg (what becomes reg01 / reg02), because the raw factor codes split an existing region into "sub-regions" A and B. The issue may appear elsewhere in the project. The original code is:

	gen byte reg01=reg
	recode reg01 (41=4)  (42=4) (17=4)

Using codebook reg you discover that there is no value for reg=4, and that reg=41 and reg=42 refer to regions 4A and 4B respectively. This poses two major questions/issues:

  1. Can this be fully resolved via a script as planned?
  2. What should actually be done with the distinct factor labels? Can we assume that the subregions should go under the missing region, and should this be done for all similar cases?

Concat ID

Example of the problem:

| Number obs | var_forID |
| --- | --- |
| 1 | 4 |
| 1 | 44 |
| 1 | 444 |
| 1 | 4444 |

When I use this variable as part of egen ... = concat(), it creates an ID with an uneven number of digits.
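A possible fix (a sketch, assuming var_forID is numeric with at most four digits, and other_id_var is a placeholder for the remaining ID components) is to zero-pad each component before concatenating:

	* Zero-pad so every component has the same width before concatenating
	gen str4 idpart = string(var_forID, "%04.0f")
	egen newid = concat(other_id_var idpart)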

PHL debug to-do list

  • drop all empty variables even if in codebook, do after order
  • figure out a unique HHID for years 2017 - ?
  • drop unused labels
  • ensure in surveys with different employment status per year that cempst2 -> newempst and cempst1 -> empst1
  • #66
  • for some years, compare population and labor force participation to official numbers
  • add "currently attending" variable for 2004-19. See label script for table object.
  • fix multi-round month (april) for 2017, see #52

ZAF To do list 7.2 - 7.9


  • Check and document # of observations and households dropped during HH assignment for 2008-2019
  • Amend value labels for Province in 2008 and 2009 (the raw dataset wrongly attached value labels of month to Province)
  • Recode Q42OCCUPATION in 2008 to follow the coding rule of SASCO 2003
  • Document the versions of the industry and occupation classifications used in each year; match national classifications to international standards
  • Check category 9999 of Q42OCCUPATION; it should be missing for occup as 9999 is unspecified

Few observations without household head

My I2D2 checks have caught a handful of observations without a household head in a few years. Since we discussed this earlier, I'd love to hear what others have to say. @gronert-m @alexandraqn @Junying-T @giofsantos11

This is the current state of unexpected no-household-head observations by year:

| Year | Observations | No. of HH heads | No. of distinct HH IDs |
| --- | --- | --- | --- |
| 1997 | 4 | 0 | 1 |
| 2004 | 4 | 0 | 1 |
| 2007 | 13 | 0 | 3 |
| 2017 | 11 | 0 | 2 |

In these households, and in my data more broadly, I don't feel immediately comfortable declaring a household head by rule, primarily because I had to construct the household IDs myself in most surveys; the raw data did not always provide them. While I'm confident that the household IDs are quite close to the actual households in most cases, this 'backward' construction introduces another layer of potential ambiguity. I worry about declaring a household head in a household that I constructed post facto.

My thoughts are to either:

  1. Leave households without a household head as-is.
  2. Choose a household head at random (reproducibly at random; see the sketch below).
  3. Drop all observations in households without a household head, if these cases fall below a certain threshold (0.01%).

My initial preference is for 1, since this leaves the user with the most freedom. What are your thoughts -- and to what extent should this situation be treated differently than ZAF?
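For completeness, a sketch of what option 2 could look like (relationharm == 1 marking the head, and hhid as the household ID, are assumed names based on the GLD dictionary):

	* Reproducibly pick a random head in households that lack one
	set seed 123456                                  // fixed seed makes it reproducible
	gen double draw = runiform()
	egen byte has_head = max(relationharm == 1), by(hhid)
	bysort hhid (draw): replace relationharm = 1 if _n == 1 & has_head == 0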

I2D2 checks assume that njobs and unitwage data are in sync but actually this is not true in data

In the I2D2 checks we assert that

di "check unitwage_2 only for njobs>0"
assert unitwage_2==. if (njobs==0 | njobs==.) & jobs_var == 1         // only perform if njobs exists
}

But this does not hold for about 20% of observations, meaning that in 20% of cases there is a non-missing value for unitwage_2 where the data say there should be no second job. This hasn't been tested for unitwage, but the same scenario would presumably occur.

The problem is that the assumption behind this check does not apply to this particular dataset. The check assumes that njobs, unitwage, industry, and the other job/labor variables are in sync: if I have one job in industry a, that is reflected in unitwage, industry, etc. But in this dataset njobs is not consistent with what is reported in the other variables, and this holds in other rounds and other years. Maybe the best solution is simply to remove this check here, because we have to take njobs as what is reported.
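Before removing the check entirely, it may be worth quantifying the inconsistency in each round, e.g. with a small sketch like:

	* Count cases where a second-job wage exists but njobs reports no extra job
	count if !missing(unitwage_2) & (njobs == 0 | njobs == .)
	di r(N) " observations have unitwage_2 without a second job in njobs"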

Duplicates in ZAF2020 QLFS Quarter 2

There are 668 duplicated observations (169 households) in quarter two of 2020, which might result from a coding error in the variable UQNO. They account for 0.32% of the total sample size. These observations cannot be uniquely identified by household number and person number within quarter two, but most of them can be uniquely identified by the full set of variables.

duplicates report _all

[Image: duplicates report output]

In other words, only 5 observations out of all observations in the dataset are “truly” duplicated. These 5 observations have identical information for all variables.

[Image: duplicates tag output]

The others are "partially" duplicated: they share UQNO and PERSONNO but in fact refer to different individuals. As Mario pointed out, gender, population group, and age differ for the same person (in terms of PERSONNO) within the same household, as shown in the picture below, taking household #199103740000006016 as an example.

[Image: records of household #199103740000006016]

It seems that lines 1047, 1050, and 1051 form one household, while 1048, 1049, 1052, and 1053 form another. If that is the case, the UQNO codes should be corrected.
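For reference, a sketch of how the partial duplicates can be isolated (UQNO and PERSONNO as in the text above):

	* Tag records sharing household and person number ...
	duplicates tag UQNO PERSONNO, gen(dup_id)
	* ... and records identical on every variable
	duplicates tag _all, gen(dup_all)
	* Partial duplicates: same IDs, but not full-row copies
	count if dup_id > 0 & dup_all == 0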

Add "trim" argument in `best_match()`

In my use case I don't need this, but to make the function more broadly applicable we should add a trim=[int1] argument. This would allow the user to trim the column on the fly with some sort of stringr function, making the column match the length of the international code column. This is done in the original code anyway.

Update in filenaming convention

A conversation with the team running Datalibweb has highlighted that file names need to be updated.

I have updated the guidelines accordingly (in commit 4a1f0f4), amending the required names of

  • harmonization .do files and
  • harmonized .dta files

So, for example, a name like MEX_2007_ENOE_V01_M_V01_A_GLD.dta needs to change to MEX_2007_ENOE_V01_M_V01_A_GLD_ALL.dta, where the ALL part ensures that datalibweb's query command, dlw, country() year() type(gld) mod(ALL), can correctly read the information contained in the name.

That is now in the guidelines and is the way forward. Thinking about it now, it is enough to amend only the .dta files. Since this is fairly mechanical, and I needed to do it for Angelo's data anyway, I wrote a little program to automate the process. You can find it in this gist, but we can talk about it tomorrow at the weekly meeting (or next week in the case of Alex, who is on leave for this meeting).

Edit

We need to change the names from the start to both .do and .dta files.

Differences in actual minimum module ages within same year

In theory, the minimum applicable ages for the labor and education modules should be the same within the same year (unless a change is documented). But a few years in the data show one particular round with a much lower minimum age than the rest of the rounds in the same year; see year 2017 with pufc23_pclass as an example. I want to talk more about this during meetings, but for documentation purposes I also want to record the issue here.

What should happen when rounds in the same year display different minimum ages? My instinct is to simply document this and code the year's minimum age for the module as the lowest age recorded in any wave for that year-module combination. Doing this would preserve all observations.
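A quick diagnostic sketch for such cases (round and age are generic placeholder names; pufc23_pclass is the module variable from the 2017 example):

	* Minimum age with a non-missing labor module response, by round
	preserve
	keep if !missing(pufc23_pclass)
	collapse (min) min_age = age, by(round)
	list
	restore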

I2D2 Check

Mini issue/fix tracker based on feedback from I2D2 checks. Will expand into individual issues if necessary.

  • #70
  • 1998: unempl_u does not satisfy the check against lstatus and age
  • error thrown because occup_orig is string, not numeric; create a filter with ds, has(type numeric)
  • edit the logic for njobs to account for valid situations where empstat_2 etc. are non-missing but njobs is missing
  • #72

Matching multiple old values to current ISIC/ISCO value

The function in international_codes.R generally produces satisfactory results. However, a conceptual issue arises in the R code when multiple ISIC or ISCO codes from a previous year's schema match a single code in the current year's schema. Here's an example table:

Table 1

| This year's code | Description | Old Version 0.1 Codes |
| --- | --- | --- |
| 1234 | A description | 0001, 0002, 0003 |

This would be great, except the PDF actually has these values in separate rows, with empty info in other cells, like below:

Table 2

| This year's code | Description | Old Version 0.1 Codes |
| --- | --- | --- |
| 1234 | A description | 0001 |
|  |  | 0002 |
|  |  | 0003 |

A human can tell that the values in the empty-celled rows belong with the most recent non-empty cells, but this is very difficult for the computer once the relational structure is lost. The read_pdf() function does actually read all the "old" values; the problem is that they end up in separate rows (which is exactly how they appear in the raw PDF). In a sense, the function works too well, because now we have near-empty rows with no real indication of where they belong -- they're just lost in the massive tibble that gets imported.

I'm posting the issue here for brainstorming and problem-solving -- we need to figure out a way to keep all old values associated with the new value.

Education levels recoding + Documentation

For most years, the raw variable for highest level of education has factor levels that are self-explanatory. But for some years the factor levels are quite extensive and require documentation for proper interpretation, which I cannot find. Neither the ILO, the PSA technical note, nor the PSA documentation explains the education levels (i.e., what distinguishes "college" from "post-secondary" from "tertiary" education). I'd like to know more before recoding.

2020 Data missing in WDI for mexico - Q checks GLD

Hi @gronert-m ,
When I ran the GLD checks for the 2020 data, I realized that the check stops in section 9 (WDI comparison). After inspecting the WDI merge variables, I realized that WDI does not have up-to-date data for the variables we want to compare, so I commented out section 9 from the static GLD helper Q check template. I wonder if we could build code so that, if the merged WDI variables are entirely missing, the check skips this section (a sketch of what I mean is below). If I am mistaken or missing something here, please let me know. I am not sure if this also happens for other countries with recent data (2020 onwards).
Cheers,
Alexandra
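A minimal sketch of such a guard (wdi_urban_rate is a hypothetical name for one of the merged WDI variables):

	* Skip the WDI comparison when the merged series is entirely missing
	quietly count if !missing(wdi_urban_rate)
	if r(N) == 0 {
		di as txt "No WDI data for this year; skipping section 9"
	}
	else {
		* ... run the section 9 comparisons ...
	}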
