
db-acs's Introduction

db-acs

Instructions:

  1. Go to https://api.census.gov/data/key_signup.html to sign up for an API key, and set the environment variable API_KEY (e.g. export API_KEY=<your key>)

  2. Run the scripts in order

    python3 01_download.py
    python3 02_nta.py
    python3 03_calculation.py
    python3 03.1_special_calculations.py
    python3 03.2_recalculate.py
    python3 05_pivot.py
  3. The final outputs will be saved as the following:

    demo_final_pivoted.csv  
    econ_final_pivoted.csv  
    hous_final_pivoted.csv  
    soci_final_pivoted.csv


db-acs's Issues

Change intermediate table naming -- low priority

Intermediate tables have suffixes that aren't informative: no suffix, _intermediate, _final, _final1, _final2, _final_pivoted

Proposed naming convention:

  • _acs for the table containing concatenated, downloaded ACS variables for a particular category (currently has no suffix)

  • _nta_sums for the table of nta-level aggregated estimates and MOEs (currently _intermediate)

  • _nta_combos for the table of nta-level aggregated estimates and MOEs, which includes new variables that are combinations of other variables -- either sums or proportions -- and checks for controlled variables (currently _final)

  • _nta_meds for the table of nta-level aggregated estimates and MOEs, which includes interpolated medians and special variables (currently _final1)

  • _nta_recalc for tables of nta-level aggregated estimates and MOEs, which have been recalculated in case the denominators have changed in the previous two tables (currently _final2)

  • _nta_pivoted for tables that have complete, nta-level aggregated estimates and MOEs, where the columns give the geography type/ID, the table and variable represented, and the estimate and MOE for that variable (currently _final_pivoted, which appears to be a pivot of _final rather than of _final2)

Estimation that involves subtraction

I noticed that for some variables, such as WrkrNotHm, the estimation involves subtracting values rather than adding them, as most variables do. In this case:

WrkrNotHm = DP03_0018E - DP03_0024E
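For reference, Census Bureau ACS guidance uses the same MOE formula for a difference of independent estimates as for a sum: the square root of the sum of the squared MOEs. A minimal sketch (the function name and the sample MOEs are illustrative, not from this repo):

import math

def moe_of_difference(moe_a: float, moe_b: float) -> float:
    """MOE of (A - B) for independent estimates A and B; per ACS
    guidance this is the same formula as for a sum."""
    return math.sqrt(moe_a ** 2 + moe_b ** 2)

# WrkrNotHm's estimate is DP03_0018E - DP03_0024E, so its MOE combines
# DP03_0018M and DP03_0024M (the MOE values below are hypothetical)
wrkrnothm_m = moe_of_difference(120.0, 45.0)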

2018 update

Updates:

  • EA_BchDH is currently being calculated; it can now be pulled from the API via DP02_0067
  • Mv10Ltr will now be pulled from DP04_0051, DP04_0052, DP04_0053
  • Mv90t99 will now be pulled from DP04_0054
  • Mv80t89 will be renamed MvBf89 and pulled from DP04_0056
  • MvBf79 will be removed

Refer to 4b144c0 for metadata updates.

Catch controlled variables

For variables that contain controlled variables, CV and MOE should be 0 instead of NULL. We need to figure out how to identify them.
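A minimal sketch of one way to handle this once the controlled variables are identified (the variable list is hypothetical; column names follow the c/e/m/p/z schema used elsewhere here):

import pandas as pd

# hypothetical list; the real set of controlled variables still
# needs to be identified
CONTROLLED_VARIABLES = {"Pop_1", "Pop65pl"}

def zero_out_controlled(df: pd.DataFrame) -> pd.DataFrame:
    """Set MOE (m) and CV (c) to 0 instead of NULL for controlled estimates."""
    mask = df["variable"].isin(CONTROLLED_VARIABLES)
    df.loc[mask, ["m", "c"]] = 0
    return df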

calculation errors on population's end?

In the social table:
for geoid = 1015100 and variable=DfHsUS

  • population PE: 88.1
  • EDM PE: 12
  • population PM: 54.7
  • EDM PM: 4.6

Based on the variable lookup:

"DfHsUS": [
        "DP02_0080"
    ],

Pulling the pre-calculation data:

>>> dff = df.loc[df.GEO_ID=='1400000US36061015100', ['DP02_0080E', 'DP02_0080M', 'DP02_0080PE', 'DP02_0080PM']]
>>> dff
      DP02_0080E  DP02_0080M  DP02_0080PE  DP02_0080PM
2283      1000.0       402.0         12.0          4.6

In this case we take PE and PM directly from the ACS without calculation, so this looks like a mistake on Population's end (assuming the variable mapping is correct).

2018 update feedback

  • Demographic:

    • OK with the MdAge differences.

  • Social:

    • Updated disability data is in the attached file. The data I have for EA_BchDH matches the downloaded data; it looks like all of the estimates for this one from the error sheet are doubled.

  • Economic:

    • There's something off with the CVs for MdNFInc; is there a separate formula for that variable? The poverty variable errors should be fixed with the updated econ sheet I sent yesterday.

    • There are large differences in estimates for MN99 and QN99 for MdEMFTWrk, and differences in MOEs for BX98 for MdEMFTWrk and BK99 for MdNFInc.

  • Housing:

    • BK93 and QN99 have large differences in MOE for MdVl, and BK99 has a difference in estimate for MdVl.

percentage estimate set to 100

WITH a AS (
    SELECT variable, count(*) AS count
    FROM pff_social."Y2013-2017-old"
    WHERE p = 100
    GROUP BY variable),
b AS (
    SELECT variable, count(*) AS count
    FROM pff_social."Y2013-2017"
    WHERE p = 100
    GROUP BY variable)
SELECT a.variable, a.count AS old_count, b.count AS new_count
FROM a, b
WHERE a.variable = b.variable;

From this query we can see that, for some variables, the number of records with a percentage estimate of 100 differs substantially between the old and new tables.

We need to figure out whether P should be set to 100 for certain variables by definition, and if so, how to identify those variables.

New total lookup function

The majority of the mismatches are caused by differences in P and Z. We need a lookup table so that, for each variable, we can find the correct total to calculate the percentage against, ideally for every variable. Proposed structure:

"variable_1":{ "type": "total", "total_variable":"N/A"}, 
"variable_2":{"type": "not total", "total_variable":"variable_1"}

We would then calculate all the total variables first, and the non-total variables after, to get the correct P and Z. This is also convenient because it would give us more accurate P=100 values.
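A minimal sketch of that two-pass calculation, using the proposed lookup structure (the function name is illustrative; the percent-MOE formula, including the fallback when the value under the square root is negative, follows standard ACS guidance):

import math

lookup = {
    "variable_1": {"type": "total", "total_variable": "N/A"},
    "variable_2": {"type": "not total", "total_variable": "variable_1"},
}

def calculate_p_z(estimates, moes, lookup):
    """Two passes: totals first (P=100), then non-totals against their total."""
    p, z = {}, {}
    # pass 1: total variables are their own universe
    for var, meta in lookup.items():
        if meta["type"] == "total":
            p[var], z[var] = 100.0, None
    # pass 2: non-total variables, as percentages of their total
    for var, meta in lookup.items():
        if meta["type"] == "total":
            continue
        total = meta["total_variable"]
        e, agg = estimates[var], estimates[total]
        if not agg:
            p[var], z[var] = None, None
            continue
        ratio = e / agg
        under = moes[var] ** 2 - ratio ** 2 * moes[total] ** 2
        if under < 0:  # ACS guidance: fall back to the ratio formula
            under = moes[var] ** 2 + ratio ** 2 * moes[total] ** 2
        p[var] = 100.0 * ratio
        z[var] = 100.0 * math.sqrt(under) / agg
    return p, z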

2014-2018 new issues

To have all of the errors I've found so far in my checks in one place:

  • Overall problem of data not appearing where I have data: the database shows 0 while the API pull shows null.
  • Social: abroad, DfHsDfCnt, DfHsSmCnt, DfHsUS: the percent and percent MOE columns should be calculated off of DfHs2, not pulled directly from the Census Bureau's data (which I think is the problem).
  • Economic: there may be some issues with MdEWrk, MdEFFTWrk, MdEFFTWrk: missing MOEs and CVs for some, and different top-coded values (the Census Bureau has 250,000 as the top-code for its geographies, while the API database shows 250,001; examples: 1007900, 1011401, 1006900, etc.).
  • Housing: the issue with the rooms section: MOE, CV, percent, and percent MOE appear as null when they should have values.
SELECT * FROM (
    SELECT * FROM (
        SELECT *,
            (CASE WHEN ac IS NULL THEN 1 ELSE 0 END) + (CASE WHEN bc IS NULL THEN 1 ELSE 0 END) AS c_,
            (CASE WHEN ae IS NULL THEN 1 ELSE 0 END) + (CASE WHEN be IS NULL THEN 1 ELSE 0 END) AS e_,
            (CASE WHEN am IS NULL THEN 1 ELSE 0 END) + (CASE WHEN bm IS NULL THEN 1 ELSE 0 END) AS m_,
            (CASE WHEN ap IS NULL THEN 1 ELSE 0 END) + (CASE WHEN bp IS NULL THEN 1 ELSE 0 END) AS p_,
            (CASE WHEN az IS NULL THEN 1 ELSE 0 END) + (CASE WHEN bz IS NULL THEN 1 ELSE 0 END) AS z_
        FROM (
            SELECT a.geoid, a.variable,
                a.c AS ac, a.e AS ae, a.m AS am, a.p AS ap, a.z AS az,
                b.c AS bc, b.e AS be, b.m AS bm, b.p AS bp, b.z AS bz
            FROM pff_housing."Y2014-2018" a
            JOIN pff_housing."Y2014-2018-erica" b
            ON a.variable = b.variable AND a.geoid = b.geoid) c) d
    WHERE c_ = 1 OR e_ = 1 OR m_ = 1 OR p_ = 1 OR z_ = 1) e
WHERE bm IS NOT NULL AND am IS NULL;


SELECT * FROM (
    SELECT a.geoid, a.variable,
        round(a.c - b.c) AS c_diff,
        round(a.e - b.e) AS e_diff,
        round(a.m - b.m) AS m_diff,
        round(a.p - b.p) AS p_diff,
        round(a.z - b.z) AS z_diff,
        -- gap columns are spacers for readability
        '' AS gap0, a.c, b.c AS c_,
        '' AS gap1, a.e, b.e AS e_,
        '' AS gap2, a.m, b.m AS m_,
        '' AS gap3, a.p, b.p AS p_,
        '' AS gap4, a.z, b.z AS z_
    FROM pf."Y2014-2018-erica" a
    RIGHT JOIN pff_economic."Y2014-2018" b
    ON a.geoid = b.geoid AND a.variable = b.variable
    WHERE a.c != b.c
        OR a.e != b.e
        OR a.m != b.m
        OR a.p != b.p
        OR a.z != b.z) dif
WHERE c_diff != 0
    OR e_diff != 0
    OR m_diff != 0
    OR p_diff != 0
    OR z_diff != 0;

rounding

General rule:

  Measure      Decimal places (unless otherwise noted)
  Estimate     0
  MOE          0
  CV           1
  Percent      1
  Percent MOE  1

Special rule:

  Name       Decimal places (E & M)
  MdAge      1
  AvgHHSz    2
  AvgFmSz    2
  MnTrvTm    1
  HOVacRt    1
  RntVacRt   1
  MdRms      1
  AvgHHSOOc  2
  AvgHHSROc  2

python package for calculating pff

Ideas:

  1. One metadata file for all the different variables, containing all the information we need, e.g.:
{
    'pff_variable' : 'pop25t29',
    'acs_variable' : [
            "B01001_035",
            "B01001_011"
        ],
    'domain':'demographic',
    'base_variable':'pop_1',
    'rounding':'2'
}
  2. We should use an existing Python package to pull the data instead of using requests; this would make things a lot easier:
import os

import pandas as pd
from census import Census

c = Census(os.environ['API_KEY'])

# variables: list of ACS variable names to pull
df = pd.DataFrame(c.acs5.get(('NAME', ','.join(variables)),
                             {'for': 'block group:*', 'in': 'state:36 county:081'},
                             year=2018))
  3. We would need to create one master spatial lookup table or object, e.g. with both the census geoid and the boro-ct:
{
    'geotype':'NTA2010',
    'pff_geoname':'BK01',
    'pff_geoid':'BK01',
    'acs_geoid':''
}, 
{
    'geotype':'CT2010',
    'pff_geoname':'QN43',
    'pff_geoid':'4157102',
    'acs_geoid':'36081157102'
}

User experience

from pff import Pff
pff = Pff(api_key='XXXXXXXXX')

# for an NTA-level calculation, this should pull tract-level data in the
# background and aggregate it to the NTA level
pop25t29 = pff.calculate(pff_variable='pop25t29', geotype='NTA', year='2018')
pop25t29.head()

It should return the following fields: geotype, geoname, geoid, dataset, variable, c, e, m, p, z.
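Under that design, the tract-to-NTA aggregation behind calculate could look like this minimal sketch (function and column names are hypothetical; MOEs combine as the square root of the sum of squared tract MOEs, which assumes independence across tracts):

import math

import pandas as pd

def aggregate_to_nta(tracts: pd.DataFrame, nta_lookup: dict) -> pd.DataFrame:
    """Aggregate tract-level estimates (e) and MOEs (m) to NTA level.

    tracts has columns geoid, e, m; nta_lookup maps tract geoid -> NTA code.
    """
    df = tracts.assign(nta=tracts["geoid"].map(nta_lookup))
    return df.groupby("nta").agg(
        e=("e", "sum"),
        m=("m", lambda s: math.sqrt((s ** 2).sum())),
    ).reset_index()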

demo - AsnEast: for a lot of demographic variables, we are seeing many 11.0 values in the MOE fields. Should we exclude them?

e.g.

geoid GEO_ID B02015_015M B02015_017M B02015_011M B02015_007M B02015_012M B02015_020M
2034800 1400000US36005034800 16.0 16.0 16.0 113.0 16.0 16.0
2036000 1400000US36005036000 11.0 11.0 11.0 34.0 11.0 11.0
2032600 1400000US36005032600 11.0 11.0 11.0 188.0 11.0 11.0
2031000 1400000US36005031000 11.0 11.0 11.0 50.0 11.0 11.0
2031600 1400000US36005031600 11.0 11.0 6.0 20.0 11.0 11.0
2031800 1400000US36005031800 11.0 11.0 11.0 20.0 11.0 11.0
2031200 1400000US36005031200 11.0 11.0 11.0 47.0 11.0 11.0
2031400 1400000US36005031400 11.0 11.0 11.0 27.0 11.0 11.0
2034200 1400000US36005034200 11.0 11.0 11.0 11.0 11.0 11.0
2034400 1400000US36005034400 11.0 11.0 11.0 11.0 11.0 11.0
2035000 1400000US36005035000 11.0 11.0 11.0 11.0 11.0 11.0

If the 11s are taken into the calculation, we get an MOE for BX21 of 251; if not, we get 238, which is closer to what Population got (236).

Check that all instances of collapsing variables sum independent variables

The function get_m in 03_calculation.py assumes that the two variables being collapsed are independent. For example, creating a variable describing the count of 70-74 year olds by summing counts of male 70-74 year olds and female 70-74 year olds is an appropriate usage since these are mutually exclusive.

Check the _meta_lookup.json files to make sure that this is consistently the case.
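A quick way to surface every case that needs checking is to list each PFF variable that collapses more than one ACS variable, assuming the _meta_lookup.json files map a PFF variable to its list of ACS variables (as in the DfHsUS example above; the file location is an assumption):

import json
from pathlib import Path

# print every pff variable built from more than one ACS variable,
# so each can be checked by hand for independence
for path in Path(".").glob("*_meta_lookup.json"):
    lookup = json.loads(path.read_text())
    for pff_var, acs_vars in lookup.items():
        if isinstance(acs_vars, list) and len(acs_vars) > 1:
            print(path.name, pff_var, acs_vars)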

Social columns wildly off

The following columns have values that are wildly off:

C1864DVsn
C65plDAmb
C1864DCog
C65plDSCr
C1864DILD
C65plDCog
CU18DSCr
C1864DAmb
CU18DAmb
CU18DVsn
C65plDHrg
C65plDVsn
CU18DHrg
C1864DSCr
CU18DCog
C1864DHrg
C65plDILD

In terms of methodology they are fine, so potential explanations are:

  1. the field mapping is wrong
  2. they require a special calculation

Y2006-2010 row count doesn't match Y2014-2018

In PFF production, the row count for Y2006-2010 demographic data is 380,411:

select count(dataset) from demographic where dataset = 'Y2006-2010';

However, in EDM's database, the count for Y2006-2010 demographic is 314,990.

It seems like there are some missing variables in this output, and I'm not sure why.
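A per-variable count comparison would narrow down which variables are missing; a sketch, assuming both databases are reachable via SQLAlchemy (the connection strings are placeholders):

import pandas as pd
import sqlalchemy

QUERY = """
    select variable, count(*) as n
    from demographic
    where dataset = 'Y2006-2010'
    group by variable
"""

prod = pd.read_sql(QUERY, sqlalchemy.create_engine("postgresql://<pff-production>"))
edm = pd.read_sql(QUERY, sqlalchemy.create_engine("postgresql://<edm-database>"))

# variables present in production but missing or undercounted in EDM
diff = prod.merge(edm, on="variable", how="left", suffixes=("_prod", "_edm"))
print(diff[diff["n_prod"] != diff["n_edm"]])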

Still seeing discrepancies in housing and social

RntVacRt

df.loc[df.geotype=='NTA2010','rntvacrte'] = 100*df['vacrnte']/df['rntvacue']
df.loc[df.geotype=='NTA2010','rntvacrtm'] = df.apply(lambda row: hovacrtm(row['rntvacue'], row['vacrnte'], row['vacrntm'], row['rntvacum']), axis=1)

Confirm with Population that in

rntvacrte = IF(RntVacE=0,"",VacRntE/RntVacE*100)

RntVacE should be RntVacuE.

HOVacRt

Is the formula wrong? Confirm with Population. Given:

HOVacRtE = IF(VacSaleE=0,"",HOVacUE/VacSaleE*100)

but it should presumably be:

HOVacRtE = IF(VacSaleE=0,"",VacSaleE/HOVacUE*100)?

Using the example values from the attached screenshot:

  • with the given formula we get HOVacRtE = 3627
  • with the corrected formula we get HOVacRtE = 2.75
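A quick check with hypothetical counts (55 vacant-for-sale units in a homeowner universe of 2,000) shows why the given direction is suspect:

def hovacrte(vacsale_e: float, hovacu_e: float):
    """Homeowner vacancy rate: vacant-for-sale units as a share of the
    homeowner vacancy universe (the 'should be' direction)."""
    if hovacu_e == 0:
        return None
    return vacsale_e / hovacu_e * 100

print(hovacrte(55, 2000))   # 2.75  -- a plausible vacancy rate
print(2000 / 55 * 100)      # ~3636 -- the given, inverted direction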
