
db-acs's Introduction

db-acs

Instructions:

  1. Go to https://api.census.gov/data/key_signup.html to sign up for an API key, and set the environment variable API_KEY (e.g. export API_KEY=<your key>)

  2. Run the scripts in order

    python3 01_download.py
    python3 02_nta.py
    python3 03_calculation.py
    python3 03.1_special_calculations.py
    python3 03.2_recalculate.py
    python3 05_pivot.py
  3. The final outputs will be saved as the following:

    demo_final_pivoted.csv  
    econ_final_pivoted.csv  
    hous_final_pivoted.csv  
    soci_final_pivoted.csv


db-acs's Issues

Change intermediate table naming -- low priority

Intermediate tables have suffixes that aren't informative: no suffix, _intermediate, _final, _final1, _final2, _final_pivoted

Proposed naming convention:

  • _acs for the table containing concatenated, downloaded ACS variables for a particular category (currently has no suffix)

  • _nta_sums for the table of nta-level aggregated estimates and MOEs (currently _intermediate)

  • _nta_combos for the table of nta-level aggregated estimates and MOEs, which includes new variables that are combinations of other variables -- either sums or proportions -- and checks for controlled variables (currently _final)

  • _nta_meds for the table of nta-level aggregated estimates and MOEs, which includes interpolated medians and special variables (currently _final1)

  • _nta_recalc for tables of nta-level aggregated estimates and MOEs, which have been recalculated in case the denominators have changed in the previous two tables (currently _final2)

  • _nta_pivoted for tables that have complete, nta-level aggregated estimates and MOEs, where the columns give the geography type/ID, the table and variable represented, and the estimate and MOE for that variable (currently _final_pivoted, which appears to be a pivot of _final rather than of _final2)

Estimation that involves subtraction

I noticed that for some variables, such as WrkrNotHm, the estimation involves subtracting values rather than adding them, as most variables do. In this case:

WrkrNotHm = DP03_0018E - DP03_0024E
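For reference, Census Bureau ACS guidance uses the same MOE formula for a difference of independent estimates as for a sum: the square root of the sum of the squared MOEs. A minimal sketch (the function name and the sample MOEs are illustrative, not from this repo):

import math

def moe_of_difference(moe_a: float, moe_b: float) -> float:
    """MOE of (A - B) for independent estimates A and B; per ACS
    guidance this is the same formula as for a sum."""
    return math.sqrt(moe_a ** 2 + moe_b ** 2)

# WrkrNotHm's estimate is DP03_0018E - DP03_0024E, so its MOE combines
# DP03_0018M and DP03_0024M (the MOE values below are hypothetical)
wrkrnothm_m = moe_of_difference(120.0, 45.0)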

2018 update

Updates:

  • EA_BchDH is currently being calculated; it can now be pulled from the API via DP02_0067
  • Mv10Ltr will now be pulled from DP04_0051, DP04_0052, DP04_0053
  • Mv90t99 will now be pulled from DP04_0054
  • Mv80t89 will be renamed MvBf89 and pulled from DP04_0056
  • MvBf79 will be removed

Refer to 4b144c0 for metadata updates.

Catch controlled variables

For variables that contain controlled variables, CV and MOE should be 0 instead of NULL. We need to figure out how to identify them.
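A minimal sketch of one way to handle this once the controlled variables are identified (the variable list is hypothetical; column names follow the c/e/m/p/z schema used elsewhere here):

import pandas as pd

# hypothetical list; the real set of controlled variables still
# needs to be identified
CONTROLLED_VARIABLES = {"Pop_1", "Pop65pl"}

def zero_out_controlled(df: pd.DataFrame) -> pd.DataFrame:
    """Set MOE (m) and CV (c) to 0 instead of NULL for controlled estimates."""
    mask = df["variable"].isin(CONTROLLED_VARIABLES)
    df.loc[mask, ["m", "c"]] = 0
    return df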

calculation errors on population's end?

In the social table:
for geoid = 1015100 and variable=DfHsUS

  • population PE: 88.1
  • EDM PE: 12
  • population PM: 54.7
  • EDM PM: 4.6

Based on the variable lookup:

"DfHsUS": [
        "DP02_0080"
    ],

Pulling the pre-calculation data:

>>> dff = df.loc[df.GEO_ID=='1400000US36061015100', ['DP02_0080E', 'DP02_0080M', 'DP02_0080PE', 'DP02_0080PM']]
>>> dff
      DP02_0080E  DP02_0080M  DP02_0080PE  DP02_0080PM
2283      1000.0       402.0         12.0          4.6

In this case we take PE and PM directly from the ACS without calculation, so this looks like a mistake on Population's end (assuming the variable mapping is correct).

2018 update feedback

  • Demographic:

    • OK with the MdAge differences.

  • Social:

    • Updated disability data is in the attached file. The data I have for EA_BchDH matches the downloaded data; it looks like all of the estimates for this one from the error sheet are doubled.

  • Economic:

    • There's something off with the CVs for MdNFInc; is there a separate formula for that variable? The poverty variable errors should be fixed with the updated econ sheet I sent yesterday.

    • There are large differences in estimates for MN99 and QN99 for MdEMFTWrk, and differences in MOEs for BX98 for MdEMFTWrk and BK99 for MdNFInc.

  • Housing:

    • BK93 and QN99 have large differences in MOE for MdVl, and BK99 has a difference in estimate for MdVl.

percentage estimate set to 100

WITH a AS (
    SELECT variable, count(*) AS count
    FROM pff_social."Y2013-2017-old"
    WHERE p = 100
    GROUP BY variable),
b AS (
    SELECT variable, count(*) AS count
    FROM pff_social."Y2013-2017"
    WHERE p = 100
    GROUP BY variable)
SELECT a.variable, a.count AS old_count, b.count AS new_count
FROM a, b
WHERE a.variable = b.variable;

From this query we can see that, for some variables, the number of records with a percentage estimate of 100 differs substantially between the old and new tables.

We need to figure out whether P should be set to 100 for certain variables by definition, and if so, how to identify those variables.

New total lookup function

The majority of the mismatches are caused by differences in P and Z. We need a lookup table so that, for each variable, we can find the correct total to calculate the percentage against, ideally for every variable. Proposed structure:

"variable_1":{ "type": "total", "total_variable":"N/A"}, 
"variable_2":{"type": "not total", "total_variable":"variable_1"}

We would then calculate all the total variables first, and the non-total variables after, to get the correct P and Z. This is also convenient because it would give us more accurate P=100 values.
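A minimal sketch of that two-pass calculation, using the proposed lookup structure (the function name is illustrative; the percent-MOE formula, including the fallback when the value under the square root is negative, follows standard ACS guidance):

import math

lookup = {
    "variable_1": {"type": "total", "total_variable": "N/A"},
    "variable_2": {"type": "not total", "total_variable": "variable_1"},
}

def calculate_p_z(estimates, moes, lookup):
    """Two passes: totals first (P=100), then non-totals against their total."""
    p, z = {}, {}
    # pass 1: total variables are their own universe
    for var, meta in lookup.items():
        if meta["type"] == "total":
            p[var], z[var] = 100.0, None
    # pass 2: non-total variables, as percentages of their total
    for var, meta in lookup.items():
        if meta["type"] == "total":
            continue
        total = meta["total_variable"]
        e, agg = estimates[var], estimates[total]
        if not agg:
            p[var], z[var] = None, None
            continue
        ratio = e / agg
        under = moes[var] ** 2 - ratio ** 2 * moes[total] ** 2
        if under < 0:  # ACS guidance: fall back to the ratio formula
            under = moes[var] ** 2 + ratio ** 2 * moes[total] ** 2
        p[var] = 100.0 * ratio
        z[var] = 100.0 * math.sqrt(under) / agg
    return p, z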

2014-2018 new issues

To have all of the errors I've found so far in my checks in one place:

  • Overall problem of data not appearing where I have data: the database shows 0 while the API pull shows null.
  • Social: abroad, DfHsDfCnt, DfHsSmCnt, DfHsUS: the percent and percent MOE columns should be calculated off of DfHs2, not pulled directly from the Census Bureau's data (which I think is the problem).
  • Economic: there may be some issues with MdEWrk, MdEFFTWrk, MdEFFTWrk: missing MOEs and CVs for some, and different top-coded values (the Census Bureau has 250,000 as the top-code for its geographies, while the API database shows 250,001; examples: 1007900, 1011401, 1006900, etc.).
  • Housing: the issue with the rooms section: MOE, CV, percent, and percent MOE appear as null when they should have values.
SELECT * FROM (
    SELECT * FROM (
        SELECT *,
            (CASE WHEN ac IS NULL THEN 1 ELSE 0 END) + (CASE WHEN bc IS NULL THEN 1 ELSE 0 END) AS c_,
            (CASE WHEN ae IS NULL THEN 1 ELSE 0 END) + (CASE WHEN be IS NULL THEN 1 ELSE 0 END) AS e_,
            (CASE WHEN am IS NULL THEN 1 ELSE 0 END) + (CASE WHEN bm IS NULL THEN 1 ELSE 0 END) AS m_,
            (CASE WHEN ap IS NULL THEN 1 ELSE 0 END) + (CASE WHEN bp IS NULL THEN 1 ELSE 0 END) AS p_,
            (CASE WHEN az IS NULL THEN 1 ELSE 0 END) + (CASE WHEN bz IS NULL THEN 1 ELSE 0 END) AS z_
        FROM (
            SELECT a.geoid, a.variable,
                a.c AS ac, a.e AS ae, a.m AS am, a.p AS ap, a.z AS az,
                b.c AS bc, b.e AS be, b.m AS bm, b.p AS bp, b.z AS bz
            FROM pff_housing."Y2014-2018" a
            JOIN pff_housing."Y2014-2018-erica" b
            ON a.variable = b.variable AND a.geoid = b.geoid) c) d
    WHERE c_ = 1 OR e_ = 1 OR m_ = 1 OR p_ = 1 OR z_ = 1) e
WHERE bm IS NOT NULL AND am IS NULL;


SELECT * FROM (
    SELECT a.geoid, a.variable,
        round(a.c - b.c) AS c_diff,
        round(a.e - b.e) AS e_diff,
        round(a.m - b.m) AS m_diff,
        round(a.p - b.p) AS p_diff,
        round(a.z - b.z) AS z_diff,
        -- gap columns are spacers for readability
        '' AS gap0, a.c, b.c AS c_,
        '' AS gap1, a.e, b.e AS e_,
        '' AS gap2, a.m, b.m AS m_,
        '' AS gap3, a.p, b.p AS p_,
        '' AS gap4, a.z, b.z AS z_
    FROM pf."Y2014-2018-erica" a
    RIGHT JOIN pff_economic."Y2014-2018" b
    ON a.geoid = b.geoid AND a.variable = b.variable
    WHERE a.c != b.c
        OR a.e != b.e
        OR a.m != b.m
        OR a.p != b.p
        OR a.z != b.z) dif
WHERE c_diff != 0
    OR e_diff != 0
    OR m_diff != 0
    OR p_diff != 0
    OR z_diff != 0;

rounding

General rule:

  Measure      Decimal places (unless otherwise noted)
  Estimate     0
  MOE          0
  CV           1
  Percent      1
  Percent MOE  1

Special rule:

  Name       Decimal places (E & M)
  MdAge      1
  AvgHHSz    2
  AvgFmSz    2
  MnTrvTm    1
  HOVacRt    1
  RntVacRt   1
  MdRms      1
  AvgHHSOOc  2
  AvgHHSROc  2

python package for calculating pff

Ideas:

  1. One metadata file for all the different variables, containing all the information we need, e.g.:
{
    'pff_variable' : 'pop25t29',
    'acs_variable' : [
            "B01001_035",
            "B01001_011"
        ],
    'domain':'demographic',
    'base_variable':'pop_1',
    'rounding':'2'
}
  2. We should use an existing Python package to pull the data instead of using requests; this would make things a lot easier:
import os

import pandas as pd
from census import Census

c = Census(os.environ['API_KEY'])

# variables: list of ACS variable names to pull
df = pd.DataFrame(c.acs5.get(('NAME', ','.join(variables)),
                             {'for': 'block group:*', 'in': 'state:36 county:081'},
                             year=2018))
  3. We would need to create one master spatial lookup table or object, e.g. with both the census geoid and the boro-ct:
{
    'geotype':'NTA2010',
    'pff_geoname':'BK01',
    'pff_geoid':'BK01',
    'acs_geoid':''
}, 
{
    'geotype':'CT2010',
    'pff_geoname':'QN43',
    'pff_geoid':'4157102',
    'acs_geoid':'36081157102'
}

User experience

from pff import Pff
pff = Pff(api_key='XXXXXXXXX')

# for an NTA-level calculation, this should pull tract-level data in the
# background and aggregate it to the NTA level
pop25t29 = pff.calculate(pff_variable='pop25t29', geotype='NTA', year='2018')
pop25t29.head()

It should return the following fields: geotype, geoname, geoid, dataset, variable, c, e, m, p, z.
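Under that design, the tract-to-NTA aggregation behind calculate could look like this minimal sketch (function and column names are hypothetical; MOEs combine as the square root of the sum of squared tract MOEs, which assumes independence across tracts):

import math

import pandas as pd

def aggregate_to_nta(tracts: pd.DataFrame, nta_lookup: dict) -> pd.DataFrame:
    """Aggregate tract-level estimates (e) and MOEs (m) to NTA level.

    tracts has columns geoid, e, m; nta_lookup maps tract geoid -> NTA code.
    """
    df = tracts.assign(nta=tracts["geoid"].map(nta_lookup))
    return df.groupby("nta").agg(
        e=("e", "sum"),
        m=("m", lambda s: math.sqrt((s ** 2).sum())),
    ).reset_index()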

demo - AsnEast: for a lot of demographic variables, we are seeing many 11.0 values in the MOE fields. Should we exclude them?

e.g.

geoid GEO_ID B02015_015M B02015_017M B02015_011M B02015_007M B02015_012M B02015_020M
2034800 1400000US36005034800 16.0 16.0 16.0 113.0 16.0 16.0
2036000 1400000US36005036000 11.0 11.0 11.0 34.0 11.0 11.0
2032600 1400000US36005032600 11.0 11.0 11.0 188.0 11.0 11.0
2031000 1400000US36005031000 11.0 11.0 11.0 50.0 11.0 11.0
2031600 1400000US36005031600 11.0 11.0 6.0 20.0 11.0 11.0
2031800 1400000US36005031800 11.0 11.0 11.0 20.0 11.0 11.0
2031200 1400000US36005031200 11.0 11.0 11.0 47.0 11.0 11.0
2031400 1400000US36005031400 11.0 11.0 11.0 27.0 11.0 11.0
2034200 1400000US36005034200 11.0 11.0 11.0 11.0 11.0 11.0
2034400 1400000US36005034400 11.0 11.0 11.0 11.0 11.0 11.0
2035000 1400000US36005035000 11.0 11.0 11.0 11.0 11.0 11.0

If the 11s are taken into the calculation, we get an MOE for BX21 of 251; if not, we get 238, which is closer to what Population got (236).

Check that all instances of collapsing variables sum independent variables

The function get_m in 03_calculation.py assumes that the two variables being collapsed are independent. For example, creating a variable describing the count of 70-74 year olds by summing counts of male 70-74 year olds and female 70-74 year olds is an appropriate usage since these are mutually exclusive.

Check the _meta_lookup.json files to make sure that this is consistently the case.
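A quick way to surface every case that needs checking is to list each PFF variable that collapses more than one ACS variable, assuming the _meta_lookup.json files map a PFF variable to its list of ACS variables (as in the DfHsUS example above; the file location is an assumption):

import json
from pathlib import Path

# print every pff variable built from more than one ACS variable,
# so each can be checked by hand for independence
for path in Path(".").glob("*_meta_lookup.json"):
    lookup = json.loads(path.read_text())
    for pff_var, acs_vars in lookup.items():
        if isinstance(acs_vars, list) and len(acs_vars) > 1:
            print(path.name, pff_var, acs_vars)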

Social columns wildly off

The following columns have values that are wildly off:

C1864DVsn
C65plDAmb
C1864DCog
C65plDSCr
C1864DILD
C65plDCog
CU18DSCr
C1864DAmb
CU18DAmb
CU18DVsn
C65plDHrg
C65plDVsn
CU18DHrg
C1864DSCr
CU18DCog
C1864DHrg
C65plDILD

In terms of methodology they are fine, so potential explanations are:

  1. the field mapping is wrong
  2. they require a special calculation

Y2006-2010 row count doesn't match Y2014-2018

In PFF production, the row count for Y2006-2010 demographic data is 380,411:

select count(dataset) from demographic where dataset = 'Y2006-2010';

However, in EDM's database, the count for Y2006-2010 demographic is 314,990.

It seems like there are some missing variables in this output, and I'm not sure why.
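A per-variable count comparison would narrow down which variables are missing; a sketch, assuming both databases are reachable via SQLAlchemy (the connection strings are placeholders):

import pandas as pd
import sqlalchemy

QUERY = """
    select variable, count(*) as n
    from demographic
    where dataset = 'Y2006-2010'
    group by variable
"""

prod = pd.read_sql(QUERY, sqlalchemy.create_engine("postgresql://<pff-production>"))
edm = pd.read_sql(QUERY, sqlalchemy.create_engine("postgresql://<edm-database>"))

# variables present in production but missing or undercounted in EDM
diff = prod.merge(edm, on="variable", how="left", suffixes=("_prod", "_edm"))
print(diff[diff["n_prod"] != diff["n_edm"]])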

Still seeing discrepancies in housing and social

RntVacRt

df.loc[df.geotype=='NTA2010','rntvacrte'] = 100*df['vacrnte']/df['rntvacue']
df.loc[df.geotype=='NTA2010','rntvacrtm'] = df.apply(lambda row: hovacrtm(row['rntvacue'], row['vacrnte'], row['vacrntm'], row['rntvacum']), axis=1)

Confirm with Population that in

rntvacrte = IF(RntVacE=0,"",VacRntE/RntVacE*100)

RntVacE should be RntVacuE.

HOVacRt

Is the formula wrong? Confirm with Population. Given:

HOVacRtE = IF(VacSaleE=0,"",HOVacUE/VacSaleE*100)

but it should presumably be:

HOVacRtE = IF(VacSaleE=0,"",VacSaleE/HOVacUE*100)?

Using the example values from the attached screenshot:

  • with the given formula we get HOVacRtE = 3627
  • with the corrected formula we get HOVacRtE = 2.75
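A quick check with hypothetical counts (55 vacant-for-sale units in a homeowner universe of 2,000) shows why the given direction is suspect:

def hovacrte(vacsale_e: float, hovacu_e: float):
    """Homeowner vacancy rate: vacant-for-sale units as a share of the
    homeowner vacancy universe (the 'should be' direction)."""
    if hovacu_e == 0:
        return None
    return vacsale_e / hovacu_e * 100

print(hovacrte(55, 2000))   # 2.75  -- a plausible vacancy rate
print(2000 / 55 * 100)      # ~3636 -- the given, inverted direction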
