
Utilities to analyze ACS PUMS files, especially for distribution regression / ecological inference

License: MIT License


pummeler

This is a set of utilities for analyzing the American Community Survey's Public Use Microdata Sample files (ACS PUMS), mostly following

Flaxman, Wang, and Smola. Who Supported Obama in 2012? Ecological Inference through Distribution Regression. KDD 2015. (official, from author)

Usage

Installation

Now requires Python 3.6 or higher.

Use pip install pummeler; it will then be available via import pummeler, and will install a pummel script.

If you prefer, you can also check out the source directory, which should work as long as you put the root of the checkout on your sys.path (or just start Python from there). In that case you should use the pummel script at the top level of the checkout.
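If you go the checkout route, a minimal sketch (the path is hypothetical):

import sys
sys.path.insert(0, "/path/to/pummeler-checkout")  # the checkout root, which contains pummeler/
import pummeler  # now resolves against the checkout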

Getting the census data

First, download the data from the Census site. You probably want the "csv_pus.zip" file from whatever distribution you're using. The currently supported options are:

  • 2006-10 (2.1 GB); uses 2000 PUMAs.
  • 2007-11 (2.1 GB); uses 2000 PUMAs.
  • The 2012-14 subset of the 2010-14 file (2.3 GB); this is the subset using 2010 PUMAs. (Pass --version 2010-14_12-14 to pummel.)
  • 2015 (595 MB); uses 2010 PUMAs.
  • The 2012-15 subset of the 2011-15 file (2.4 GB); this is the subset using 2010 PUMAs. (Pass --version 2011-15_12-15 to pummel.)
  • 2012-16 (2.3 GB); uses 2010 PUMAs.
  • 2013-17 (2.3 GB); uses 2010 PUMAs.
  • 2014-18 (2.1 GB); uses 2010 PUMAs.

It's relatively easy to add support for new versions; see the VERSIONS dictionary in pummeler.reader.

Picking regions of analysis

Election results are generally reported by counties; PUMS data are in their own special Public Use Microdata Areas, which are related but not the same. This module ships with regions that merge all overlapping block groups / counties (found with the MABLE/Geocorr tool), stored as Pandas dataframes in pummeler/data/regions.h5.

Regions are named like AL_00_01, which means Alabama's region number 01 in the 2000 geography, or WY_10_02, which is Wyoming's second region in the 2010 geography. There are also "superregions" which merge 2000 and 2010 geographies, named like PA_merged_03.
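To poke at these dataframes directly, something like the following should work (a sketch; the HDF5 key names aren't assumed here, so list them first):

import pandas as pd

# Inspect what's stored, then load one table; the key names are whatever
# notebooks/get regions.ipynb saved them as.
with pd.HDFStore("pummeler/data/regions.h5", mode="r") as store:
    print(store.keys())
    regions = store[store.keys()[0]]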

Note: Alaskan electoral districts are weird. For now, I just lumped all of Alaska into one region.

This was done in the Jupyter notebook notebooks/get regions.ipynb. Centroids are calculated in notebooks/region centroids.ipynb, using a shapefile for counties from here.

TODO: Could switch to precinct-level results, which would give more regions. 2012 results are available here, including shapefiles if you go into the state-by-state section, so it shouldn't be too much work. I haven't found national precinct-level results for the 2016 election yet, but maybe somebody's done it.

Preprocessing

First, we need to sort the features by region, and collect statistics about them so we can do the featurization later.

Run pummel sort --version 2006-10 --voters-only -z csv_pus.zip SORT_DIR. (A few extra options are shown if you pass --help.) This will:

  • Make a bunch of files in SORT_DIR like feats_AL_00_01.h5, which contain basically the original features (except with the ADJINC adjustment applied to fields that need it to account for inflation) grouped by region. These are stored in HDF5 format with pandas, because it's much faster and takes less disk space than CSVs. (If you only want state-level analysis, --region-type state will make one file per state; --region-type puma will split per PUMA instead of the default puma_county regions.)

  • Make a file SORT_DIR/stats.npz containing means and standard deviations of the real-valued features, counts of the different values for the categorical features, and a random sample of all the features.

This will take a while (15 minutes to 2 hours, depending on machine and what processing you're doing) and produce about 4 GB of temp data (for the 2006-10 files). Luckily you should only need to do it once per ACS file.
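Once the sort finishes, the outputs can be inspected like this (a sketch; whether read_hdf needs an explicit key, and whether stats.npz holds pickled pandas objects, are assumptions):

import numpy as np
import pandas as pd

# One region's rows; read_hdf finds the table itself if the file holds a single object.
feats = pd.read_hdf("SORT_DIR/feats_AL_00_01.h5")

# The statistics gathered for featurization.
stats = np.load("SORT_DIR/stats.npz", allow_pickle=True)
print(stats.files)  # means/stds of real features, categorical value counts, a row sample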

Although --voters-only is simpler if you're directly replicating the Flaxman et al. paper, --all-people is the default: you can replicate the same effect in featurize by adding --subsets 'AGEP >= 18 & CIT != 5'. (If you want multiple subsets, add that condition to each one as appropriate.)

Featurization

Run pummel featurize SORT_DIR. (Again, you have a couple of options shown by --help.) This will get both linear embeddings (i.e. means) and random Fourier feature embeddings for each region, saving the output in SORT_DIR/embeddings.npz.

You can also get features for demographic subsets with e.g. --subsets 'SEX == 2 & AGEP > 45, SEX == 1 & PINCP < 20000'.

NOTE: As it turns out, with this featurization, linear embeddings seem to be comparable to random Fourier feature embeddings. You can save yourself a bunch of time and the world a smidgen of global warming if you skip them with --skip-rbf.

On my laptop (with a quad-core Haswell i7), doing it with random Fourier features takes about an hour; the only-linear version takes about ten minutes. Make sure you're using a numpy linked to a fast multithreaded BLAS (like MKL or OpenBLAS; the easiest way to do this is to use the Anaconda Python distribution, which includes MKL by default); otherwise, this step will be much slower.

If it's using too much memory, decrease --chunksize.

The original paper used Fastfood transforms instead of the default random Fourier features used here, which with a good implementation will be faster. I'm not currently aware of a high-quality, easily-available Python-friendly implementation. A GPU implementation of regular random Fourier features could also help.
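For reference, here is a minimal sketch of what a random Fourier feature mean embedding computes; this is the standard construction, not necessarily pummeler's exact scaling conventions:

import numpy as np

def rff_mean_embedding(X, freqs):
    # X: (n_people, n_feats) featurized rows for one region
    # freqs: (n_feats, n_freq) random frequencies; for a Gaussian kernel these
    #        would be drawn as np.random.randn(n_feats, n_freq) / bandwidth
    proj = X @ freqs                             # (n_people, n_freq)
    Z = np.hstack([np.cos(proj), np.sin(proj)])  # (n_people, 2 * n_freq)
    Z /= np.sqrt(freqs.shape[1])                 # so Z @ Z.T approximates the kernel matrix
    return Z.mean(axis=0)                        # one row of emb_rff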

SORT_DIR/embeddings.npz, which you can load with np.load, will then have:

  • emb_lin: the n_regions x n_feats array of feature means.
  • emb_rff: the n_regions x (2 * n_freq) array of random Fourier feature embeddings.
  • region_names: the names corresponding to the first axis of the embeddings.
  • feature_names: the names for each used feature.
  • freqs: the n_feats x n_freq array of random frequencies for the random Fourier features.
  • bandwidth: the bandwidth used for selecting the freqs.

(If you did --skip-rbf, emb_rff, freqs, and bandwidth won't be present.)
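Loading the embeddings then looks like this (allow_pickle is an assumption, depending on how the name arrays were saved):

import numpy as np

emb = np.load("SORT_DIR/embeddings.npz", allow_pickle=True)
emb_lin = emb["emb_lin"]             # (n_regions, n_feats)
region_names = emb["region_names"]   # labels for the first axis
feature_names = emb["feature_names"]
if "emb_rff" in emb.files:           # absent if you passed --skip-rbf
    emb_rff = emb["emb_rff"]         # (n_regions, 2 * n_freq)
    freqs = emb["freqs"]             # (n_feats, n_freq)
    bandwidth = emb["bandwidth"]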

Getting the election data

This package includes results derived from huffpostdata/election-2012-results, in pummeler/data/2012-by-region.csv.gz. That data was created in notebooks/election data by region.ipynb.

There doesn't seem to be a good publicly-available county-level election results resource for years prior to 2012. If you get some, follow that notebook to get results in a similar format. (You might have an institutional subscription to CQ Press's election data, for example. That source, though, doesn't use FIPS codes, so it'll be a little more annoying to line up; I might do that at some point.)

TODO: add 2016 election data.

Analysis

For a basic replication of the model from the paper, see notebooks/analyze.ipynb.

pummeler's People

Contributors

djsutherland, flaxter

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

pummeler's Issues

CLI issue

$ python pummel featurize --seed 1 --skip-feats POWSP regions
usage: pummel featurize [-h] [--chunksize LINES] [--skip-rbf]
[--n-freqs N_FREQS] [--bandwidth BANDWIDTH]
[--rff-orthogonal | --rff-normal] [--seed SEED]
[--skip-feats FEAT_NAME [FEAT_NAME ...]]
dir [outfile]
pummel featurize: error: too few arguments

Log transform for US$ variables

Here are the variables I think we should log-transform, all representing income/wages/etc.:

VERSIONS = {
...
    'log_transform_feats': '''INTP OIP PAP RETP SEMP SSIP SSP WAGP PERNP
                            PINCP'''.split(),

The only issue is that some of these variables can be negative (for losses). So I guess the transformation for those should be x = log(x - min(x)) or something?

Once we figure that out it should be easy to put this into get_dummies.
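One hedged option for the negative-values problem, as a sketch (the exact shift is a design choice; log1p keeps the minimum finite, whereas log(x - min(x)) would be -inf at the minimum):

import numpy as np

def shifted_log(x, shift=None):
    # Shift so the smallest value maps to log1p(0) == 0. The shift should be
    # computed once from the training statistics, not per chunk, so the
    # transform stays consistent across regions.
    if shift is None:
        shift = x.min()
    return np.log1p(x - shift)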

allocation flags

Should we be including all of these variables?

FBDSP - Number of bedrooms allocation flag
FBLDP - Units in structure allocation flag
FBUSP - Business or medical office on property allocation flag

etc.

What do they mean?

Error when running `pummel featurize SORT_DIR`

When I ran pummel featurize SORT_DIR I received the following error:

Picking bandwidth by median heuristic...Traceback (most recent call last):
  File "./pummel", line 5, in <module>
    main()
  File "/home/lizette/git_things/pummeler/pummeler/cli.py", line 197, in main
    args.func(args, parser)
  File "/home/lizette/git_things/pummeler/pummeler/cli.py", line 236, in do_featurize
    common_feats=args.common_feats)
  File "/home/lizette/git_things/pummeler/pummeler/featurize.py", line 217, in get_embeddings
    stats, skip_feats=skip_feats)
  File "/home/lizette/git_things/pummeler/pummeler/featurize.py", line 181, in pick_gaussian_bandwidth
    stats['sample'], stats, ret_df=False, skip_feats=skip_feats))
  File "/home/lizette/git_things/pummeler/pummeler/featurize.py", line 49, in get_dummies
    reals[:] -= stats['real_means'][real_feats]
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 727, in wrapper
    dtype=dtype,
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 635, in _construct_result
    return left._constructor(result, index=index, name=name, dtype=dtype)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 248, in __init__
    raise_cast_failure=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/series.py", line 3027, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional

I found a workaround by adding .values to the code at lines 49 and 50 of featurize.py:

reals[:] -= stats['real_means'][real_feats].values
reals[:] /= stats['real_stds'][real_feats].values

Is this a good way to fix this?

Featurization notes

  • CITWP, YOEP, JWMNP: mean-coding blanks might not be the right thing, since blank means the person was born in the US / doesn't work
  • MLP* (when served in military) could probably be simplified; VPS does that
  • NWAB, NWAV, NWLA, NWLK, NWRE are recoded into ESR
  • RELP (relationship to reference person) is kind of a weird feature
  • hierarchical featurization for ANCP / FODP / INDP / NAICSP / OCCP / SOCP?
  • merge ANC1P/ANC2P, RAC1P/RAC2P/...?
  • re-featurize JWAP/JWDP to be circular? (see the sketch after this list)
  • Things that refer to specific in-US places: MIGPUMA, MIGSP, POBP, POWPUMA, POWSP
  • POVPIP: pretty sharp featurization difference between 500 and 501
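On the circular JWAP/JWDP idea above, one standard encoding maps the code onto the unit circle, so the last time bin ends up adjacent to the first; a sketch (the period would come from the actual JWAP/JWDP codebooks, which aren't assumed here):

import numpy as np

def circular_encode(code, period):
    # e.g. for a time-of-day code: 11:59pm and 12:01am become close together
    theta = 2 * np.pi * code / period
    return np.sin(theta), np.cos(theta)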

sort on 2015 data

Something going on with 2015 data:


$ ./pummel sort -z pums_2015.zip sorted_15 --version 2015 
File ss15pusa.csv
/ 0 Elapsed Time: 0:00:00
Traceback (most recent call last):
  File "./pummel", line 5, in <module>
    main()
  File "pummeler-head/pummeler/cli.py", line 112, in main
    args.func(args, parser)
  File "pummeler-head/pummeler/cli.py", line 122, in do_sort
    adj_inc=True, version=args.version, chunksize=args.chunksize)
  File "pummeler-head/pummeler/sort.py", line 63, in sort_by_region
    version=version):
  File "pummeler-head/pummeler/reader.py", line 19, in read_chunks
    for chunk in chunks:
  File "/homes/flaxman/.local/lib/python2.7/site-packages/pandas/io/common.py", line 113, in <lambda>
    BaseIterator.next = lambda self: self.__next__()
  File "/homes/flaxman/.local/lib/python2.7/site-packages/pandas/io/parsers.py", line 915, in __next__
    return self.get_chunk()
  File "/homes/flaxman/.local/lib/python2.7/site-packages/pandas/io/parsers.py", line 971, in get_chunk
    return self.read(nrows=size)
  File "/homes/flaxman/.local/lib/python2.7/site-packages/pandas/io/parsers.py", line 938, in read
    ret = self._engine.read(nrows)
  File "/homes/flaxman/.local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1507, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 846, in pandas.parser.TextReader.read (pandas/parser.c:10364)
  File "pandas/parser.pyx", line 880, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10845)
  File "pandas/parser.pyx", line 922, in pandas.parser.TextReader._read_rows (pandas/parser.c:11386)
  File "pandas/parser.pyx", line 909, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11257)
  File "pandas/parser.pyx", line 2018, in pandas.parser.raise_parser_error (pandas/parser.c:26979)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 284 fields in line 59022, saw 318

Featurization issues

  • MIGPUMA has joint meaning with MIGSP; same for POWPUMA/POWSP.
  • Why does RELP come up so much in the ridge models? What does it mean in practice?

dask?

Seems like this might be a decent use-case for dask.

sorting into states directly is slow

Seems like maybe pandas/pytables append is a lot slower than writing into a new file. (Or else the rewriting-when-strings-are-longer code path is being hit a lot.)

The sort step should probably pre-count lines per PUMA in stats, and maybe max string lengths for the things that need that. Then we can preallocate file sizes and write into them, instead of appending.

Probably should (also?) consider using feather or parquet instead of hdf5.
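For the HDF5 route, pandas/PyTables can at least be given size hints up front; a sketch, where counts and max_len would come from the pre-counting pass proposed above (both names hypothetical):

import pandas as pd

with pd.HDFStore("feats_AL_00_01.h5", mode="w") as store:
    for chunk in chunks:  # chunks: an iterator of DataFrames for this region (assumed)
        store.append(
            "df", chunk, format="table",
            expectedrows=counts["AL_00_01"],         # preallocation hint to PyTables
            min_itemsize={"OCCP": max_len["OCCP"]},  # fix string widths up front, avoiding
        )                                            # rewrite-when-strings-are-longer churn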

Are we excluding non-voters?

In the KDD paper I excluded those under 18 and non-citizens, but since we're now thinking about "group-based" modeling I think it's fine to include all residents, meaning kids and non-citizens, too. Just wanted to check what's happening in the code currently.

@dougalsutherland

warnings/errors from featurize

when I run:


# ./pummel featurize --seed 17 --subsets "DEAR == 1, DEYE == 1, DOUT == 1, DRAT == 2 | DRAT == 3 | DRAT == 4 | DRAT == 5, DREM == 1, ENG == 2, ENG == 3 | ENG == 4, FER == 1, GCL == 1, GCR == 1, HINS2 == 1, HINS3 == 1, HINS4 == 1, HINS5 == 1, HINS6 == 1, HINS7 == 1" --do-my-additive --skip-rbf regions regions/socio.npz 
  0% (      0 of 9222637) |                                                                                              | Elapsed Time: 0:00:00 ETA:  --:--:--
/home/ubuntu/pummeler/pummeler/featurize.py:534: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['NAICSP'] = df.NAICSP.map(naics_cat, na_action='ignore')
/home/ubuntu/pummeler/pummeler/featurize.py:538: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['OCCP'] = df.OCCP.astype(float).map(occ_cat, na_action='ignore')
/home/ubuntu/pummeler/pummeler/featurize.py:544: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['FOD1P'] = df.FOD1P.map(fod_cats, na_action='ignore')
/home/ubuntu/pummeler/pummeler/featurize.py:546: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['FOD2P'] = df.FOD2P.map(fod_cats, na_action='ignore')
/home/ubuntu/pummeler/pummeler/featurize.py:549: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['ANYHISP'] = (df.HISP > 1).astype(int)
/home/ubuntu/pummeler/pummeler/featurize.py:551: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df['HASDEGREE'] = (df.SCHL >= 20).astype(int)
/home/ubuntu/pummeler/pummeler/featurize.py:556: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  'hispanic').map(_ethnicity_map)
 14% (1360078 of 9222637) |##############                                                                                  | Elapsed Time: 0:02:50 ETA: 0:17:56

Also, one of my featurization runs crashed at the end, maybe because I was out of tmp space? Do you know how to tell Python to use a different tmp directory? Or maybe I'll just disable the savez_compressed and not put these into Dropbox.

Traceback (most recent call last):
  File "./pummel", line 5, in <module>
    main()
  File "/home/ubuntu/pummeler/pummeler/cli.py", line 153, in main
    args.func(args, parser)
  File "/home/ubuntu/pummeler/pummeler/cli.py", line 185, in do_featurize
    np.savez_compressed(args.outfile, **res)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 600, in savez_compressed
    _savez(file, args, kwds, True)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 639, in _savez
    pickle_kwargs=pickle_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/format.py", line 584, in write_array

faster featurizer

The old Cython featurizer only took two minutes on low1 once dummies had been created; this new one takes two hours. Dunno how long dummies took, but not two hours.

What's the significant difference? Speed that up....

Efficient Bayesian ridge regression

Using 100 KDE features and all the categorical variables, I end up with a dataset that's 840x6578, so I'm inclined to do ridge regression. I tried to implement it in Stan, but it's taking forever to sample the 6578 parameters, I think because there's so much correlation among the covariates. One trick that speeds things up greatly is to do a QR decomposition in R (see stan-dev/rstanarm#30) and then learn 840 parameters based on an orthogonal design matrix. This works great, but I'm not sure how to recover the original 6578 parameters. But in any case, this might all be moot, as even with 6578 parameters I don't really know how to read off something like effect sizes / percent of variance explained. So now I'm thinking I should just do good old-fashioned forward stepwise regression. But even that is slow with Stan; at the moment I've got a matrix that's 840x100, because I want to use the 100 KDE features for the age variable.

So should I forget Stan? Or do something else? Interestingly, the GP regression perspective on this is efficient in Stan: considering a linear kernel, the covariance K becomes 840x840 and you just sample the observations:

y ~ N(0, K + \sigma^2 I)

But again, this doesn't give us a clear interpretation of which variables are important. So maybe I should really just do group lasso? I guess there are some group lasso packages in R to try...
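For the point estimate at least, the "how to recover the original 6578 parameters" question has a cheap answer: the dual (kernel) form of ridge regression gets all of them back from an n x n solve, the same linear-kernel trick as the GP view above. A sketch (this gives the ridge/MAP solution, not the full posterior):

import numpy as np

def ridge_dual(X, y, lam):
    # w = X.T @ (X X^T + lam I)^{-1} y is exactly the usual ridge solution,
    # so all 6578 coefficients come back without factoring a p x p matrix.
    n = X.shape[0]
    K = X @ X.T  # 840 x 840 linear kernel
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    return X.T @ alpha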

new subsets feature crashes when zero or one subsets used

No subsets:

ubuntu@ip-172-31-47-233:~/pummeler$ ./pummel featurize regions
Picking bandwidth by median heuristic...picked 10.0905135193
  0% (      0 of 9222637) |                                                                                              | Elapsed Time: 0:00:00 ETA:  --:--:--
Traceback (most recent call last):
  File "./pummel", line 5, in <module>
    main()
  File "/home/ubuntu/pummeler/pummeler/cli.py", line 116, in main
    args.func(args, parser)
  File "/home/ubuntu/pummeler/pummeler/cli.py", line 148, in do_featurize
    subsets=args.subsets)
  File "/home/ubuntu/pummeler/pummeler/featurize.py", line 206, in get_embeddings
    c = c.loc[keep]
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1311, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1481, in _getitem_axis
    self._has_valid_type(key, axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1418, in _has_valid_type
    error()
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1405, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: 'the label [True] is not in the [index]'
Closing remaining open files:regions/feats_SC_10_06.h5...done

One subset:

ubuntu@ip-172-31-47-233:~/pummeler$ ./pummel featurize --subsets "SEX == 1" regions 
Picking bandwidth by median heuristic...picked 10.0905135193
  0% (      0 of 9222637) |                                                                                              | Elapsed Time: 0:00:00 ETA:  --:--:--
Traceback (most recent call last):
  File "./pummel", line 5, in <module>
    main()
  File "/home/ubuntu/pummeler/pummeler/cli.py", line 116, in main
    args.func(args, parser)
  File "/home/ubuntu/pummeler/pummeler/cli.py", line 148, in do_featurize
    subsets=args.subsets)
  File "/home/ubuntu/pummeler/pummeler/featurize.py", line 206, in get_embeddings
    c = c.loc[keep]
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1311, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1481, in _getitem_axis
    self._has_valid_type(key, axis)
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1418, in _has_valid_type
    error()
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1405, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: 'the label [True] is not in the [index]'
Closing remaining open files:regions/feats_SC_10_06.h5...done

election data

  • Put in instructions about getting the CQ Press data file
  • Make it work with HuffPo results
