
doppelganger's People

Contributors

ahardjasa, anthonylouisburns, bnaul, dosinga, kaelgreco, katbusch, mogeng, nikisix, sidewalklabs-replica, stefanobaghino, zihenglin


doppelganger's Issues

nosetests

nosetests do not pass using Python 3.6.3.

Preferred communication medium?

@DavidOry requested 'reaching out' via GitHub. I have a non-technical query regarding highlighting Doppelgänger to an audience in Australia. How might I best proceed with my communication?

test_balance_cvx_relaxed failing on new versions of cvxpy

    def test_balance_cvx_relaxed(self):
        hh_table, A, w, mu, expected_weights = self._mock_list_relaxed()
        hh_weights, _ = listbalancer.balance_cvx(hh_table, A, w, mu)
        np.testing.assert_allclose(
>           hh_weights, expected_weights, rtol=0.01, atol=0)
E       AssertionError: 
E       Not equal to tolerance rtol=0.01, atol=0
E       
E       (mismatch 100.0%)
E        x: matrix([[ 29.798231],
E               [ 37.155819],
E               [ 55.549788],
E               [157.820203]])
E        y: matrix([[45.],
E               [52.],
E               [65.],
E               [98.]])

test/test_listbalancer.py:220: AssertionError

I've tried downgrading cvxpy, numpy, and pandas to previously working versions, but those older versions no longer install cleanly, so there is currently no way to get this into a working state.

Python 3 compatibility

I'm not sure if Python 3 support is in your plans, but if so, I ran into a few issues when trying to run this package under Python 3.6:

  • unicode (used in inputs.py) is not supported in Python 3. I think this can be addressed by adding from builtins import str and replacing unicode with str.

  • xrange (used in bayesnets.py) is not supported in Python 3. Could either do from builtins import range and replace xrange with range or use from past.builtins import xrange.
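The two suggestions above can be combined into one small shim. On Python 3 the builtins module is part of the standard library; on Python 2 it is supplied by the `future` package (pip install future):

```python
# Sketch of the suggested compatibility shim, not a tested patch to the repo.
from builtins import str, range

text = str(u"household_id")  # replaces Python 2's unicode(...) in inputs.py
total = sum(range(5))        # replaces Python 2's xrange(...) in bayesnets.py
print(text, total)
```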

Add state & county IDs

Right now we only include the tract part of the ID, which isn't guaranteed to be unique (tract codes repeat across counties). The state and county IDs should be threaded through the code and included in the household_id.
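For reference, a full census-tract GEOID concatenates the 2-digit state FIPS, 3-digit county FIPS, and 6-digit tract code into 11 digits. A hypothetical helper (not part of doppelganger's current API) might look like:

```python
def full_tract_id(state_fips, county_fips, tract_code):
    """Zero-pad and concatenate FIPS components into the 11-digit tract GEOID.

    Hypothetical helper for illustration, not doppelganger's implementation.
    """
    return "{:0>2}{:0>3}{:0>6}".format(state_fips, county_fips, tract_code)

print(full_tract_id("6", "75", "023001"))  # -> 06075023001
```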

Pomegranate issues on CentOS 7.3

I'm trying to get my setup working and am running the pip install command on a CentOS server.
I checked my gcc version; it should not require an upgrade:

[camidus@urbanizer doppelganger]$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)

Errors returned below:

gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/base.c -o build/temp.linux-x86_64-2.7/pomegranate/base.o
In file included from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1809:0,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pomegranate/base.c:444:
/usr/lib64/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
gcc -pthread -shared -Wl,-z,relro build/temp.linux-x86_64-2.7/pomegranate/base.o -L. -lpython2.7 -o build/lib.linux-x86_64-2.7/pomegranate/base.so
building 'pomegranate.BayesianNetwork' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/BayesianNetwork.c -o build/temp.linux-x86_64-2.7/pomegranate/BayesianNetwork.o
In file included from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1809:0,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pomegranate/BayesianNetwork.c:444:
/usr/lib64/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
gcc -pthread -shared -Wl,-z,relro build/temp.linux-x86_64-2.7/pomegranate/BayesianNetwork.o -L. -lpython2.7 -o build/lib.linux-x86_64-2.7/pomegranate/BayesianNetwork.so
building 'pomegranate.FactorGraph' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/FactorGraph.c -o build/temp.linux-x86_64-2.7/pomegranate/FactorGraph.o
In file included from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1809:0,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pomegranate/FactorGraph.c:444:
/usr/lib64/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
gcc -pthread -shared -Wl,-z,relro build/temp.linux-x86_64-2.7/pomegranate/FactorGraph.o -L. -lpython2.7 -o build/lib.linux-x86_64-2.7/pomegranate/FactorGraph.so
building 'pomegranate.distributions' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/distributions.c -o build/temp.linux-x86_64-2.7/pomegranate/distributions.o
In file included from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1809:0,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pomegranate/distributions.c:444:
/usr/lib64/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
gcc -pthread -shared -Wl,-z,relro build/temp.linux-x86_64-2.7/pomegranate/distributions.o -L. -lpython2.7 -o build/lib.linux-x86_64-2.7/pomegranate/distributions.so
building 'pomegranate.fsm' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/fsm.c -o build/temp.linux-x86_64-2.7/pomegranate/fsm.o
gcc: error: pomegranate/fsm.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: command 'gcc' failed with exit status 4


Failed building wheel for pomegranate
Running setup.py clean for pomegranate
Failed to build pomegranate
Installing collected packages: pomegranate, chardet, idna, urllib3, requests, doppelganger
Running setup.py install for pomegranate ... error
Complete output from command /bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-build-hVt89H/pomegranate/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-iPa6Ki-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/pomegranate
copying pomegranate/init.py -> build/lib.linux-x86_64-2.7/pomegranate
running build_ext
building 'pomegranate.base' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/pomegranate
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/base.c -o build/temp.linux-x86_64-2.7/pomegranate/base.o
[compiler warnings and intermediate extension builds identical to the wheel-build attempt above]
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/fsm.c -o build/temp.linux-x86_64-2.7/pomegranate/fsm.o
gcc: error: pomegranate/fsm.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: command 'gcc' failed with exit status 4

----------------------------------------

Command "/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-build-hVt89H/pomegranate/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-iPa6Ki-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-hVt89H/pomegranate/

issue with cvxpy.Variable functions in example code

Hi there, I am running into an issue during the allocation step when I try to run your example code. I am currently using Python 3 and I am getting the following error. Any help on this issue would be much appreciated- thanks so much!

TypeError Traceback (most recent call last)
in
----> 1 allocator = HouseholdAllocator.from_cleaned_data(controls, households_data, persons_data)

~/anaconda3/lib/python3.7/site-packages/doppelganger/allocation.py in from_cleaned_data(marginals, households_data, persons_data)
77 households_data.data, persons_data.data)
78 allocated_households, allocated_persons =
---> 79 HouseholdAllocator._allocate_households(households, persons, marginals)
80 return HouseholdAllocator(allocated_households, allocated_persons)
81

~/anaconda3/lib/python3.7/site-packages/doppelganger/allocation.py in _allocate_households(households, persons, tract_controls)
174
175 hh_weights = balance_multi_cvx(
--> 176 hh_table, A, B, w_extend, gamma * mu_extend.T, meta_gamma
177 )
178

~/anaconda3/lib/python3.7/site-packages/doppelganger/listbalancer.py in balance_multi_cvx(hh_table, A, B, w, mu, meta_mu, verbose_solver)
121
122 n_tracts = w.shape[0]
--> 123 x = cvx.Variable(n_tracts, n_samples)
124
125 # Relative weights of tracts

~/anaconda3/lib/python3.7/site-packages/cvxpy/expressions/variable.py in __init__(self, shape, name, var_id, **kwargs)
73 self._name = name
74 else:
---> 75 raise TypeError("Variable name %s must be a string." % name)
76
77 self._value = None

TypeError: Variable name 917 must be a string.

Error in doppelganger_example_simple.ipynb

Hello, while going over the simple example notebook in the examples folder I am getting the following error:
Missing data field state
After a few minutes digging through the code, I realized that the state field is part of the allocation.DEFAULT_HOUSEHOLD_FIELDS list. However, the households_00106_dirty.csv file does not contain a state column, so creating the households_data object throws the exception.

Am I doing something wrong or the csv file needs to be modified? The full traceback is below. Thanks!

Missing data field state
    KeyError Traceback (most recent call last)
    <ipython-input-24-5e4c8b217b33> in <module>()
    1 households_data = PumsData.from_csv('sample_data/households_00106_dirty.csv').clean(
    ----> 2     household_fields, preprocessor, puma=PUMA
    3 )
    4 
    5 persons_fields = tuple(set(
    /usr/local/lib/python2.7/dist-packages/doppelganger/datasource.pyc in clean(self, field_names, preprocessor, state, puma)
     30         if puma is not None:
     31             cleaned_data = cleaned_data[
---> 32                     (cleaned_data[inputs.STATE.name].astype(str) == str(state)) &
     33                     (cleaned_data[inputs.PUMA.name].astype(str) == str(puma))
     34                 ]

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   2057             return self._getitem_multilevel(key)
   2058         else:
-> 2059             return self._getitem_column(key)
   2060 
   2061     def _getitem_column(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   2064         # get column
   2065         if self.columns.is_unique:
-> 2066             return self._get_item_cache(key)
   2067 
   2068         # duplicate columns & possible reduce dimensionality

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1384         res = cache.get(item)
   1385         if res is None:
-> 1386             values = self._data.get(item)
   1387             res = self._box_item_values(item, values)
   1388             cache[item] = res

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3541 
   3542             if not isnull(item):
-> 3543                 loc = self.items.get_loc(item)
   3544             else:
   3545                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   2134                 return self._engine.get_loc(key)
   2135             except KeyError:
-> 2136                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2137 
   2138         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()

KeyError: u'state'

Generated household repeat_index incorrect

When running the example notebook, Doppelganger.ipynb, the population output in step 03 seems incorrect.

Pandas DataFrame for first 5 people:

         tract  serial_number  repeat_index    age sex individual_income
138842  422209           4431             0    65+   M               <=0
138843  422209           4431             1    65+   M               <=0
138897  422209           4431             0  35-64   F               <=0
138898  422209           4431             1  35-64   F           100000+
54123   422209          12930             0  35-64   M       40000-80000

Pandas DataFrame for first 5 households:

         tract  serial_number  repeat_index num_people household_income  \
0       422209           4431             0          2          <=40000   
1       422209           4431             1          2           40000+   
90603   422209           4431             0          2           40000+   
90604   422209           4431             1          2           40000+   
181206  422209           4431             0          2           40000+   

       num_vehicles  
0               1.0  
1               1.0  
90603           2.0  
90604           2.0  
181206          2.0 

I expected to see sequential, non-duplicate repeat indices for each (tract, serial_number) pair in households, i.e. the repeat_index column would read 0, 1, 2, 3, 4.
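For comparison, a small sketch of how sequential repeat indices per (tract, serial_number) pair could be produced with pandas; this is illustrative only, not Doppelganger's generation code:

```python
import pandas as pd

# Five duplicate household rows for one (tract, serial_number) pair.
df = pd.DataFrame({
    "tract": [422209] * 5,
    "serial_number": [4431] * 5,
})

# groupby().cumcount() numbers rows 0..n-1 within each group, giving the
# sequential, non-duplicate repeat_index the issue expects.
df["repeat_index"] = df.groupby(["tract", "serial_number"]).cumcount()
print(df["repeat_index"].tolist())  # -> [0, 1, 2, 3, 4]
```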

Modifying inputs.py

Hi everyone,

We are currently using doppelganger with our own regional data. The example works with our data, and the generated household table is exactly what we need. The only problem is that we need discrete numbers for some of the categories, in our case household_income and num_people (some of the values are categorical, but we need specific numbers).

We have downloaded the most recent version of doppelganger and have been using it via Jupyter Notebook. The doppelganger full example mentions editing inputs.py to adjust output variables. After modifying the inputs.py file and re-running the example, we noticed the outputs do not change at all. Are we supposed to modify the inputs.py file within our download, or is there another inputs.py we should be working with?

To clarify, we have our doppelganger location at 'C:\Users\someUser\doppelgangerCU' and we've been modifying the inputs at 'C:\Users\someUser\doppelgangerCU\doppelganger\inputs.py'.

We'd appreciate any help, thanks!
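A likely cause is that Python imports the pip-installed copy of doppelganger rather than the checkout being edited. One way to check which file is actually loaded (shown with a standard-library module as a stand-in for doppelganger.inputs):

```python
import importlib

# Substitute "doppelganger.inputs" for "json" in your environment. If the
# printed path points into site-packages rather than your checkout, edits
# to the checkout are being ignored at import time.
mod = importlib.import_module("json")
print(mod.__file__)
```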

marginals dtypes

I came across this issue in doppelganger_example_full.ipynb when creating the marginals from the census data

new_marginal_filename = os.path.join(output_dir, 'new_marginals.csv')

with open('sample_data/2010_puma_tract_mapping.txt') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    marginals = Marginals.from_census_data(
        csv_reader, CENSUS_KEY, state=STATE, pumas=PUMA
    )
    marginals.write(new_marginal_filename)

and passing marginals directly to the HouseholdAllocator as in:

allocator = HouseholdAllocator.from_cleaned_data(marginals, households_data, persons_data)

yields:

Truncated Traceback (Use C-c C-x to view full TB):
/home/martibosch/activitysim/src/doppelganger/doppelganger/allocation.pyc in _allocate_households(households, persons, tract_controls)
    163         w_extend = np.tile(w, (n_tracts, 1))
    164         mu_extend = np.mat(np.tile(mu, (n_tracts, 1)))
--> 165         B = np.mat(np.dot(np.ones((1, n_tracts)), A)[0])
    166 
    167         # Our trade-off coefficient gamma

TypeError: can't multiply sequence by non-int of type 'float'

(This does not happen when reading marginals from a csv, i.e. marginals = Marginals.from_csv(new_marginal_filename), since the types are correctly inferred.)

So I guess this could be fixed by explicitly controlling the dtypes as in:

modified   doppelganger/marginals.py
@@ -165,8 +165,12 @@ class Marginals(object):
                     output.append(str(controls_dict[control_name]))
                 data.append(output)
 
-        columns = ['STATEFP', 'COUNTYFP', 'PUMA5CE', 'TRACTCE'] + list(CONTROL_NAMES)
-        return Marginals(pandas.DataFrame(data, columns=columns))
+        code_columns = ['STATEFP', 'COUNTYFP', 'PUMA5CE', 'TRACTCE']
+        control_columns = list(CONTROL_NAMES)
+        marginals_df = pandas.DataFrame(data, columns=code_columns + control_columns)
+        marginals_df[code_columns] = marginals_df[code_columns].astype(str)
+        marginals_df[control_columns] = marginals_df[control_columns].astype(int)
+        return Marginals(marginals_df)

This conflicts with the test MarginalsTest.test_fetch_marginals, but the test could be easily fixed since it is only a matter of python strings:

_________________________________________________________ MarginalsTest.test_fetch_marginals _________________________________________________________

self = <test_marginals.MarginalsTest testMethod=test_fetch_marginals>

    def test_fetch_marginals(self):
        state = self._mock_marginals_file()[0]['STATEFP']
        puma = self._mock_marginals_file()[0]['PUMA5CE']
        with patch('doppelganger.marginals.Marginals._fetch_from_census',
                   return_value=self._mock_response()):
            marg = Marginals.from_census_data(
                    puma_tract_mappings=self._mock_marginals_file(), census_key=None,
                    state=state, pumas=set([puma])
                )
        expected = {
            'STATEFP': '06',
            'COUNTYFP': '075',
            'PUMA5CE': '07507',
            'TRACTCE': '023001',
            'age_0-17': '909',
            'age_18-34': '1124',
            'age_65+': '713',
            'age_35-64': '2334',
            'num_people_count': '1335',
            'num_people_1': '168',
            'num_people_3': '304',
            'num_people_2': '341',
            'num_people_4+': '522',
            'num_vehicles_0': '0',
            'num_vehicles_1': '1',
            'num_vehicles_2': '2',
            'num_vehicles_3+': '3'
        }
        result = marg.data.loc[0].to_dict()
>       self.assertDictEqual(result, expected)
E       AssertionError: {u'num_people_4+': 522, u'num_people_3': 304, u'num_people_2': 341, u'num_people [truncated]... != {u'num_people_4+': u'522', u'age_18-34': u'1124', u'num_people_1': u'168', u'age [truncated]...
E       Diff is 1298 characters long. Set self.maxDiff to None to see it.

test/test_marginals.py:86: AssertionError

I guess more issues of this type could be encountered, so perhaps there should be an overall strategy for handling column dtypes.
If you agree, I could spend some time on it and submit a PR :)

Memory pressure in generation

Generation can fail because of memory pressure, eg on PUMA 05302

  File "scripts/generate_all_pumas.py", line 158, in generate_population
    population = Population.generate(allocator, person_model, household_model)
  File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/populationgen.py", line 93, in generate
    person_model, [inputs.AGE.name, inputs.SEX.name], Population._extract_person_evidence
  File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/populationgen.py", line 76, in _generate_from_model
    results_dataframe = pandas.DataFrame(results, columns=column_names)
  File "/home/kat/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 314, in __init__
    arrays, columns = _to_arrays(data, columns, dtype=dtype)
  File "/home/kat/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 5715, in _to_arrays
    dtype=dtype)
  File "/home/kat/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 5789, in _list_to_arrays
    content = list(lib.to_object_array_tuples(data).T)
  File "pandas/_libs/src/inference.pyx", line 1660, in pandas._libs.lib.to_object_array_tuples (pandas/_libs/lib.c:67515)
MemoryError
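One possible mitigation, sketched here with illustrative names rather than Doppelganger's API, is to assemble the results DataFrame in chunks instead of materializing one large list of tuples, which is where the object-array conversion in the traceback spikes memory:

```python
import pandas as pd

def frame_in_chunks(row_iter, columns, chunk_size=10000):
    """Assemble a DataFrame from an iterator of row tuples, chunk by chunk.

    Hypothetical helper for illustration, not part of doppelganger.
    """
    chunks, buf = [], []
    for row in row_iter:
        buf.append(row)
        if len(buf) >= chunk_size:
            chunks.append(pd.DataFrame(buf, columns=columns))
            buf = []
    if buf:
        chunks.append(pd.DataFrame(buf, columns=columns))
    return pd.concat(chunks, ignore_index=True)

rows = ((i, i * 2) for i in range(10))
df = frame_in_chunks(rows, ["a", "b"], chunk_size=4)
print(len(df))  # -> 10
```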

doppelganger.person_structure -> pomegranate.BayesianNetwork

Running into the following error when trying to train the bayesnet from the simple example notebook.

    194                 # Make defensive copy
    195                 data = list(data) + list(prior_data)
--> 196             bayesian_network = BayesianNetwork.from_structure(data, structure)
    197             type_to_network[type_] = bayesian_network
    198         return BayesianNetworkModel(type_to_network, fields, segmenter=input_data.segmenter)

pomegranate/BayesianNetwork.pyx in pomegranate.BayesianNetwork.BayesianNetwork.from_structure()

TypeError: unsupported operand type(s) for +: 'frozenset' and 'tuple'

Seems to be coming from the following line in pomegranate's BayesianNetwork.pyx -- line 820

    nodes[i] = ConditionalProbabilityTable.from_samples(X[:,parents+(i,)],
        parents=[nodes[parent] for parent in parents],
        weights=weights, pseudocount=pseudocount)

Where i and parents are directly from Doppelganger's configuration.person_structure,

for i, parents in enumerate(structure):

Anyone else come across this one? I've tried upgrading pomegranate to 0.7.7 and installing from source (ignoring pip's cache) and still get the same issue.
Setup: OSX Yosemite, pomegranate 0.7.1
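The failing expression parents + (i,) requires each structure entry to be a tuple; if configuration.person_structure yields frozensets, converting the structure before calling from_structure may sidestep the error. A hedged workaround sketch, not a confirmed fix:

```python
# If each entry of the structure is a frozenset of parent indices, tuple
# concatenation inside pomegranate (parents + (i,)) raises the TypeError
# above. Converting each entry to a sorted tuple first makes it valid.
structure = (frozenset(), frozenset([0]), frozenset([0, 1]))
tuple_structure = tuple(tuple(sorted(parents)) for parents in structure)

for i, parents in enumerate(tuple_structure):
    combined = parents + (i,)  # valid tuple concatenation now

print(tuple_structure)  # -> ((), (0,), (0, 1))
```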

IEEE Computer Society Magazine for an upcoming edition on Governments in the Age of Big Data and Smart Cities (December 2018)

I am a guest editor for IEEE Computer Society Magazine for an upcoming edition on Governments in the Age of Big Data and Smart Cities (December 2018). I wanted to see if the doppelganger team would be interested in submitting an article showcasing your work and why using data for city planning/management is going to be critical in the future.

If interested, please contact me at [email protected]; I would be happy to answer any questions.

https://publications.computer.org/computer-magazine/2018/01/08/governments-age-big-data-smart-cities-call-papers/

1 year vs 5 year PUMS input difference

Hello,

I was experimenting with doppelganger using 1-year (2015) and 5-year (2011-2015) ACS PUMS records and arrived at very different sets of household/person records. The 5-year PUMS has a larger sample size but resulted in smaller sets of households/persons. I would appreciate any explanation, as I might not be understanding or using the tool correctly.

You can find the inputs here, the outputs here, and the notebook here.

Thanks!
Shuake

Segmenting on an input variable that allows the None type causes a sorting error

Traceback (most recent call last):
File "doppelganger/scripts/download_allocate_generate.py", line 332, in
main()
File "doppelganger/scripts/download_allocate_generate.py", line 317, in main
person_segmenter, household_segmenter
File "doppelganger/scripts/download_allocate_generate.py", line 181, in create_bayes_net
household_model.write(household_model_filename)
File "/Users/six/code/doppelganger/doppelganger/bayesnets.py", line 92, in write
json_string = self.to_json()
File "/Users/six/code/doppelganger/doppelganger/bayesnets.py", line 100, in to_json
return json.dumps(blob, indent=4, sort_keys=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/init.py", line 250, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 209, in encode
chunks = list(chunks)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 434, in _iterencode
for chunk in _iterencode_dict(o, _current_indent_level):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 361, in _iterencode_dict
items = sorted(dct.items(), key=lambda kv: kv[0])
File "/Library/Python/2.7/site-packages/future-0.16.0-py2.7.egg/future/types/newstr.py", line 316, in gt
raise TypeError(self.unorderable_err.format(type(other)))
TypeError: unorderable types: str() and <type 'NoneType'>
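The failure is in `json.dumps(blob, indent=4, sort_keys=True)`: with `sort_keys=True`, Python cannot order a dict whose keys mix `str` and `None`. One possible workaround (hypothetical, not part of doppelganger's API) is to wrap the segmenter so it returns a sentinel string instead of `None`, keeping every segment key sortable and serializable:

```python
import json

# Hypothetical workaround: map None segments to a sentinel string so that
# json.dumps(..., sort_keys=True) never has to compare str with NoneType.
def safe_segmenter(segmenter, sentinel='__none__'):
    def wrapped(row):
        segment = segmenter(row)
        return sentinel if segment is None else segment
    return wrapped

# Illustrative segmenter on a made-up field; may return None for some rows.
raw_segmenter = lambda row: row.get('individual_income')
segmenter = safe_segmenter(raw_segmenter)

blob = {segmenter({'individual_income': None}): 1,
        segmenter({'individual_income': '<=0'}): 2}
print(json.dumps(blob, sort_keys=True))  # serializes without a TypeError
```

The sentinel needs to be a value that cannot collide with a real segment name in your data.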

Links to References

Using Chrome Version 58.0.3029.110 (64-bit), I wasn't able to access either the first or second reference cited as inspiration for Doppelganger. The first link never resolved, and the second said "unable to load document" (however, the dialog box offered an option to reload, and it worked the second time).

Length mismatch

Hi

I'm using the sample data and trying to run the example provided in doppelganger_example_full.ipynb. However, I get this error and cannot figure out what the problem is. Can you please help me with it?

allocator = HouseholdAllocator.from_cleaned_data(controls, households_data, persons_data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/doppelganger/allocation.py", line 77, in from_cleaned_data
    households_data.data, persons_data.data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/doppelganger/allocation.py", line 212, in _format_data
    ._str_broadcast(inputs.AGE.name, list(inputs.AGE.possible_values))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 4385, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 645, in _set_axis
    self._data.set_axis(axis, labels)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 3323, in set_axis
    'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 0 elements, new values have 4 elements
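"Expected axis has 0 elements, new values have 4 elements" usually means a DataFrame upstream is empty, so assigning four values onto zero rows fails. A common cause with this library is a state/PUMA filter that matches nothing (for example `6` vs `06`). A minimal pre-flight check along these lines can surface the empty frame before allocation; the names here follow the example notebook and are assumptions:

```python
import pandas as pd

# Sketch of a sanity check: fail early, with a readable message, if the
# cleaned households or persons data came back empty after filtering.
def check_not_empty(name, df):
    if len(df) == 0:
        raise ValueError(
            '{} is empty; check that the STATE/PUMA codes you passed match '
            'the data (leading zeros matter)'.format(name))
    return df

households = pd.DataFrame({'serial_number': [1], 'num_people': ['2']})
check_not_empty('households_data', households)   # passes through unchanged
try:
    check_not_empty('persons_data', households.iloc[0:0])  # empty slice
except ValueError as e:
    print(e)
```

Checking `households_data.data` and `persons_data.data` this way right after cleaning narrows the problem to the filtering step rather than deep inside `HouseholdAllocator`.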

Generated population doesn't match input controls

Running the example notebook with the included marginals control file, I noticed a discrepancy between the num_people_count field and the resulting number of synthetic households.

The total number of households in tracts contained in the marginals file controls.data['num_people_count'].sum() is 46,945. (The name of this field is also somewhat misleading, because ACS table B11016 is a table of households by number of people, not the number of people, but that's not the issue here). When I generate the population for the PUMA in the example, the resulting population population.generated_households['household_id'].count() is 73,644. BTW, the total weighted households in the PUMS data is 97,841.

I wanted to see if this error was sensitive to the marginals file. So I deleted all but the first nine tracts in the file, whittling the number of households in the included tracts to 16,889. In this case, doppelganger returned a population with 54,421 households.

Is there an additional step where I need to downsample the synthetic population to match the marginal targets? Is there something that I don't understand? I've included my script in this gist; I used the most recent commit on master, running in Python 3.

python3 accuracy.py
INFO:__main__:Loading configuration and data
INFO:__main__:Loading model
INFO:__main__:File 	 	 PUMS 	 Controls 	 Generated
INFO:__main__:sample_data/marginals_00106.csv 	 	 97841 	 46945 	 73644
INFO:__main__:sample_data/marginals_00106_modified.csv 	 	 97841 	 16889 	 54421

Broadcast error in list balancer

Error in puma 03706
Traceback (most recent call last):
  File "scripts/generate_all_pumas.py", line 123, in generate_population
    marginals, households, persons)
  File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/allocation.py", line 73, in from_cleaned_data
    HouseholdAllocator._allocate_households(households, persons, marginals)
  File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/allocation.py", line 147, in _allocate_households
    hh_table, A, B, w_extend, gamma * mu_extend.T, meta_gamma
  File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/listbalancer.py", line 168, in balance_multi_cvx
    weights_out = np.insert(weights_out, zero_marginals, zero_weights, 0)
  File "/home/kat/.local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 4910, in insert
    new[slobj] = values
ValueError: could not broadcast input array from shape (1,4) into shape (1,799)

keep leading zeros in code columns - dtypes

In dataframes, columns corresponding to state and PUMA codes should preserve their leading zeros; e.g. 00106 becomes 106 when pandas automatically infers numeric types.
As encountered with doppelganger_example_simple.ipynb, this can lead to wrong filter results in dataframes, e.g. in lines 31-34 of doppelganger/datasource.py:

cleaned_data = cleaned_data[
    (cleaned_data[inputs.STATE.name].astype(str) == str(state)) &
    (cleaned_data[inputs.PUMA.name].astype(str) == str(puma))
]

the left-hand side can become '106' (due to pandas' automatic dtype inference) whereas the right-hand side (passed by the user) is '00106'.
As remarked in #59, there should be a general strategy to control the column dtypes.
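One way to sidestep the inference (a sketch; the column names are taken from this issue and may differ from your files) is to force string dtypes on the code columns at load time, or to re-pad codes that were already parsed as integers:

```python
import io
import pandas as pd

# Force string dtypes on code columns so '00106' is never parsed as 106.
csv = io.StringIO('STATE,PUMA,households\n06,00106,100\n')
df = pd.read_csv(csv, dtype={'STATE': str, 'PUMA': str})
assert df['PUMA'].iloc[0] == '00106'

# If a file was already read with inferred dtypes, zero-pad the codes back
# to their fixed widths (2 for state FIPS, 5 for PUMA).
df['STATE'] = df['STATE'].astype(str).str.zfill(2)
df['PUMA'] = df['PUMA'].astype(str).str.zfill(5)
```

With both sides held as zero-padded strings, the equality filter in datasource.py compares like with like.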
