replicahq / doppelganger
A Python package of tools to support population synthesizers
License: Apache License 2.0
Hello,
I was experimenting with doppelganger using 1-year (2015) and 5-year (2011-2015) ACS PUMS records, and arrived at very different sets of household/person records. The 5-year PUMS file has a larger sample size, yet it resulted in smaller sets of households and persons. I would appreciate any explanation, as I might not be understanding or using the tool correctly.
You can find the inputs here, the outputs here, and the notebook here.
Thanks!
Shuake
Using Chrome Version 58.0.3029.110 (64-bit), I wasn't able to access either the first or second reference cited as inspiration for Doppelganger. The first link never resolved, and the second said "unable to load document" (however, the dialog box offered an option to reload, and it worked the second time).
Running the example notebook with the included marginals control file, I noticed a discrepancy between the num_people_count field and the resulting number of synthetic households.
The total number of households in the tracts contained in the marginals file, controls.data['num_people_count'].sum(), is 46,945. (The name of this field is also somewhat misleading, because ACS table B11016 is a table of households by number of people, not a count of people, but that's not the issue here.) When I generate the population for the PUMA in the example, the resulting count, population.generated_households['household_id'].count(), is 73,644. For reference, the total weighted household count in the PUMS data is 97,841.
I wanted to see if this error was sensitive to the marginals file, so I deleted all but the first nine tracts, whittling the number of households in the included tracts down to 16,889. In this case, doppelganger returned a population with 54,421 households.
Is there an additional step where I need to downsample the synthetic population to match the marginal targets? Or is there something I don't understand? I've included my script in this gist; I used the most recent commit on master, running under Python 3.
python3 accuracy.py
INFO:__main__:Loading configuration and data
INFO:__main__:Loading model
INFO:__main__:File PUMS Controls Generated
INFO:__main__:sample_data/marginals_00106.csv 97841 46945 73644
INFO:__main__:sample_data/marginals_00106_modified.csv 97841 16889 54421
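The comparison in the log lines above boils down to the following check. The `controls_data` and `generated_households` frames here are tiny hypothetical stand-ins, not the real sample data; in doppelganger these come from the Marginals and Population objects.

```python
import pandas as pd

# Hypothetical stand-ins for controls.data and population.generated_households.
controls_data = pd.DataFrame({"num_people_count": [10, 20, 15]})
generated_households = pd.DataFrame({"household_id": list(range(60))})

# The two totals being compared: marginal household targets vs. generated rows.
marginal_total = controls_data["num_people_count"].sum()
generated_total = generated_households["household_id"].count()
overshoot = generated_total / marginal_total
```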
@DavidOry requested 'reaching out' via GitHub. I have a non-technical query about presenting Doppelgänger to an audience in Australia. How might I best proceed with my communication?
Hello, while going over the simple example notebook in the examples folder, I get the following error:
Missing data field state
After a few minutes digging through the code, I realized that the state field is part of the list allocation.DEFAULT_HOUSEHOLD_FIELDS. However, the households_00106_dirty.csv file does not contain a state column, so creating the households_data object throws the exception.
Am I doing something wrong, or does the csv file need to be modified? The full traceback is below. Thanks!
Missing data field state
KeyError Traceback (most recent call last)
<ipython-input-24-5e4c8b217b33> in <module>()
1 households_data = PumsData.from_csv('sample_data/households_00106_dirty.csv').clean(
----> 2 household_fields, preprocessor, puma=PUMA
3 )
4
5 persons_fields = tuple(set(
/usr/local/lib/python2.7/dist-packages/doppelganger/datasource.pyc in clean(self, field_names, preprocessor, state, puma)
30 if puma is not None:
31 cleaned_data = cleaned_data[
---> 32 (cleaned_data[inputs.STATE.name].astype(str) == str(state)) &
33 (cleaned_data[inputs.PUMA.name].astype(str) == str(puma))
34 ]
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
2057 return self._getitem_multilevel(key)
2058 else:
-> 2059 return self._getitem_column(key)
2060
2061 def _getitem_column(self, key):
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
2064 # get column
2065 if self.columns.is_unique:
-> 2066 return self._get_item_cache(key)
2067
2068 # duplicate columns & possible reduce dimensionality
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
1384 res = cache.get(item)
1385 if res is None:
-> 1386 values = self._data.get(item)
1387 res = self._box_item_values(item, values)
1388 cache[item] = res
/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
3541
3542 if not isnull(item):
-> 3543 loc = self.items.get_loc(item)
3544 else:
3545 indexer = np.arange(len(self.items))[isnull(self.items)]
/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
2134 return self._engine.get_loc(key)
2135 except KeyError:
-> 2136 return self._engine.get_loc(self._maybe_cast_indexer(key))
2137
2138 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)()
pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)()
pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()
KeyError: u'state'
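A possible workaround, assuming the example only needs the column to exist with the right FIPS code, is to add a constant state column to the raw CSV before calling clean. The file contents and the state code '06' below are illustrative stand-ins, not the real sample data:

```python
import io
import pandas as pd

# Tiny stand-in for households_00106_dirty.csv, which lacks a `state` column;
# the real file has many more PUMS fields.
raw = io.StringIO("serialno,puma,num_people\n100,00106,2\n101,00106,3\n")
households = pd.read_csv(raw, dtype=str)

# Add the missing column with the state FIPS code the notebook filters on
# ('06' is California; adjust for your own data), then write the patched file.
households["state"] = "06"
households.to_csv("households_00106_patched.csv", index=False)
```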
def test_balance_cvx_relaxed(self):
hh_table, A, w, mu, expected_weights = self._mock_list_relaxed()
hh_weights, _ = listbalancer.balance_cvx(hh_table, A, w, mu)
np.testing.assert_allclose(
> hh_weights, expected_weights, rtol=0.01, atol=0)
E AssertionError:
E Not equal to tolerance rtol=0.01, atol=0
E
E (mismatch 100.0%)
E x: matrix([[ 29.798231],
E [ 37.155819],
E [ 55.549788],
E [157.820203]])
E y: matrix([[45.],
E [52.],
E [65.],
E [98.]])
test/test_listbalancer.py:220: AssertionError
I've tried downgrading cvxpy, numpy, and pandas to known-good versions, but those older versions no longer install cleanly, so there is currently no way to get this into a working state.
Trying to get my setup working. I am running the pip install command on a CentOS server.
I checked my gcc version and it does not require an upgrade at the moment:
[camidus@urbanizer doppelganger]$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
Errors returned below:
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/base.c -o build/temp.linux-x86_64-2.7/pomegranate/base.o
In file included from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1809:0,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pomegranate/base.c:444:
/usr/lib64/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
gcc -pthread -shared -Wl,-z,relro build/temp.linux-x86_64-2.7/pomegranate/base.o -L. -lpython2.7 -o build/lib.linux-x86_64-2.7/pomegranate/base.so
building 'pomegranate.BayesianNetwork' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/BayesianNetwork.c -o build/temp.linux-x86_64-2.7/pomegranate/BayesianNetwork.o
In file included from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1809:0,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pomegranate/BayesianNetwork.c:444:
/usr/lib64/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
gcc -pthread -shared -Wl,-z,relro build/temp.linux-x86_64-2.7/pomegranate/BayesianNetwork.o -L. -lpython2.7 -o build/lib.linux-x86_64-2.7/pomegranate/BayesianNetwork.so
building 'pomegranate.FactorGraph' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/FactorGraph.c -o build/temp.linux-x86_64-2.7/pomegranate/FactorGraph.o
In file included from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1809:0,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pomegranate/FactorGraph.c:444:
/usr/lib64/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
gcc -pthread -shared -Wl,-z,relro build/temp.linux-x86_64-2.7/pomegranate/FactorGraph.o -L. -lpython2.7 -o build/lib.linux-x86_64-2.7/pomegranate/FactorGraph.so
building 'pomegranate.distributions' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/distributions.c -o build/temp.linux-x86_64-2.7/pomegranate/distributions.o
In file included from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1809:0,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:18,
from /usr/lib64/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4,
from pomegranate/distributions.c:444:
/usr/lib64/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
#warning "Using deprecated NumPy API, disable it by "
^
gcc -pthread -shared -Wl,-z,relro build/temp.linux-x86_64-2.7/pomegranate/distributions.o -L. -lpython2.7 -o build/lib.linux-x86_64-2.7/pomegranate/distributions.so
building 'pomegranate.fsm' extension
gcc -pthread -fno-strict-aliasing -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -D_GNU_SOURCE -fPIC -fwrapv -fPIC -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include -c pomegranate/fsm.c -o build/temp.linux-x86_64-2.7/pomegranate/fsm.o
gcc: error: pomegranate/fsm.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: command 'gcc' failed with exit status 4
Failed building wheel for pomegranate
Running setup.py clean for pomegranate
Failed to build pomegranate
Installing collected packages: pomegranate, chardet, idna, urllib3, requests, doppelganger
Running setup.py install for pomegranate ... error
Complete output from command /bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-hVt89H/pomegranate/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-iPa6Ki-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/pomegranate
copying pomegranate/__init__.py -> build/lib.linux-x86_64-2.7/pomegranate
running build_ext
building 'pomegranate.base' extension
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/pomegranate
[... gcc compile output for the base, BayesianNetwork, FactorGraph, and distributions extensions, identical to the wheel build above ...]
building 'pomegranate.fsm' extension
gcc: error: pomegranate/fsm.c: No such file or directory
gcc: fatal error: no input files
compilation terminated.
error: command 'gcc' failed with exit status 4
----------------------------------------
Command "/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-hVt89H/pomegranate/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-iPa6Ki-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-hVt89H/pomegranate/
Generation can fail because of memory pressure, e.g. on PUMA 05302:
File "scripts/generate_all_pumas.py", line 158, in generate_population
population = Population.generate(allocator, person_model, household_model)
File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/populationgen.py", line 93, in generate
person_model, [inputs.AGE.name, inputs.SEX.name], Population._extract_person_evidence
File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/populationgen.py", line 76, in _generate_from_model
results_dataframe = pandas.DataFrame(results, columns=column_names)
File "/home/kat/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 314, in __init__
arrays, columns = _to_arrays(data, columns, dtype=dtype)
File "/home/kat/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 5715, in _to_arrays
dtype=dtype)
File "/home/kat/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 5789, in _list_to_arrays
content = list(lib.to_object_array_tuples(data).T)
File "pandas/_libs/src/inference.pyx", line 1660, in pandas._libs.lib.to_object_array_tuples (pandas/_libs/lib.c:67515)
MemoryError
Error in puma 03706
Traceback (most recent call last):
File "scripts/generate_all_pumas.py", line 123, in generate_population
marginals, households, persons)
File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/allocation.py", line 73, in from_cleaned_data
HouseholdAllocator._allocate_households(households, persons, marginals)
File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/allocation.py", line 147, in _allocate_households
hh_table, A, B, w_extend, gamma * mu_extend.T, meta_gamma
File "/home/kat/.local/lib/python2.7/site-packages/doppelganger/listbalancer.py", line 168, in balance_multi_cvx
weights_out = np.insert(weights_out, zero_marginals, zero_weights, 0)
File "/home/kat/.local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 4910, in insert
new[slobj] = values
ValueError: could not broadcast input array from shape (1,4) into shape (1,799)
For tests to pass, we must pin pomegranate to 0.8.1. With newer versions, pretty much every test that uses pomegranate fails.
I am a guest editor for IEEE Computer Society magazine for an upcoming edition on Governments in the Age of Big Data and Smart Cities (December 2018). I wanted to see if the doppelganger team would be interested in submitting an article showcasing your work and why using data for city planning and management is going to be critical in the future.
If interested, please contact me at [email protected], and I would be happy to answer any questions.
I'm not sure if Python 3 support is in your plans, but if so, I ran into a few issues when trying to run this package under Python 3.6:
- unicode (used in inputs.py) is not supported in Python 3. I think this can be addressed by adding from builtins import str and replacing unicode with str.
- xrange (used in bayesnets.py) is not supported in Python 3. Could either do from builtins import range and replace xrange with range, or use from past.builtins import xrange.
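Both fixes can also be made without the future/past compatibility packages by aliasing the missing builtins conditionally; a minimal sketch:

```python
import sys

# On Python 3, `unicode` and `xrange` no longer exist as builtins, so alias
# them to their Python 3 equivalents; on Python 2 this block is skipped.
if sys.version_info[0] >= 3:
    unicode = str
    xrange = range

label = unicode("00106")
indices = list(xrange(3))
```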
Right now we only include the tract part of the ID, which isn't guaranteed to be unique. This should be threaded through the code and included in the household_id.
Numpy requires a bunch of dependencies to work
I came across this issue in doppelganger_example_full.ipynb when creating the marginals from the census data:
new_marginal_filename = os.path.join(output_dir, 'new_marginals.csv')
with open('sample_data/2010_puma_tract_mapping.txt') as csv_file:
csv_reader = csv.DictReader(csv_file)
marginals = Marginals.from_census_data(
csv_reader, CENSUS_KEY, state=STATE, pumas=PUMA
)
marginals.write(new_marginal_filename)
and passing marginals directly to the HouseholdAllocator, as in:
allocator = HouseholdAllocator.from_cleaned_data(marginals, households_data, persons_data)
yields:
Truncated Traceback (Use C-c C-x to view full TB):
/home/martibosch/activitysim/src/doppelganger/doppelganger/allocation.pyc in _allocate_households(households, persons, tract_controls)
163 w_extend = np.tile(w, (n_tracts, 1))
164 mu_extend = np.mat(np.tile(mu, (n_tracts, 1)))
--> 165 B = np.mat(np.dot(np.ones((1, n_tracts)), A)[0])
166
167 # Our trade-off coefficient gamma
TypeError: can't multiply sequence by non-int of type 'float'
(This does not happen when reading marginals from a csv, i.e. marginals = Marginals.from_csv(new_marginal_filename), since the types are correctly inferred.)
So I guess this could be fixed by explicitly controlling the dtypes, as in:
modified doppelganger/marginals.py
@@ -165,8 +165,12 @@ class Marginals(object):
output.append(str(controls_dict[control_name]))
data.append(output)
-        columns = ['STATEFP', 'COUNTYFP', 'PUMA5CE', 'TRACTCE'] + list(CONTROL_NAMES)
-        return Marginals(pandas.DataFrame(data, columns=columns))
+        code_columns = ['STATEFP', 'COUNTYFP', 'PUMA5CE', 'TRACTCE']
+        control_columns = list(CONTROL_NAMES)
+        marginals_df = pandas.DataFrame(data, columns=code_columns + control_columns)
+        marginals_df[code_columns] = marginals_df[code_columns].astype(str)
+        marginals_df[control_columns] = marginals_df[control_columns].astype(int)
+        return Marginals(marginals_df)
This conflicts with the test MarginalsTest.test_fetch_marginals, but the test could easily be fixed, since it is only a matter of Python strings:
_________________________________________________________ MarginalsTest.test_fetch_marginals _________________________________________________________
self = <test_marginals.MarginalsTest testMethod=test_fetch_marginals>
def test_fetch_marginals(self):
state = self._mock_marginals_file()[0]['STATEFP']
puma = self._mock_marginals_file()[0]['PUMA5CE']
with patch('doppelganger.marginals.Marginals._fetch_from_census',
return_value=self._mock_response()):
marg = Marginals.from_census_data(
puma_tract_mappings=self._mock_marginals_file(), census_key=None,
state=state, pumas=set([puma])
)
expected = {
'STATEFP': '06',
'COUNTYFP': '075',
'PUMA5CE': '07507',
'TRACTCE': '023001',
'age_0-17': '909',
'age_18-34': '1124',
'age_65+': '713',
'age_35-64': '2334',
'num_people_count': '1335',
'num_people_1': '168',
'num_people_3': '304',
'num_people_2': '341',
'num_people_4+': '522',
'num_vehicles_0': '0',
'num_vehicles_1': '1',
'num_vehicles_2': '2',
'num_vehicles_3+': '3'
}
result = marg.data.loc[0].to_dict()
> self.assertDictEqual(result, expected)
E AssertionError: {u'num_people_4+': 522, u'num_people_3': 304, u'num_people_2': 341, u'num_people [truncated]... != {u'num_people_4+': u'522', u'age_18-34': u'1124', u'num_people_1': u'168', u'age [truncated]...
E Diff is 1298 characters long. Set self.maxDiff to None to see it.
test/test_marginals.py:86: AssertionError
I guess more issues of this type could be encountered, so perhaps there should be an overall strategy for dealing with column dtypes.
If you agree, I could spend some time on it and submit a PR :)
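One possible overall strategy, sketched below with a hypothetical one-row marginals frame: keep geographic code columns as strings (preserving leading zeros) and cast every control column to int in one place.

```python
import pandas as pd

# Hypothetical marginals row shaped like the Marginals.from_census_data output,
# where every value initially arrives as a string.
code_columns = ["STATEFP", "COUNTYFP", "PUMA5CE", "TRACTCE"]
control_columns = ["num_people_count", "age_0-17"]
data = [["06", "075", "07507", "023001", "1335", "909"]]
marginals_df = pd.DataFrame(data, columns=code_columns + control_columns)

# Codes stay strings so '06' does not become 6; controls become ints so the
# solver's matrix arithmetic does not choke on sequences of strings.
marginals_df[code_columns] = marginals_df[code_columns].astype(str)
marginals_df[control_columns] = marginals_df[control_columns].astype(int)
```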
When running the example notebook, Doppelganger.ipynb, the population output in step 03 seems incorrect.
Pandas DataFrame for first 5 people:
tract serial_number repeat_index age sex individual_income
138842 422209 4431 0 65+ M <=0
138843 422209 4431 1 65+ M <=0
138897 422209 4431 0 35-64 F <=0
138898 422209 4431 1 35-64 F 100000+
54123 422209 12930 0 35-64 M 40000-80000
Pandas DataFrame for first 5 households:
tract serial_number repeat_index num_people household_income \
0 422209 4431 0 2 <=40000
1 422209 4431 1 2 40000+
90603 422209 4431 0 2 40000+
90604 422209 4431 1 2 40000+
181206 422209 4431 0 2 40000+
num_vehicles
0 1.0
1 1.0
90603 2.0
90604 2.0
181206 2.0
I expected to see sequential, non-duplicate repeat indices for (tract, serial_number) pairs in households, i.e. the repeat_index column would be 0, 1, 2, 3, 4.
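The expectation can be stated with a groupby/cumcount over hypothetical rows: repeated copies of the same (tract, serial_number) template household should be numbered sequentially.

```python
import pandas as pd

# Five hypothetical copies of the same template household.
df = pd.DataFrame({
    "tract": ["422209"] * 5,
    "serial_number": ["4431"] * 5,
})

# cumcount enumerates rows within each (tract, serial_number) group, which is
# the sequential repeat_index the output above should have shown.
df["repeat_index"] = df.groupby(["tract", "serial_number"]).cumcount()
```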
TypeError Traceback (most recent call last)
in
----> 1 allocator = HouseholdAllocator.from_cleaned_data(controls, households_data, persons_data)
~/anaconda3/lib/python3.7/site-packages/doppelganger/allocation.py in from_cleaned_data(marginals, households_data, persons_data)
77 households_data.data, persons_data.data)
78 allocated_households, allocated_persons =
---> 79 HouseholdAllocator._allocate_households(households, persons, marginals)
80 return HouseholdAllocator(allocated_households, allocated_persons)
81
~/anaconda3/lib/python3.7/site-packages/doppelganger/allocation.py in _allocate_households(households, persons, tract_controls)
174
175 hh_weights = balance_multi_cvx(
--> 176 hh_table, A, B, w_extend, gamma * mu_extend.T, meta_gamma
177 )
178
~/anaconda3/lib/python3.7/site-packages/doppelganger/listbalancer.py in balance_multi_cvx(hh_table, A, B, w, mu, meta_mu, verbose_solver)
121
122 n_tracts = w.shape[0]
--> 123 x = cvx.Variable(n_tracts, n_samples)
124
125 # Relative weights of tracts
~/anaconda3/lib/python3.7/site-packages/cvxpy/expressions/variable.py in __init__(self, shape, name, var_id, **kwargs)
73 self._name = name
74 else:
---> 75 raise TypeError("Variable name %s must be a string." % name)
76
77 self._value = None
TypeError: Variable name 917 must be a string.
Right now households are unique by the (tract, serialno, repeatno) tuple. They should just have a unique ID
Ensure that the possible bins of the variables coming from inputs.py and used in the solver align with their marginal equivalents. This includes non-override checks coming out of the config.
Running into the following error when trying to train the bayesnet from the simple example notebook.
194 # Make defensive copy
195 data = list(data) + list(prior_data)
--> 196 bayesian_network = BayesianNetwork.from_structure(data, structure)
197 type_to_network[type_] = bayesian_network
198 return BayesianNetworkModel(type_to_network, fields, segmenter=input_data.segmenter)
pomegranate/BayesianNetwork.pyx in pomegranate.BayesianNetwork.BayesianNetwork.from_structure()
TypeError: unsupported operand type(s) for +: 'frozenset' and 'tuple'
Seems to be coming from the following line in pomegranate's BayesianNetwork.pyx -- line 820
nodes[i] = ConditionalProbabilityTable.from_samples(X[:,parents+(i,)],
parents=[nodes[parent] for parent in parents],
weights=weights, pseudocount=pseudocount)
where i and parents come directly from Doppelganger's configuration.person_structure:
for i, parents in enumerate(structure):
Has anyone else come across this one? I've tried upgrading pomegranate to 0.7.7 and installing from source (ignoring pip's cache), and I still get the same issue.
Setup: OSX Yosemite, pomegranate 0.7.1
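pomegranate concatenates each entry of structure with a tuple (parents + (i,)), so every entry must itself be a tuple. If the structure loaded from configuration yields frozensets, converting them first avoids the TypeError; the structure_from_config value below is hypothetical:

```python
# Hypothetical structure as it might come out of configuration loading, with
# parent sets as frozensets: `frozenset + tuple` raises TypeError.
structure_from_config = [frozenset(), frozenset([0]), frozenset([0, 1])]

# Convert each parent set to a sorted tuple before passing the structure to
# BayesianNetwork.from_structure(data, structure).
structure = tuple(tuple(sorted(parents)) for parents in structure_from_config)
```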
In dataframes, columns corresponding to state and puma codes should preserve their leading zeros; e.g. 00106 becomes 106 when pandas automatically infers numeric types.
As encountered with doppelganger_example_simple.ipynb, this can lead to wrong filter results in dataframes, e.g. in lines 31-34 of doppelganger/datasource.py:
cleaned_data = cleaned_data[
(cleaned_data[inputs.STATE.name].astype(str) == str(state)) &
(cleaned_data[inputs.PUMA.name].astype(str) == str(puma))
]
the left-hand side can become 106 (due to pandas' automatic dtype inference) whereas the right-hand side (passed by the user) is '00106'.
As remarked in #59, there should be a general strategy to control the columns dtypes.
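Until such a strategy exists, one workaround is to force string dtypes at read time; a small sketch showing the difference:

```python
import io
import pandas as pd

raw = "state,puma\n06,00106\n"

# Default inference turns the codes into integers, dropping leading zeros.
inferred = pd.read_csv(io.StringIO(raw))

# dtype=str keeps the codes exactly as they appear in the file.
preserved = pd.read_csv(io.StringIO(raw), dtype=str)
```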
Hi,
I'm using the sample data and trying to run the example provided in doppelganger_example_full.ipynb. However, I get this error and cannot figure out what the problem is. Can you please help me with it?
allocator = HouseholdAllocator.from_cleaned_data(controls, households_data, persons_data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/doppelganger/allocation.py", line 77, in from_cleaned_data
households_data.data, persons_data.data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/doppelganger/allocation.py", line 212, in _format_data
._str_broadcast(inputs.AGE.name, list(inputs.AGE.possible_values))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 4385, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 645, in _set_axis
self._data.set_axis(axis, labels)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/internals.py", line 3323, in set_axis
'values have {new} elements'.format(old=old_len, new=new_len))
ValueError: Length mismatch: Expected axis has 0 elements, new values have 4 elements
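One way to reproduce that exact ValueError is to assign four column names to a DataFrame that ended up with zero columns, which suggests some upstream step (for example, a state/PUMA filter that matched no rows) produced an empty frame. A minimal sketch:

```python
import pandas as pd

# An empty DataFrame has a 0-element column axis; assigning 4 labels
# raises the same ValueError seen in the traceback above.
df = pd.DataFrame()
try:
    df.columns = ["age", "sex", "income", "num_people"]
except ValueError as err:
    print(err)  # Length mismatch: Expected axis has 0 elements, ...
```

If that is the cause here, checking that controls, households_data, and persons_data are non-empty before calling from_cleaned_data should narrow it down.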
nosetests do not pass when using Python 3.6.3.
Hi everyone,
We are currently using doppelganger on our own regional data. The example works with our data, and the generated household table is exactly what we need. The only problem is that we need discrete numbers for some of the categories, in our case household_income and num_people (some of the values are categorical, but we need specific numbers).
We downloaded the most recent version of doppelganger and have been using it via Jupyter Notebook. The full doppelganger example mentions editing inputs.py to adjust the output variables, but after modifying inputs.py and rerunning the example, the outputs do not change at all. Are we supposed to modify the inputs.py file within our download, or is there another inputs.py that we should be working with?
To clarify, our doppelganger checkout is at 'C:\Users\someUser\doppelgangerCU', and we've been modifying 'C:\Users\someUser\doppelgangerCU\doppelganger\inputs.py'.
We'd appreciate any help, thanks!
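A likely explanation (an assumption, since it depends on how doppelganger was installed): the notebook is importing a pip-installed copy from site-packages rather than the edited checkout. Printing doppelganger.inputs.__file__ in the notebook reveals which file was actually loaded. The sketch below demonstrates the underlying rule with a throwaway module, since import order along sys.path decides which copy wins:

```python
import os
import sys
import tempfile

# Create a throwaway "checkout" containing a module, and put it at the
# front of sys.path so it shadows any installed copy of the same name.
checkout = tempfile.mkdtemp()
with open(os.path.join(checkout, "inputs_demo.py"), "w") as f:
    f.write("VALUE = 'checkout copy'\n")

sys.path.insert(0, checkout)
import inputs_demo

# __file__ reveals which copy Python actually loaded; the same check
# (print(doppelganger.inputs.__file__)) works in the notebook.
print(inputs_demo.__file__)
print(inputs_demo.VALUE)
```

If __file__ points into site-packages, either pip-install the checkout in editable mode (pip install -e .) or put the checkout directory ahead of site-packages on sys.path.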
Traceback (most recent call last):
File "doppelganger/scripts/download_allocate_generate.py", line 332, in <module>
main()
File "doppelganger/scripts/download_allocate_generate.py", line 317, in main
person_segmenter, household_segmenter
File "doppelganger/scripts/download_allocate_generate.py", line 181, in create_bayes_net
household_model.write(household_model_filename)
File "/Users/six/code/doppelganger/doppelganger/bayesnets.py", line 92, in write
json_string = self.to_json()
File "/Users/six/code/doppelganger/doppelganger/bayesnets.py", line 100, in to_json
return json.dumps(blob, indent=4, sort_keys=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 250, in dumps
sort_keys=sort_keys, **kw).encode(obj)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 209, in encode
chunks = list(chunks)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 434, in _iterencode
for chunk in _iterencode_dict(o, _current_indent_level):
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 408, in _iterencode_dict
for chunk in chunks:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/encoder.py", line 361, in _iterencode_dict
items = sorted(dct.items(), key=lambda kv: kv[0])
File "/Library/Python/2.7/site-packages/future-0.16.0-py2.7.egg/future/types/newstr.py", line 316, in __gt__
raise TypeError(self.unorderable_err.format(type(other)))
TypeError: unorderable types: str() and <type 'NoneType'>