UrbanSim implementation for the San Francisco Bay Area
This query has a very costly query plan:
mtc=> EXPLAIN SELECT * into stacked FROM parcels
mtc-> where geom in (select geom from parcels
mtc(> group by geom having count(*) > 1);
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=1845244.87..369729649373.73 rows=2527 width=510)
   Join Filter: (parcels.geom = parcels_1.geom)
   ->  Seq Scan on parcels  (cost=0.00..1196542.38 rows=2526738 width=510)
   ->  Materialize  (cost=1845244.87..2042251.70 rows=2526738 width=327)
         ->  GroupAggregate  (cost=1845244.87..1895779.63 rows=2526738 width=327)
               Filter: (count(*) > 1)
               ->  Sort  (cost=1845244.87..1851561.71 rows=2526738 width=327)
                     Sort Key: parcels_1.geom
                     ->  Seq Scan on parcels parcels_1  (cost=0.00..1196542.38 rows=2526738 width=327)
(9 rows)
We can probably use filters and PostGIS ST_Equals here, for example like this:
http://stackoverflow.com/questions/18769250/finding-multiple-duplicates-in-postgres
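As a starting point, something along these lines might work (an untested sketch; note that PARTITION BY geom relies on the same geometry-equality semantics as the original GROUP BY, whereas ST_Equals tests topological equality and would need an explicit self-join):

```sql
-- Untested sketch: count duplicates with a window function, so parcels
-- is scanned once instead of being semi-joined against a 2.5M-row
-- aggregate of itself. (stacked will carry the extra n_dups column.)
SELECT * INTO stacked
FROM (
    SELECT p.*, count(*) OVER (PARTITION BY geom) AS n_dups
    FROM parcels p
) sub
WHERE n_dups > 1;
```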
I think UDST/urbansim PR #172 creates a wrinkle in UDST/bayarea_urbansim that needs to be ironed out.
Previously, UDST/bayarea_urbansim ran fine for me, using the return-on-cost
branch of UDST/urbansim. PR #172 moved the return-on-cost functionality into the master branch (I think), but now the UDST/bayarea_urbansim alt_feasibility()
step crashes with a KeyError on 'max_far_from_dua'.
https://github.com/UDST/bayarea_urbansim/blob/master/baus/models.py#L221-L225
https://github.com/UDST/bayarea_urbansim/blob/master/configs/settings.yaml#L301-L322
I'm not that familiar with the feasibility code, but am happy to dig into it if I'm the only one affected by this. Probably we just need to change how the options are being passed into UrbanSim.
@fscottfoti, does anything jump out here?
Many thanks!
@janowicz: So, I think I may be getting an idea of devtypes and devids and where they all get set up in spandex. And I think I found "the problem" at:
https://github.com/synthicity/bayarea_urbansim/blob/master/data_regeneration/match_aggregate.py#L780-L781
The crosswalk coding here deviates from the codes I set up. I think this is wrong unless it's been done consistently this other way somewhere else. But Fletcher did mention seeing a lot of transport parcels (which I think are supposed to be vacant ones). So the correct lookup to match the codes I specified is (the others look right):
VA 21
PG 22
PL 23
TR 24
LD 25
I'm a little scared of trying to change anything because spandex still confuses me a bit. Thanks!
Running model 'feasibility'
Describe of the yearly rent by use
              retail      industrial          office     residential
count 1513115.000000  1513115.000000  1513115.000000  1513114.000000
mean       23.634673        9.438128       23.800669       17.041963
std         6.210531        3.306533        5.192954        7.530383
min         0.000000        0.000000        0.000000        0.000000
25%        22.386634       10.095661       24.424356       10.611475
50%        24.544132       10.520760       24.912241       16.273518
75%        26.816578       10.879598       25.315971       22.032869
max        34.986435       17.485878       42.051506      109.430688
/home/aksel/env/lib/python2.7/site-packages/pandas/util/decorators.py:53: FutureWarning: cols is deprecated, use subset instead
warnings.warn(msg, FutureWarning)
/home/aksel/env/lib/python2.7/site-packages/pandas/core/frame.py:1706: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
Computing feasibility for form mixedoffice
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-6c4a62baad2c> in <module>()
32 "travel_model_output", # create the output for the travel model
33 "clear_cache" # clear the cache each year
---> 34 ], years=range(in_year, out_year))
35 print "Finished", time.ctime()
/home/aksel/anaconda/lib/python2.7/site-packages/urbansim/sim/simulation.pyc in run(models, years, data_out, out_interval)
1321 model = get_model(model_name)
1322 t2 = time.time()
-> 1323 model()
1324 print("Time to execute model '{}': {:.2f}s".format(
1325 model_name, time.time()-t2))
/home/aksel/anaconda/lib/python2.7/site-packages/urbansim/sim/simulation.pyc in __call__(self)
625 with log_start_finish('calling model {!r}'.format(self.name), logger):
626 kwargs = _collect_injectables(self._arg_list)
--> 627 return self._func(**kwargs)
628
629 def _tables_used(self):
/home/aksel/env/bayarea_urbansim/models.pyc in feasibility(parcels)
167 pass_through=["oldest_building", "total_sqft",
168 "max_far", "max_dua", "land_cost",
--> 169 "residential"])
170
171
/home/aksel/env/bayarea_urbansim/utils.pyc in run_feasibility(parcels, parcel_price_callback, parcel_use_allowed_callback, residential_to_yearly, historic_preservation, config, pass_through)
299 print "Computing feasibility for form %s" % form
300 d[form] = pf.lookup(form, df[parcel_use_allowed_callback(form)],
--> 301 pass_through=pass_through)
302 if residential_to_yearly and "residential" in pass_through:
303 d[form]["residential"] /= pf.config.cap_rate
TypeError: lookup() got an unexpected keyword argument 'pass_through'
It might be easier to debug data regeneration tasks if we were to separate them by their dependencies.
It seems that there are loading tasks, SQL tasks, and then imputation/simulation tasks. Respectively, the dependencies for these are GDAL, PostGIS, and urbansim. I am still unclear on whether the urbansim tasks require PostGIS or GDAL operations. If they do not, it might make debugging easier to break the GDAL/PostGIS work out from the urbansim data work.
Logging this one @fscottfoti: per the diagnostic outputs from simexplorer for small-area price developments over time, many areas exhibit spikes in some years (always peaking at 2000), but not all years; typically the spike is a single year of abnormally high prices before heading back down to more "normal" levels.
@tombuckley @waddell @mkreilly Starting a new location to talk about UAL now
Data regeneration runs on urbansim 1.3, while simulation requires the edge version of urbansim.
After running a new estimation, it would be nice to be able to compare what the configs looked like more readily. An example of this comparison/view can be found in the first changed file in this commit:
BayAreaMetro@afb7999#diff-e3d467e6cf0fd6da94204192e1bebb4b
Where do I look to see how the configs are output?
The accessory_units model step reads from this csv and slices it using the simulation year (iter_var) as the column index. Since the csv has been hardcoded with field names from 2010 to 2040 by fives, the accessory_units step fails if a different EVERY_NTH_YEAR param is set in baus.py, or if different IN_YEAR or OUT_YEAR params are specified.
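One way to make the step robust to off-grid years would be to fall back to the nearest hardcoded year instead of letting the column lookup raise a KeyError. A sketch, with a hypothetical stand-in table (the juris column and values are made up, not the real csv):

```python
import io
import pandas as pd

# Hypothetical stand-in for the hardcoded accessory-units csv
# (in the real file, columns are simulation years 2010-2040 by fives).
csv = io.StringIO(
    "juris,2010,2015,2020,2025,2030,2035,2040\n"
    "a,10,12,14,16,18,20,22\n"
    "b,5,6,7,8,9,10,11\n"
)
df = pd.read_csv(csv, index_col="juris")
df.columns = df.columns.astype(int)

def units_for_year(df, year):
    """Slice by simulation year, falling back to the nearest hardcoded
    year instead of raising a KeyError for off-grid years."""
    if year in df.columns:
        return df[year]
    nearest = min(df.columns, key=lambda c: abs(c - year))
    return df[nearest]

units_for_year(df, 2012)  # falls back to the 2010 column
```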
Use a requirements file to tie the current version of bayarea_urbansim to the right versions of urbansim and urbansim_defaults (and maybe pandana).
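For instance, a requirements.txt at the repo root might look like the following (the pins below are placeholders, not verified working combinations; the actual versions or commits would need to be agreed on):

```
# Placeholder pins -- replace with known-good versions.
urbansim==1.3
urbansim_defaults==0.2
pandana==0.1

# Or pin to exact commits:
# -e git+https://github.com/UDST/urbansim.git@<known-good-sha>#egg=urbansim
```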
I get a serialization error on travel_model_output, running with year 2010, on Anaconda on Windows 7 64-bit.
print "Started", time.ctime()
in_year, out_year = 2010, 2030
# one time for base year indicators
sim.run([
"diagnostic_output", # create diagnostic indicators
"travel_model_output", # create the output for the travel model
], years=[in_year])
print "Finished", time.ctime()
Started Mon Oct 06 21:32:35 2014
Running year 2010
Running model 'diagnostic_output'
Filling column building_type_id with value 1.0 (0 values)
Filling column residential_units with value 0 (0 values)
Filling column year_built with value 1964.0 (233714 values)
Filling column non_residential_sqft with value 0 (16531 values)
Time to execute model 'diagnostic_output': 22.78s
Running model 'travel_model_output'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-a18bfc61e2d4> in <module>()
7 "diagnostic_output", # create diagnostic indicators
8 "travel_model_output", # create the output for the travel model
----> 9 ], years=[in_year])
10
11 print "Finished", time.ctime()
C:\Anaconda\lib\site-packages\urbansim-1.2dev-py2.7.egg\urbansim\sim\simulation.pyc in run(models, years, data_out, out_interval)
1321 model = get_model(model_name)
1322 t2 = time.time()
-> 1323 model()
1324 print("Time to execute model '{}': {:.2f}s".format(
1325 model_name, time.time()-t2))
C:\Anaconda\lib\site-packages\urbansim-1.2dev-py2.7.egg\urbansim\sim\simulation.pyc in __call__(self)
625 with log_start_finish('calling model {!r}'.format(self.name), logger):
626 kwargs = _collect_injectables(self._arg_list)
--> 627 return self._func(**kwargs)
628
629 def _tables_used(self):
C:\cygwin64\home\aolsen\projects\bayarea_urbansim\models.pyc in travel_model_output(households, jobs, buildings, zones, year)
314 utils.add_simulation_output(zones, "travel_model_outputs", year)
315 utils.write_simulation_output(os.path.join(misc.runs_dir(),
--> 316 "run{}_simulation_output.json"))
317 utils.write_parcel_output(os.path.join(misc.runs_dir(),
318 "run{}_parcel_output.csv"))
C:\cygwin64\home\aolsen\projects\bayarea_urbansim\utils.pyc in write_simulation_output(outname)
475 outname = outname.format(sim.get_injectable("run_number"))
476 outf = open(outname, "w")
--> 477 json.dump(d, outf)
478 outf.close()
479
C:\Anaconda\lib\json\__init__.pyc in dump(obj, fp, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, encoding, default, sort_keys, **kw)
187 # could accelerate with writelines in some versions of Python, at
188 # a debuggability cost
--> 189 for chunk in iterable:
190 fp.write(chunk)
191
C:\Anaconda\lib\json\encoder.pyc in _iterencode(o, _current_indent_level)
432 yield chunk
433 elif isinstance(o, dict):
--> 434 for chunk in _iterencode_dict(o, _current_indent_level):
435 yield chunk
436 else:
C:\Anaconda\lib\json\encoder.pyc in _iterencode_dict(dct, _current_indent_level)
406 else:
407 chunks = _iterencode(value, _current_indent_level)
--> 408 for chunk in chunks:
409 yield chunk
410 if newline_indent is not None:
C:\Anaconda\lib\json\encoder.pyc in _iterencode_list(lst, _current_indent_level)
330 else:
331 chunks = _iterencode(value, _current_indent_level)
--> 332 for chunk in chunks:
333 yield chunk
334 if newline_indent is not None:
C:\Anaconda\lib\json\encoder.pyc in _iterencode(o, _current_indent_level)
440 raise ValueError("Circular reference detected")
441 markers[markerid] = o
--> 442 o = _default(o)
443 for chunk in _iterencode(o, _current_indent_level):
444 yield chunk
C:\Anaconda\lib\json\encoder.pyc in default(self, o)
182
183 """
--> 184 raise TypeError(repr(o) + " is not JSON serializable")
185
186 def encode(self, o):
TypeError: 1442 is not JSON serializable
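The failing value here is most likely a numpy integer (the zone tables are built with pandas), which the stdlib json encoder rejects even though it prints like a plain int. A minimal reproduction and one possible fix, assuming that is the cause:

```python
import json
import numpy as np

# Likely culprit: a numpy scalar where json expects a Python int.
d = {"zone_total": np.int64(1442)}

try:
    json.dumps(d)
except TypeError:
    pass  # "1442 is not JSON serializable"

def np_default(o):
    """Unwrap numpy scalars so json can serialize them."""
    if isinstance(o, np.integer):
        return int(o)
    if isinstance(o, np.floating):
        return float(o)
    raise TypeError(repr(o) + " is not JSON serializable")

json.dumps(d, default=np_default)  # '{"zone_total": 1442}'
```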
Running Estimation.ipynb generates NA's in the configs. For example, here in configs/elcm.yaml:
I mentioned this on Slack, but wanted to write it up here as well.
Something in the last few months has caused HLCM estimation to slow down tremendously. When I re-execute the Estimation.ipynb notebook, the "hlcm_estimate" model that used to complete in 1 minute now takes several hours to converge.
I haven't been able to isolate what's causing this, but Fletcher confirms it's not working for him either.
Before I get started it's worth referring to the next generation developer model, which will solve many of these issues - UDST/urbansim#112. In other words, there's definitely a path to solve this the "right" way but I want to make sure we're aware of the current state of things and any near-term fixes that can be made.
The problem exists on the residential and commercial side in slightly different ways. First, the residential.
The issue is how to select a building_type from a land use. So we know the use is residential, we have an average price/sqft for residential and have density limits either through height, FAR, or DUA. So we get a "residential" building out at a certain density, considering the zoning, the costs of building at different heights, etc. The easiest way to map the land use to a building type - what we do now - is just to map based on density. So single family is a DUA < 12, townhomes are 12 < DUA < 24, and multi-family is DUA > 24. What I've noticed is that the current zoning I'm looking at doesn't have a lot of DUAs less than 12, so we're not getting a lot of HS out of the developer model. Anyway, this is working as designed, but the design might need to be improved. But for starters I'm just surprised there isn't more restrictive zoning even in very suburban areas.
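The density-based mapping described above can be sketched as follows (the return labels are illustrative, not the model's actual building_type codes; the thresholds are the ones stated in the text):

```python
def building_type_from_density(dua):
    """Map residential density (dwelling units per acre) to a building
    type, per the thresholds described above: single family is DUA < 12,
    townhomes are 12 <= DUA < 24, multi-family is DUA >= 24."""
    if dua < 12:
        return "single_family"
    elif dua < 24:
        return "townhome"
    else:
        return "multi_family"

building_type_from_density(8)   # "single_family"
building_type_from_density(30)  # "multi_family"
```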
On the commercial side, the problem is even more onerous. We get control totals by sector, we have sqft by building (which we respect), and we choose commercial development based on the prices for each land use, respecting the sqft per building type; but we don't really have a mechanism to turn the employment control totals into demand for sqft of a given land use. Do we really want to capture substitution of building types by a sector? Or would we prefer to map each sector to a building type distribution (since there are only 6 high-level sectors), and thus get control totals for new commercial development by land use (the same way we have on the residential side), and then pick from profitable developments to meet that amount of development? Until then, we get mostly office development, because office gets the highest rents, and we then put sectors in office buildings that probably shouldn't be there. This is also the reason why we don't get retail at appropriate rates with the appropriate distributions. Of course, retail location choice is a HUGE problem even on its own, and we can leave that for another day.
I added a starter placeholder to create a model to do the travel model export:
https://github.com/synthicity/bayarea_urbansim/blob/master/models.py#L240
This should be in your models.py. I edit that code in pycharm/sublime and then execute it from within the notebook. I added a new notebook called Travel Model Output which you can use to execute the model and look at the table that gets written afterwards.
Within the model I call households.to_frame() and zones.to_frame() so everything that happens after that is straight Python. We should go through the variables one-by-one and add them to the dataframe - ask lots of questions as you do it and I can answer them. Responding on github might be a good way to keep all the information together in one place.
For starters, I'm using the sanfran.h5 data (San Francisco data only) so there are only households in 189 or so zones. One thing to realize right away is that Pandas keeps nan values for all the zones that it doesn't have households in. I imagine we want to fillna(0) for those cells. The easiest way to do that is probably zones.HHINC1 = zones.HHINC1.fillna(0)
but there are other ways too.
Is there a nice way to estimate a model using a random subset of data? I thought this might be a way around issue #65, but can't figure out how to do it.
I tried adding expressions to the yaml choosers_fit_filters, like np.random.random() < 0.05 or np.mod(unit_id, 100) < 5. But these cause errors. And I can't seem to do it in the @sim.model expression because the table needs to be passed as a DataFrameWrapper.
Thanks for any advice!
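Since the fit filters are evaluated as column expressions, one workaround (a sketch outside the urbansim API, not a confirmed feature) is to attach a stable random column to the table and filter on that, or simply to sample the DataFrame before registering it:

```python
import numpy as np
import pandas as pd

# Hypothetical chooser table standing in for the real households/units data.
df = pd.DataFrame({"unit_id": np.arange(1000), "x": np.random.randn(1000)})

# Option 1: sample the rows up front, before handing the table to urbansim.
subset = df.sample(frac=0.05, random_state=0)

# Option 2: attach a stable random column once; a filter like
# "_rand < 0.05" is then an ordinary column expression, unlike
# np.random.random(), which has no column to evaluate against.
df["_rand"] = np.random.RandomState(0).rand(len(df))
subset2 = df[df["_rand"] < 0.05]
```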
There is an open-ended question about how to version outputs. Currently, we version outputs by default using the ipython notebook output cells. We assume that whatever notebook is more recent is the primary one and we delete the previous one.
However, comparing BLOBs of outputs in JSON is difficult, so it may not be possible to meaningfully compare outputs across notebooks. If we can't meaningfully compare outputs across notebooks, then why would we version the notebook with the outputs in it?
Is the goal to version the output cells, or the input cells? Or both?
Here are a few links to discussion on how other people have thought about versioning for ipython notebooks.
https://ipython.org/ipython-doc/stable/interactive/tips.html#lightweight-version-control
@jiffyclub I'm hoping you can review what I'm currently doing in the BayArea implementation of UrbanSim and
Anyway, I'm definitely happy with models.py, variables.py and assumptions.py - the real questionable code is in
https://github.com/synthicity/bayarea_urbansim/blob/master/utils.py
and
https://github.com/synthicity/bayarea_urbansim/blob/master/models.py
One point is on relocating agents. I've actually gone back to using -1 to mark the relocating agents. np.nan coerces the building_id into a float64 dtype, which wreaks havoc on performance with all the joins by building_id to other tables (this was a fun bug to track down). I know I could just record the indexes of the moving agents and carry that from the relocation model to the location model, but I use the -1's to mark the moving agents just to make sure I'm updating the data correctly. Otherwise I think it would be much easier to hide the fact that I'm doing the update wrong (which has happened a couple of times), since all agents would always have a valid building_id, and in lieu of unit testing this I think the -1's and logging is the best option to convince me it's working. Something to think about though.
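The dtype coercion is easy to demonstrate:

```python
import numpy as np
import pandas as pd

# np.nan only fits in a float column, so using it to mark movers
# silently turns building_id into float64...
ids_nan = pd.Series([10, np.nan, 12], name="building_id")
# ids_nan.dtype -> float64

# ...while a -1 sentinel keeps the integer dtype, so joins on
# building_id stay fast and exact.
ids_sentinel = pd.Series([10, -1, 12], name="building_id")
# ids_sentinel.dtype -> int64
movers = ids_sentinel == -1  # still easy to audit which rows moved
```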
And one other question is on adding records - like in the developer model and the transition models - is just calling add_table at the end the canonical way to update a dataframe or does this have consequences?
Is there a mismatch between data regeneration and what nrh_estimate, as run in Estimation.py (or Estimation.ipynb), expects? I get the following error after running data regeneration and then Estimation.py.
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.87e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Time to execute model 'rsh_estimate': 11.39s
Total time to execute: 11.39s
Running model 'nrh_estimate'
Traceback (most recent call last):
File "Estimation.py", line 29, in <module>
sim.run(["nrh_estimate"])
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 1526, in run
model()
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 680, in __call__
expressions=self._argspec.defaults)
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 827, in _collect_variables
variables[label] = thing()
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 419, in __call__
return self._call_func()
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 404, in _call_func
frame = self._func(**kwargs)
File "/vm_project_dir/bayarea_urbansim/datasources.py", line 61, in costar
df = store['costar']
File "/home/vagrant/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 410, in __getitem__
return self.get(key)
File "/home/vagrant/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 619, in get
raise KeyError('No object named %s in the file' % key)
KeyError: 'No object named costar in the file'
Closing remaining open files:./data/osm_bayarea4326.h5...done./data/bayarea_v3.h5...done
@fscottfoti @mkreilly @janowicz, we've talked a lot about this feature, so it makes sense to put some notes here.
@fscottfoti's current thinking is that hashes of the parcels' centroids should be unique, and that because of this, the user will always be able to say whether a parcel from a given run is identical to a parcel from another run.
But I'm still a bit unclear on how to write the story for this feature, so I don't know how we will say when it is complete.
I think the story is: as a person modeling the state of parcels over time, I would like to be able to say whether any given parcel that I am describing is identical to another parcel, so that I can improve the quality of the predictions that I am making about the state of that (all?) parcel(s).
It seems that one issue was that a user would assign attributes to a parcel at some point in the modeling process, and would later try to apply those attributes to another set of parcels and be unable to do so, because the parcel table had changed and therefore the unique identifiers changed. @mkreilly, could you clarify what percentage difference or similarity would be acceptable when joining parcels across tables? That might help us define what successful completion of this story looks like.
One previous attempt at keeping a parcel's ID the same was to keep an ID column on the table that had a unique name which was generated in some early process, and then just make sure that that ID column remained on the table in all cases where parcels were used in the modeling process.
Another approach is to use the hash of the geometry column. However, when we compared the geom_ids from @janowicz's (Windows 7) laptop to those generated by the MTC Windows Server 2012 machine, only 1/3 of the parcels were exactly identical. On the other hand, across Linux machines built in exactly the same way, more than 95% of the geometry IDs are identical.
Other ideas for keeping a parcel's ID the same include using a geohash or similar.
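One direction (an untested sketch; the precision and hash choice here are arbitrary) is to hash centroid coordinates only after rounding, so tiny cross-platform floating-point differences, like the Windows vs. Linux mismatches above, can't change the ID:

```python
import hashlib

def parcel_geom_id(centroid_x, centroid_y, precision=6):
    """Hash a parcel centroid after rounding to `precision` decimal
    places, so sub-rounding floating-point differences across
    platforms map to the same ID."""
    key = "{:.{p}f},{:.{p}f}".format(centroid_x, centroid_y, p=precision)
    return hashlib.md5(key.encode()).hexdigest()

# Two centroids differing only past the 6th decimal get the same ID:
a = parcel_geom_id(-122.271230001, 37.804400001)
b = parcel_geom_id(-122.271230002, 37.804400002)
assert a == b
```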
There might be 2 notions of time that are relevant: parcel time and database time. For example, let's assume that parcel A has an attribute something=1 at time-1 in the parcel table. If we discover, at time-2, that we were incorrect, and that at time-1 parcel A in fact had something=2, do we revise the time-1 parcel table? Or do we only revise the time-2 table? This could be more complicated if something is actually the geometry of the parcel, or if the parcel splits.
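The two notions of time can be made concrete with a tiny bitemporal log (a sketch, not anything in the current codebase): corrections are appended at a new database time rather than rewriting the old table, so both the original belief and the corrected one stay queryable:

```python
# Each fact carries both the time it was true of the parcel (parcel
# time) and the time we recorded it (database time).
history = []

def record(parcel, attr, value, parcel_time, db_time):
    history.append(
        {"parcel": parcel, "attr": attr, "value": value,
         "parcel_time": parcel_time, "db_time": db_time}
    )

def as_of(parcel, attr, parcel_time, db_time):
    """What did we believe, as of db_time, the value was at parcel_time?"""
    matches = [r for r in history
               if r["parcel"] == parcel and r["attr"] == attr
               and r["parcel_time"] == parcel_time
               and r["db_time"] <= db_time]
    return matches[-1]["value"] if matches else None

record("A", "something", 1, parcel_time=1, db_time=1)
record("A", "something", 2, parcel_time=1, db_time=2)  # correction at time-2

as_of("A", "something", parcel_time=1, db_time=1)  # -> 1 (what we believed then)
as_of("A", "something", parcel_time=1, db_time=2)  # -> 2 (the corrected view)
```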
In order to successfully get from source data to outputs, this repository has at least 2 dependencies (GDAL/PostGIS, Anaconda) that are installed at a lower level than a base python installation. Thus far we have been using shell scripts to keep track of these dependencies, for example:
https://github.com/MetropolitanTransportationCommission/bayarea_urbansim_setup/tree/vagrant-ubuntu14
On Windows, however, we do not have a straightforward way of debugging these environment/operating-system requirements and configuration. We have discussed keeping track of dependencies in a simple text file.
How would this help us for debugging problems at the OS/environment level? For example, if a windows user needs to completely reinstall Anaconda, what do they need to do to re-configure gdal for Anaconda?