UrbanSim implementation for the San Francisco Bay Area
This query has a very costly query plan:
mtc=> EXPLAIN SELECT * into stacked FROM parcels
mtc-> where geom in (select geom from parcels
mtc(> group by geom having count(*) > 1);
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=1845244.87..369729649373.73 rows=2527 width=510)
   Join Filter: (parcels.geom = parcels_1.geom)
   ->  Seq Scan on parcels  (cost=0.00..1196542.38 rows=2526738 width=510)
   ->  Materialize  (cost=1845244.87..2042251.70 rows=2526738 width=327)
         ->  GroupAggregate  (cost=1845244.87..1895779.63 rows=2526738 width=327)
               Filter: (count(*) > 1)
               ->  Sort  (cost=1845244.87..1851561.71 rows=2526738 width=327)
                     Sort Key: parcels_1.geom
                     ->  Seq Scan on parcels parcels_1  (cost=0.00..1196542.38 rows=2526738 width=327)
(9 rows)
We can probably use filters and PostGIS ST_Equals here, for example like this:
http://stackoverflow.com/questions/18769250/finding-multiple-duplicates-in-postgres
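As a starting point, something along these lines might work (an untested sketch; note that PARTITION BY geom relies on the same geometry-equality semantics as the original GROUP BY, whereas ST_Equals tests topological equality and would need an explicit self-join):

```sql
-- Untested sketch: count duplicates with a window function, so parcels
-- is scanned once instead of being semi-joined against a 2.5M-row
-- aggregate of itself. (stacked will carry the extra n_dups column.)
SELECT * INTO stacked
FROM (
    SELECT p.*, count(*) OVER (PARTITION BY geom) AS n_dups
    FROM parcels p
) sub
WHERE n_dups > 1;
```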
I think UDST/urbansim PR #172 creates a wrinkle in UDST/bayarea_urbansim that needs to be ironed out.
Previously, UDST/bayarea_urbansim ran fine for me, using the return-on-cost
branch of UDST/urbansim. PR #172 moved the return-on-cost functionality into the master branch (I think), but now the UDST/bayarea_urbansim alt_feasibility()
step crashes with a KeyError on 'max_far_from_dua'.
https://github.com/UDST/bayarea_urbansim/blob/master/baus/models.py#L221-L225
https://github.com/UDST/bayarea_urbansim/blob/master/configs/settings.yaml#L301-L322
I'm not that familiar with the feasibility code, but am happy to dig into it if I'm the only one affected by this. Probably we just need to change how the options are being passed into UrbanSim.
@fscottfoti, does anything jump out here?
Many thanks!
@janowicz: So, I think I may be getting an idea of devtypes and devids and where they all get set up in spandex. And I think I found "the problem" at:
https://github.com/synthicity/bayarea_urbansim/blob/master/data_regeneration/match_aggregate.py#L780-L781
The crosswalk coding here deviates from the codes I set up. I think this is wrong unless it's been done consistently this other way somewhere else. But Fletcher did mention seeing a lot of transport parcels (which I think are supposed to be vacant ones). So the correct lookup to match the codes I specified is (the others look right):
VA 21
PG 22
PL 23
TR 24
LD 25
I'm a little scared of trying to change anything because spandex still confuses me a bit. Thanks!
Running model 'feasibility'
Describe of the yearly rent by use
              retail      industrial          office     residential
count 1513115.000000  1513115.000000  1513115.000000  1513114.000000
mean       23.634673        9.438128       23.800669       17.041963
std         6.210531        3.306533        5.192954        7.530383
min         0.000000        0.000000        0.000000        0.000000
25%        22.386634       10.095661       24.424356       10.611475
50%        24.544132       10.520760       24.912241       16.273518
75%        26.816578       10.879598       25.315971       22.032869
max        34.986435       17.485878       42.051506      109.430688
/home/aksel/env/lib/python2.7/site-packages/pandas/util/decorators.py:53: FutureWarning: cols is deprecated, use subset instead
warnings.warn(msg, FutureWarning)
/home/aksel/env/lib/python2.7/site-packages/pandas/core/frame.py:1706: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
"DataFrame index.", UserWarning)
Computing feasibility for form mixedoffice
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-6c4a62baad2c> in <module>()
32 "travel_model_output", # create the output for the travel model
33 "clear_cache" # clear the cache each year
---> 34 ], years=range(in_year, out_year))
35 print "Finished", time.ctime()
/home/aksel/anaconda/lib/python2.7/site-packages/urbansim/sim/simulation.pyc in run(models, years, data_out, out_interval)
1321 model = get_model(model_name)
1322 t2 = time.time()
-> 1323 model()
1324 print("Time to execute model '{}': {:.2f}s".format(
1325 model_name, time.time()-t2))
/home/aksel/anaconda/lib/python2.7/site-packages/urbansim/sim/simulation.pyc in __call__(self)
625 with log_start_finish('calling model {!r}'.format(self.name), logger):
626 kwargs = _collect_injectables(self._arg_list)
--> 627 return self._func(**kwargs)
628
629 def _tables_used(self):
/home/aksel/env/bayarea_urbansim/models.pyc in feasibility(parcels)
167 pass_through=["oldest_building", "total_sqft",
168 "max_far", "max_dua", "land_cost",
--> 169 "residential"])
170
171
/home/aksel/env/bayarea_urbansim/utils.pyc in run_feasibility(parcels, parcel_price_callback, parcel_use_allowed_callback, residential_to_yearly, historic_preservation, config, pass_through)
299 print "Computing feasibility for form %s" % form
300 d[form] = pf.lookup(form, df[parcel_use_allowed_callback(form)],
--> 301 pass_through=pass_through)
302 if residential_to_yearly and "residential" in pass_through:
303 d[form]["residential"] /= pf.config.cap_rate
TypeError: lookup() got an unexpected keyword argument 'pass_through'
It might be easier to debug data regeneration tasks if we were to separate them by their dependencies.
It seems that there are loading tasks, SQL tasks, and then imputation/simulation tasks. Respectively, the dependencies for these are GDAL, PostGIS, and urbansim. I am still unclear on whether the urbansim tasks require PostGIS or GDAL operations. If they do not, it might make debugging easier to break the GDAL/PostGIS work out from the urbansim data work.
Logging this one @fscottfoti: per the diagnostic outputs from simexplorer for small-area price developments over time, many areas exhibit spikes in some years (always peaking at 2000), but not all years; typically the spike is a single year of abnormally high prices before heading back down to more "normal" levels.
@tombuckley @waddell @mkreilly Starting a new location to talk about UAL now
Data regeneration runs on urbansim 1.3, while simulation requires the edge version of urbansim.
After running a new estimation, it would be nice to be able to compare what the configs looked like more readily. An example of this comparison/view can be found in the first changed file in this commit:
BayAreaMetro@afb7999#diff-e3d467e6cf0fd6da94204192e1bebb4b
Where do I look to see how the configs are output?
The accessory_units model step reads from this csv and slices it using the simulation year (iter_var) as the column index. Since the csv has been hardcoded with field names from 2010 to 2040 by fives, the accessory_units step fails if a different EVERY_NTH_YEAR param is set in baus.py, or if different IN_YEAR or OUT_YEAR params are specified.
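One way to make the step robust to off-grid years would be to fall back to the nearest hardcoded year instead of letting the column lookup raise a KeyError. A sketch, with a hypothetical stand-in table (the juris column and values are made up, not the real csv):

```python
import io
import pandas as pd

# Hypothetical stand-in for the hardcoded accessory-units csv
# (in the real file, columns are simulation years 2010-2040 by fives).
csv = io.StringIO(
    "juris,2010,2015,2020,2025,2030,2035,2040\n"
    "a,10,12,14,16,18,20,22\n"
    "b,5,6,7,8,9,10,11\n"
)
df = pd.read_csv(csv, index_col="juris")
df.columns = df.columns.astype(int)

def units_for_year(df, year):
    """Slice by simulation year, falling back to the nearest hardcoded
    year instead of raising a KeyError for off-grid years."""
    if year in df.columns:
        return df[year]
    nearest = min(df.columns, key=lambda c: abs(c - year))
    return df[nearest]

units_for_year(df, 2012)  # falls back to the 2010 column
```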
Use a requirements file to tie the current version of bayarea_urbansim to the right versions of urbansim and urbansim_defaults (and maybe pandana).
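For instance, a requirements.txt at the repo root might look like the following (the pins below are placeholders, not verified working combinations; the actual versions or commits would need to be agreed on):

```
# Placeholder pins -- replace with known-good versions.
urbansim==1.3
urbansim_defaults==0.2
pandana==0.1

# Or pin to exact commits:
# -e git+https://github.com/UDST/urbansim.git@<known-good-sha>#egg=urbansim
```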
I get a serialization error on travel_model_output, running with year 2010, on Anaconda on Windows 7 64-bit.
print "Started", time.ctime()
in_year, out_year = 2010, 2030
# one time for base year indicators
sim.run([
"diagnostic_output", # create diagnostic indicators
"travel_model_output", # create the output for the travel model
], years=[in_year])
print "Finished", time.ctime()
Started Mon Oct 06 21:32:35 2014
Running year 2010
Running model 'diagnostic_output'
Filling column building_type_id with value 1.0 (0 values)
Filling column residential_units with value 0 (0 values)
Filling column year_built with value 1964.0 (233714 values)
Filling column non_residential_sqft with value 0 (16531 values)
Time to execute model 'diagnostic_output': 22.78s
Running model 'travel_model_output'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-a18bfc61e2d4> in <module>()
7 "diagnostic_output", # create diagnostic indicators
8 "travel_model_output", # create the output for the travel model
----> 9 ], years=[in_year])
10
11 print "Finished", time.ctime()
C:\Anaconda\lib\site-packages\urbansim-1.2dev-py2.7.egg\urbansim\sim\simulation.pyc in run(models, years, data_out, out_interval)
1321 model = get_model(model_name)
1322 t2 = time.time()
-> 1323 model()
1324 print("Time to execute model '{}': {:.2f}s".format(
1325 model_name, time.time()-t2))
C:\Anaconda\lib\site-packages\urbansim-1.2dev-py2.7.egg\urbansim\sim\simulation.pyc in __call__(self)
625 with log_start_finish('calling model {!r}'.format(self.name), logger):
626 kwargs = _collect_injectables(self._arg_list)
--> 627 return self._func(**kwargs)
628
629 def _tables_used(self):
C:\cygwin64\home\aolsen\projects\bayarea_urbansim\models.pyc in travel_model_output(households, jobs, buildings, zones, year)
314 utils.add_simulation_output(zones, "travel_model_outputs", year)
315 utils.write_simulation_output(os.path.join(misc.runs_dir(),
--> 316 "run{}_simulation_output.json"))
317 utils.write_parcel_output(os.path.join(misc.runs_dir(),
318 "run{}_parcel_output.csv"))
C:\cygwin64\home\aolsen\projects\bayarea_urbansim\utils.pyc in write_simulation_output(outname)
475 outname = outname.format(sim.get_injectable("run_number"))
476 outf = open(outname, "w")
--> 477 json.dump(d, outf)
478 outf.close()
479
C:\Anaconda\lib\json\__init__.pyc in dump(obj, fp, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, encoding, default, sort_keys, **kw)
187 # could accelerate with writelines in some versions of Python, at
188 # a debuggability cost
--> 189 for chunk in iterable:
190 fp.write(chunk)
191
C:\Anaconda\lib\json\encoder.pyc in _iterencode(o, _current_indent_level)
432 yield chunk
433 elif isinstance(o, dict):
--> 434 for chunk in _iterencode_dict(o, _current_indent_level):
435 yield chunk
436 else:
C:\Anaconda\lib\json\encoder.pyc in _iterencode_dict(dct, _current_indent_level)
406 else:
407 chunks = _iterencode(value, _current_indent_level)
--> 408 for chunk in chunks:
409 yield chunk
410 if newline_indent is not None:
C:\Anaconda\lib\json\encoder.pyc in _iterencode_list(lst, _current_indent_level)
330 else:
331 chunks = _iterencode(value, _current_indent_level)
--> 332 for chunk in chunks:
333 yield chunk
334 if newline_indent is not None:
C:\Anaconda\lib\json\encoder.pyc in _iterencode(o, _current_indent_level)
440 raise ValueError("Circular reference detected")
441 markers[markerid] = o
--> 442 o = _default(o)
443 for chunk in _iterencode(o, _current_indent_level):
444 yield chunk
C:\Anaconda\lib\json\encoder.pyc in default(self, o)
182
183 """
--> 184 raise TypeError(repr(o) + " is not JSON serializable")
185
186 def encode(self, o):
TypeError: 1442 is not JSON serializable
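The failing value here is most likely a numpy integer (the zone tables are built with pandas), which the stdlib json encoder rejects even though it prints like a plain int. A minimal reproduction and one possible fix, assuming that is the cause:

```python
import json
import numpy as np

# Likely culprit: a numpy scalar where json expects a Python int.
d = {"zone_total": np.int64(1442)}

try:
    json.dumps(d)
except TypeError:
    pass  # "1442 is not JSON serializable"

def np_default(o):
    """Unwrap numpy scalars so json can serialize them."""
    if isinstance(o, np.integer):
        return int(o)
    if isinstance(o, np.floating):
        return float(o)
    raise TypeError(repr(o) + " is not JSON serializable")

json.dumps(d, default=np_default)  # '{"zone_total": 1442}'
```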
Running Estimation.ipynb generates NA's in the configs. For example, here in configs/elcm.yaml:
I mentioned this on Slack, but wanted to write it up here as well.
Something in the last few months has caused HLCM estimation to slow down tremendously. When I re-execute the Estimation.ipynb notebook, the "hlcm_estimate" model that used to complete in 1 minute now takes several hours to converge.
I haven't been able to isolate what's causing this, but Fletcher confirms it's not working for him either.
Before I get started it's worth referring to the next generation developer model, which will solve many of these issues - UDST/urbansim#112. In other words, there's definitely a path to solve this the "right" way but I want to make sure we're aware of the current state of things and any near-term fixes that can be made.
The problem exists on the residential and commercial side in slightly different ways. First, the residential.
The issue is how to select a building_type from a land use. So we know the use is residential, we have an average price/sqft for residential and have density limits either through height, FAR, or DUA. So we get a "residential" building out at a certain density, considering the zoning, the costs of building at different heights, etc. The easiest way to map the land use to a building type - what we do now - is just to map based on density. So single family is a DUA < 12, townhomes are 12 < DUA < 24, and multi-family is DUA > 24. What I've noticed is that the current zoning I'm looking at doesn't have a lot of DUAs less than 12, so we're not getting a lot of HS out of the developer model. Anyway, this is working as designed, but the design might need to be improved. But for starters I'm just surprised there isn't more restrictive zoning even in very suburban areas.
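The density-based mapping described above can be sketched as follows (the return labels are illustrative, not the model's actual building_type codes; the thresholds are the ones stated in the text):

```python
def building_type_from_density(dua):
    """Map residential density (dwelling units per acre) to a building
    type, per the thresholds described above: single family is DUA < 12,
    townhomes are 12 <= DUA < 24, multi-family is DUA >= 24."""
    if dua < 12:
        return "single_family"
    elif dua < 24:
        return "townhome"
    else:
        return "multi_family"

building_type_from_density(8)   # "single_family"
building_type_from_density(30)  # "multi_family"
```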
On the commercial side, the problem is even more onerous. We get control totals by sector, we have sqft by building (which we respect), and we choose commercial development based on the prices for each land use, respecting the sqft per building type; but we don't really have a mechanism to turn the employment control totals into demand for sqft of a given land use. Do we really want to capture substitution of building types by a sector? Or would we prefer to map each sector to a building type distribution (since there are only 6 high-level sectors), and thus get control totals for new commercial development by land use (the same way we have on the residential side), and then pick from profitable developments to meet that amount of development? Until then, we get mostly office development, because office gets the highest rents, and we then put sectors in office buildings that probably shouldn't be there. This is also the reason why we don't get retail at appropriate rates with the appropriate distributions. Of course, retail location choice is a HUGE problem even on its own, and we can leave that for another day.
I added a starter placeholder to create a model to do the travel model export:
https://github.com/synthicity/bayarea_urbansim/blob/master/models.py#L240
This should be in your models.py. I edit that code in pycharm/sublime and then execute it from within the notebook. I added a new notebook called Travel Model Output which you can use to execute the model and look at the table that gets written afterwards.
Within the model I call households.to_frame() and zones.to_frame() so everything that happens after that is straight Python. We should go through the variables one-by-one and add them to the dataframe - ask lots of questions as you do it and I can answer them. Responding on github might be a good way to keep all the information together in one place.
For starters, I'm using the sanfran.h5 data (San Francisco data only) so there are only households in 189 or so zones. One thing to realize right away is that Pandas keeps nan values for all the zones that it doesn't have households in. I imagine we want to fillna(0) for those cells. The easiest way to do that is probably zones.HHINC1 = zones.HHINC1.fillna(0)
but there are other ways too.
Is there a nice way to estimate a model using a random subset of data? I thought this might be a way around issue #65, but can't figure out how to do it.
I tried adding expressions to the yaml choosers_fit_filters, like np.random.random() < 0.05 or np.mod(unit_id, 100) < 5. But these cause errors. And I can't seem to do it in the @sim.model expression because the table needs to be passed as a DataFrameWrapper.
Thanks for any advice!
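Since the fit filters are evaluated as column expressions, one workaround (a sketch outside the urbansim API, not a confirmed feature) is to attach a stable random column to the table and filter on that, or simply to sample the DataFrame before registering it:

```python
import numpy as np
import pandas as pd

# Hypothetical chooser table standing in for the real households/units data.
df = pd.DataFrame({"unit_id": np.arange(1000), "x": np.random.randn(1000)})

# Option 1: sample the rows up front, before handing the table to urbansim.
subset = df.sample(frac=0.05, random_state=0)

# Option 2: attach a stable random column once; a filter like
# "_rand < 0.05" is then an ordinary column expression, unlike
# np.random.random(), which has no column to evaluate against.
df["_rand"] = np.random.RandomState(0).rand(len(df))
subset2 = df[df["_rand"] < 0.05]
```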
There is an open-ended question about how to version outputs. Currently, we version outputs by default using the ipython notebook output cells. We assume that whatever notebook is more recent is the primary one and we delete the previous one.
However, comparing BLOBs of outputs in JSON is difficult, so it may not be possible to meaningfully compare outputs across notebooks. If we can't meaningfully compare outputs across notebooks, then why would we version the notebook with the outputs in it?
Is the goal to version the output cells, or the input cells? Or both?
Here are a few links to discussion on how other people have thought about versioning for ipython notebooks.
https://ipython.org/ipython-doc/stable/interactive/tips.html#lightweight-version-control
@jiffyclub I'm hoping you can review what I'm currently doing in the BayArea implementation of UrbanSim and
Anyway, I'm definitely happy with models.py, variables.py and assumptions.py - the real questionable code is in
https://github.com/synthicity/bayarea_urbansim/blob/master/utils.py
and
https://github.com/synthicity/bayarea_urbansim/blob/master/models.py
One point is on relocating agents. I've actually gone back to using -1 to mark the relocating agents. np.nan coerces the building_id into a float64 dtype, which wreaks havoc on performance with all the joins by building_id to other tables (this was a fun bug to track down). I know I could just record the indexes of the moving agents and carry that from the relocation model to the location model, but I use the -1's to mark the moving agents just to make sure I'm updating the data correctly. Otherwise I think it would be much easier to hide the fact that I'm doing the update wrong (which has happened a couple of times), since all agents would always have a valid building_id, and in lieu of unit testing this I think the -1's and logging is the best option to convince me it's working. Something to think about though.
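The dtype coercion is easy to demonstrate:

```python
import numpy as np
import pandas as pd

# np.nan only fits in a float column, so using it to mark movers
# silently turns building_id into float64...
ids_nan = pd.Series([10, np.nan, 12], name="building_id")
# ids_nan.dtype -> float64

# ...while a -1 sentinel keeps the integer dtype, so joins on
# building_id stay fast and exact.
ids_sentinel = pd.Series([10, -1, 12], name="building_id")
# ids_sentinel.dtype -> int64
movers = ids_sentinel == -1  # still easy to audit which rows moved
```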
And one other question is on adding records - like in the developer model and the transition models - is just calling add_table at the end the canonical way to update a dataframe or does this have consequences?
Is there a mismatch between data regeneration and what nrh_estimate, as run in Estimation.py (or Estimation.ipynb), expects? I get the following error after running data regeneration and then Estimation.py.
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.87e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Time to execute model 'rsh_estimate': 11.39s
Total time to execute: 11.39s
Running model 'nrh_estimate'
Traceback (most recent call last):
File "Estimation.py", line 29, in <module>
sim.run(["nrh_estimate"])
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 1526, in run
model()
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 680, in __call__
expressions=self._argspec.defaults)
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 827, in _collect_variables
variables[label] = thing()
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 419, in __call__
return self._call_func()
File "/vm_project_dir/urbansim/urbansim/sim/simulation.py", line 404, in _call_func
frame = self._func(**kwargs)
File "/vm_project_dir/bayarea_urbansim/datasources.py", line 61, in costar
df = store['costar']
File "/home/vagrant/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 410, in __getitem__
return self.get(key)
File "/home/vagrant/anaconda/lib/python2.7/site-packages/pandas/io/pytables.py", line 619, in get
raise KeyError('No object named %s in the file' % key)
KeyError: 'No object named costar in the file'
Closing remaining open files:./data/osm_bayarea4326.h5...done./data/bayarea_v3.h5...done
@fscottfoti @mkreilly @janowicz, we've talked a lot about this feature, so it makes sense to put some notes here.
@fscottfoti's current thinking is that hashes of the parcels' centroids should be unique, and that because of this, the user will always be able to say whether a parcel from a given run is identical to a parcel from another run.
But I'm still a bit unclear on how to write the story for this feature, so I don't know how we will say when it is complete.
I think the story is: as a person modeling the state of parcels over time, I would like to be able to say whether any given parcel that I am describing is identical to another parcel, so that I can improve the quality of the predictions that I am making about the state of that (all?) parcel(s).
It seems that one issue was that a user would assign attributes to a parcel at some point in the modeling process, and would later try to apply those attributes to another set of parcels and be unable to do so, because the parcel table had changed and therefore the unique identifiers changed. @mkreilly, could you clarify what percentage difference or similarity would be acceptable when joining parcels across tables? That might help us define what successful completion of this story looks like.
One previous attempt at keeping a parcel's ID the same was to keep an ID column on the table that had a unique name which was generated in some early process, and then just make sure that that ID column remained on the table in all cases where parcels were used in the modeling process.
Another approach is to use the hash of the geometry column. However, when we compared the geom_ids from @janowicz's (Windows 7) laptop to those generated by the MTC Windows Server 2012 machine, only 1/3 of the parcels were exactly identical. On the other hand, across Linux machines built in exactly the same way, more than 95% of the geometry IDs are identical.
Other ideas for keeping a parcel's ID the same include using a geohash or similar.
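One direction (an untested sketch; the precision and hash choice here are arbitrary) is to hash centroid coordinates only after rounding, so tiny cross-platform floating-point differences, like the Windows vs. Linux mismatches above, can't change the ID:

```python
import hashlib

def parcel_geom_id(centroid_x, centroid_y, precision=6):
    """Hash a parcel centroid after rounding to `precision` decimal
    places, so sub-rounding floating-point differences across
    platforms map to the same ID."""
    key = "{:.{p}f},{:.{p}f}".format(centroid_x, centroid_y, p=precision)
    return hashlib.md5(key.encode()).hexdigest()

# Two centroids differing only past the 6th decimal get the same ID:
a = parcel_geom_id(-122.271230001, 37.804400001)
b = parcel_geom_id(-122.271230002, 37.804400002)
assert a == b
```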
There might be 2 notions of time that are relevant: parcel time and database time. For example, let's assume that parcel A has an attribute something=1 at time-1 in the parcel table. If we discover, at time-2, that we were incorrect, and that at time-1 parcel A in fact had something=2, do we revise the time-1 parcel table? Or do we only revise the time-2 table? This could be more complicated if something is actually the geometry of the parcel, or if the parcel splits.
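The two notions of time can be made concrete with a tiny bitemporal log (a sketch, not anything in the current codebase): corrections are appended at a new database time rather than rewriting the old table, so both the original belief and the corrected one stay queryable:

```python
# Each fact carries both the time it was true of the parcel (parcel
# time) and the time we recorded it (database time).
history = []

def record(parcel, attr, value, parcel_time, db_time):
    history.append(
        {"parcel": parcel, "attr": attr, "value": value,
         "parcel_time": parcel_time, "db_time": db_time}
    )

def as_of(parcel, attr, parcel_time, db_time):
    """What did we believe, as of db_time, the value was at parcel_time?"""
    matches = [r for r in history
               if r["parcel"] == parcel and r["attr"] == attr
               and r["parcel_time"] == parcel_time
               and r["db_time"] <= db_time]
    return matches[-1]["value"] if matches else None

record("A", "something", 1, parcel_time=1, db_time=1)
record("A", "something", 2, parcel_time=1, db_time=2)  # correction at time-2

as_of("A", "something", parcel_time=1, db_time=1)  # -> 1 (what we believed then)
as_of("A", "something", parcel_time=1, db_time=2)  # -> 2 (the corrected view)
```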
In order to successfully get from source data to outputs, this repository has at least 2 dependencies (GDAL/PostGIS, Anaconda) that are installed at a lower level than a base python installation. Thus far we have been using shell scripts to keep track of these dependencies, for example:
https://github.com/MetropolitanTransportationCommission/bayarea_urbansim_setup/tree/vagrant-ubuntu14
On Windows, however, we do not have a straightforward way of debugging these environment/operating-system requirements and configuration. We have discussed keeping track of dependencies in a simple text file.
How would this help us for debugging problems at the OS/environment level? For example, if a windows user needs to completely reinstall Anaconda, what do they need to do to re-configure gdal for Anaconda?