
udst / urbansim

Platform for building statistical models of cities and regions

Home Page: https://udst.github.io/urbansim/

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 92.35%, HTML 7.19%, R 0.46%

urbansim's People

Contributors

bridwell, conorhenley, daradib, eh2406, federicofernandez, fscottfoti, gonzalobenegas, hanase, janowicz, jiffyclub, juanshishido, msoltadeo, pksohn, sablanchard, smmaurer, waddell


urbansim's Issues

urbansim using outdated version of pandana

urbansim is using an outdated version of pandana, with ==0.3.0 pinned inside setup.py. setup.py could be updated to use the latest version, or at least to enforce >=0.4.0; the pandana API has changed since 0.3.0, but this should not impact its use in urbansim.

Normalize input for Choice Models

A number of times we have accidentally compared the magnitudes of coefficients in the yaml files that represent MNLDiscreteChoiceModel instances. This is of course a mistake, as 0.001 is a large coefficient for nonres_sqft and a small coefficient for frac_developed. There is also the "magic 3's" problem: the code puts a hard cutoff on coefficients at -3 and 3. This is a great default for normalized variables (i.e. ones with std ~= 1 and mean ~= 0) but way too small/big for other columns. If coefficients are made comparable then we can also consider adding L1 or L2 regularization.

My proposal is that when fitting a model we subtract the mean and divide by the std for each column. In the yaml file we store the training mean, training std, and the coefficients of the transformed columns. Then when predicting with a model we transform with the stored mean and std. Use of the models will be unchanged, but the stored coefficients will be comparable with each other.
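A minimal sketch of the proposal; the helper names and the fit_fn/predict_fn callables are placeholders standing in for the existing estimation code, not the current MNLDiscreteChoiceModel API:

def fit_standardized(data, columns, fit_fn):
    """Standardize the explanatory columns, fit, and return everything
    that would go into the yaml file: training mean, training std, and
    the coefficients of the transformed columns."""
    means = data[columns].mean()
    stds = data[columns].std()
    transformed = data.copy()
    transformed[columns] = (data[columns] - means) / stds
    coefficients = fit_fn(transformed)   # existing estimation routine
    return {'mean': means, 'std': stds, 'coefficients': coefficients}

def predict_standardized(data, stored, predict_fn):
    """Apply the stored training transform before predicting, so the
    stored coefficients stay comparable across columns."""
    columns = stored['mean'].index
    transformed = data.copy()
    transformed[columns] = (data[columns] - stored['mean']) / stored['std']
    return predict_fn(transformed, stored['coefficients'])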

Thoughts?

Price Equilibration

Price Equilibration Specification

This page will be used to spec out the price equilibration features for @mkreilly but presumably will be useful for all users of UrbanSim so others may weigh in if they would like. @waddell @jiffyclub in particular might want to follow this discussion.

Purpose

The purpose of price equilibration is to add functionality so that prices may adjust based on supply and demand factors in regional real estate markets. As it stands, UrbanSim predicts prices based on definite characteristics of each building - for instance, the changing average income in a given neighborhood or a change in transportation infrastructure. This does not take into account factors such as demand exceeding supply, which is extremely important: it can be used to model rising prices in areas where supply is low, which in turn can lead to more feasible development if zoning allows such development. In general, this module will raise prices where demand exceeds supply and lower prices where the opposite is true.

Implementation

This will be an additional feature of the Location Choice Model predict method. The current predict method takes a set of choosers and a set of alternatives (where alternatives on the residential side are usually individual residential units) and places choosers into alternatives based on the computed PDF (there is a probability for every empty unit in the region, for every chooser). The current implementation then does an np.random.choice without replacement, which is a "first come, first served" choice - this was called lottery choices in the OPUS framework. The technique here is to adjust prices, and thereby the PDF, until, in principle, the market is cleared and demand doesn't exceed supply anywhere (it's an open research question how much the real estate market operates by "lottery" choice versus full "equilibration" - the real behavior is almost certainly between the two).

Detecting where demand exceeds supply

Given that each agent has a PDF of their choices, detecting where demand exceeds supply is a simple matter of summing the probabilities for each unit across agents. In fact, this could in principle be run directly at the unit level, and where the sum of the probabilities is greater than 1, prices need to be adjusted up (it is assumed that each location choice model has a price attribute and a negative relationship of price to demand).

In practice, unit-level probabilities are probably too granular - real estate analysis is usually performed for each "submarket" and demand should be summed at the submarket level. One of the parameters to the predict method should be the definition of submarkets (a mapping of alternatives to submarkets). Although street nodes are one possible submarket definition (which would yield high detail and 226K submarkets), we should probably start with zonal submarkets (1454 submarkets in the Bay Area). Thus summing the PDF by zone for each chooser yields the aggregate demand for each zone. This can then be divided by the number of units available in that zone to yield a positive floating point number indicating demand for that submarket relative to supply. A number of 1.1 would generally indicate that demand is 10% higher than supply, while a number of 0.9 would indicate demand being 10% lower than supply.
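A minimal sketch of that aggregation, assuming probabilities is the choosers-by-alternatives DataFrame from the predict step, alt_zone_ids is a Series mapping each alternative id to its zone, and supply_by_zone counts available units per zone (all names here are assumptions about the predict method's internals):

def demand_supply_ratios(probabilities, alt_zone_ids, supply_by_zone):
    # expected demand for each alternative (unit) is its probability
    # summed over all choosers
    demand_per_alt = probabilities.sum(axis=0)
    # aggregate unit-level demand up to the submarket (zone) level
    demand_by_zone = demand_per_alt.groupby(alt_zone_ids).sum()
    # 1.1 means demand exceeds supply by 10%; 0.9 means it falls 10% short
    return demand_by_zone / supply_by_zone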

Equilibration

Once the relationship of demand to supply is known as described above, excess demand and excess supply must be used to adjust prices. As described above, it is assumed that there is a column of prices for each location choice model and that the coefficient on price is negative (e.g. agents, all else equal, prefer lower prices). To use this module, a column must be identified as the "price column" of the alternatives so that the predict method can adjust prices on-the-fly.

Here there are a few implementation choices. Price is potentially also an interaction factor in other columns (e.g. income and price can be interacted to compute different elasticities to price of people with different income levels). This is not a huge problem if the entire predict method is run multiple times, but for performance reasons, it is possible to separate the price column from non-price columns, to precompute the sum of the utility for the non-price columns, and to never change this sum while adjusting only the price column on-the-fly. To start with, it's probably easier to run the whole predict method multiple times and avoid code complexity at the cost of performance. This can be reevaluated if performance is simply too slow.

How much to equilibrate

There are also some implementation choices on how exactly to adjust the prices. There likely exists a way to define an adjustment (Liming's paper?) which optimizes the change in price in order to clear the market (this is likely similar to gradient descent?). I would propose for the initial implementation to simplify this somewhat and use a more brute-force method. The algorithm takes these parameters:

  • the max adjustment per iteration
  • the max number of iterations
  • last year's price multipliers
  • a function which maps demand-supply ratios to price multipliers

Essentially each submarket (zone) has a demand-supply ratio, computed as described above, and these ratios get converted to price multipliers that operate at the same submarket (zone) level. In fact, the simplest version is probably for the ratios to be used as the multipliers - i.e. where demand exceeds supply by 10%, prices are raised by 10%. This change in price is probably capped at a maximum adjustment per iteration which is user defined (maybe 10%).

The algorithm then reruns with new prices (the price column times the price multipliers) and has a new set of demand-supply ratios. The algorithm can exit on a fixed number of iterations, which caps the adjustment of prices per year at the max number of iterations times the max adjustment per iteration.
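A brute-force sketch of that loop. Here ratios_fn stands in for rerunning the predict step and returning zone-level demand-supply ratios, prices are treated as zonal for simplicity, and the convergence tolerance is an assumption:

def equilibrate(ratios_fn, zone_prices, prior_multipliers,
                ratio_to_multiplier=lambda r: r,
                max_adjustment=0.10, max_iterations=10, tol=0.01):
    multipliers = prior_multipliers.copy()      # carried forward from last year
    for _ in range(max_iterations):
        # rerun prediction with adjusted prices; returns demand/supply by zone
        ratios = ratios_fn(zone_prices * multipliers)
        if ((ratios - 1.0).abs() < tol).all():
            break                               # market roughly cleared
        # simplest mapping: use the ratio itself, capped per iteration
        step = ratio_to_multiplier(ratios).clip(1 - max_adjustment,
                                                1 + max_adjustment)
        multipliers = multipliers * step
    return multipliers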

Outcomes

Presumably the price multipliers are stored and carried forward from year-to-year as the starting place for next year's PDF. These price multipliers are zonal attributes so they can also be mapped for a given year or charted for each zone over the years of the simulation.

The real outcome from the simulation is the impact of the equilibrated prices on development feasibility in the high-demand low-supply areas. We will need to run a simulation with price equilibration turned on compared to price equilibration turned off and look at the difference in the sum of developed units per zone, especially as contrasted with the map of price shifters described in the previous paragraph. We should quickly learn if increased prices serve to increase development, or if zoning simply drives up prices in the areas of high demand without increasing feasibility (presumably because zoning restricts denser development even if it would be feasible). A notebook of these outcomes as maps or charts is also within the scope of this task.

Open Questions

  • Is this primarily for residential/housing or should we implement for non-residential as well?

sim.model as a class?

One thing I'm noticing w/ the sim.model wrappers is it seems that models are being re-initialized for each year in the simulation. This seems undesirable since initialization is then being called many times over. For example:

@sim.model('household_transition_model')
def household_transition_model(households, year, controls_df, control_col_name):
    t = transition.TabularTotalsTransition(controls_df, control_col_name)
    tm = TransitionModel(t)
    updated, added, updatedLinks = tm.transition(households, year)
    # ...

I've gotten around this by creating an instance of the model outside of the model wrapper and then injecting it, for example:

t = transition.TabularTotalsTransition(controls_df, control_col_name)
tm = TransitionModel(t)
sim.add_injectable('hh_transition', tm)

@sim.model('household_transition_model')
def household_transition_model(hh_transition, households, year):
    updated, added, updatedLinks = hh_transition.transition(households, year)
    # ...

Is there a better way to do this? Declaring the models themselves as injectables certainly works but it feels a little cumbersome. It might be nice if we could decorate a callable class. Then we could define initialization logic in the __init__ method to create model instances and then run year-specific code in the __call__ method - something like the sketch below?
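A purely hypothetical sketch of what that could look like; sim.model does not support decorating classes today, so treat the decorator usage here as a design idea rather than working code:

@sim.model('household_transition_model')
class HouseholdTransitionModel(object):
    def __init__(self, controls_df, control_col_name):
        # one-time setup, run when the model is registered
        t = transition.TabularTotalsTransition(controls_df, control_col_name)
        self.tm = TransitionModel(t)

    def __call__(self, households, year):
        # per-year logic, run once per simulated year
        updated, added, updatedLinks = self.tm.transition(households, year)
        # ...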

thanks,
scott

DataFrame Explorer ValueError

Hi Everyone,

I originally made this post on the SanFran_Urbansim git but it seems like those forums have been pretty inactive recently and the source of my error seems to be coming from "..\urbansim\maps\dframe_explorer.py".

I've been trying to run the SanFran_Urbansim DataFrame Explorer with my own data. I have run into the same "Query Failed" message whenever I try to access some of the drop down tables in the explorer (specifically 'buildings', 'jobs', and 'households')

I also tried running the original SanFran_Urbansim code from this github to see if it was an issue related to the code I changed but it still has the same errors.

The following is the ValueError message that keeps showing up:

Traceback (most recent call last):
File "C:\Users\hlc42705\AppData\Local\Continuum\Anaconda2\lib\site-packages\bottle.py", line 862, in _handle
return route.call(**args)
File "C:\Users\hlc42705\AppData\Local\Continuum\Anaconda2\lib\site-packages\bottle.py", line 1740, in wrapper
rv = callback(*a, **ka)
File "C:\Users\hlc42705\AppData\Local\Continuum\Anaconda2\lib\site-packages\urbansim\maps\dframe_explorer.py", line 47, in map_query
results = {int(k): results[k] for k in results}
File "C:\Users\hlc42705\AppData\Local\Continuum\Anaconda2\lib\sitepackages\urbansim\maps\dframe_explorer.py", line 47, in 
results = {int(k): results[k] for k in results}
ValueError: invalid literal for int() with base 10: '146.0'
127.0.0.1 - - [29/Jun/2017 09:05:35] "GET /map_query/households/empty/zone_id/building_type_id/mean() HTTP/1.1" 500 1812

Based on the error it looks like the issue stems from the urbansim package at line 47 of dframe_explorer.py; any ideas how to fix this, or am I doing something wrong when I run the explorer?

I do have the most recent updated version of urbansim (via pip install/update).
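For what it's worth, the traceback suggests the group keys come back as float strings ('146.0'), which int() can't parse directly; one possible workaround at that line (just a guess based on the traceback, not an official fix) would be to coerce through float first:

results = {int(float(k)): results[k] for k in results}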

High Level Interface

I'm starting to put down some ideas for the high level interface to urbansim, you can see some sketches here: https://gist.github.com/jiffyclub/0bb253757547fbc14add. There isn't a huge difference between the three sketches there now, but I'll keep thinking about things and probably our dialogue will spur some ideas all around.

Let me know what looks good, what's missing, what's on your wishlist for this, etc.

FutureWarning .as_matrix() > .values

Update needed to remove future warning for .as_matrix(), for example:

FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  distances, indexes = self.kdtree.query(xys.as_matrix())
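The change itself is mechanical, e.g. (assuming xys is a pandas DataFrame):

distances, indexes = self.kdtree.query(xys.values)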

Setting an attribute on the dataframe wrapper class

I just want to make sure I'm doing this the most reasonable way. So when these things were dataframes before, I used to do a lot of:

outdf[output_fname].loc[new_units.index] = SOMETHING

Now outdf is a dataframe wrapper so there's no .loc function, which is fine. The below should work - is this the best way to do it? Basically I get the whole series, then set the right indexes, then set the whole series back. I imagine this is the best practice?

s = outdf[output_fname]
s.loc[new_units.index] = SOMETHING
outdf[output_fname] = s

Temporarily pin scipy < 1.3

We are running into an incompatibility between the current releases of scipy and statsmodels that produces an ImportError for the factorial module.

Temporary solution is to pin scipy<1.3 in setup.py, which we can drop once this is fixed in statsmodels: statsmodels/statsmodels#5747
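Roughly, the relevant setup.py entry would look like this (surrounding entries omitted):

install_requires = [
    # ...
    'scipy<1.3',  # temporary; drop once statsmodels/statsmodels#5747 is released
]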

Installing troubles

I can't get conda to install urbansim > 1.3.
Win 10, Py 2.7, 64-bit

IPythonNotebookScratch>conda config --add channels synthicity
Skipping channels: synthicity, item already exists
IPythonNotebookScratch>conda install urbansim
Using Anaconda Cloud api site https://api.anaconda.org
Fetching package metadata: ......
Solving package specifications: ................................
Package plan for installation in environment C:\Anaconda2:

The following NEW packages will be INSTALLED:

    urbansim: 1.3-py27_0

Proceed ([y]/n)?n
IPythonNotebookScratch>conda install -c https://conda.anaconda.org/synthicity urbansim
Using Anaconda Cloud api site https://api.anaconda.org
Fetching package metadata: ........
Solving package specifications: ................................
Package plan for installation in environment C:\Anaconda2:

The following NEW packages will be INSTALLED:

    urbansim: 1.3-py27_0

Proceed ([y]/n)?n

What am I doing wrong here?

Unit-level representation of housing

The current unit of analysis is buildings, yet there are many reasons for migrating to a unit-level representation of the housing stock. Units can have separate attributes such as floor, view, deeded affordability, as well as tenure, which could be fixed for a building, or change with economic conditions.

The main work for this suggestion would be in sorting out all the places the model currently touches the buildings table and replacing with a new units table, keyed on buildings.

need to be able to run multiple simulations at the same time

I would need each simulation to start with a different data store.
Before, in each simulation I was doing dset = LocalDataset('data/bayarea.h5')
How can we approach this now? Does it make sense to spawn a new process and reimport dataset.py?

merging combined with getting columns from models

A case that comes up all the time (in fact there's no model that doesn't work this way) is that we merge something to nodes (or zones) and then run patsy on it to take the right variables. OK, now we have functions to get the variables needed by a model so in principle I ought to be able to ask the dataframe wrapper for only those fields. The problem is that I have to merge the dataframes outside of the dataframe wrapper (I can't merge two dataframe wrappers) and so at that point I have to ask for all fields in both dataframe wrappers.

Is there a better way to do this? Let me try to ask this again. So I have a model where half the fields come from buildings and half from nodes. Am I supposed to ask buildings for the fields from the model (ignoring the ones it doesn't have), then do the same for nodes, and then merge them? In practice, the merge would then have to happen after I've read in the model configuration. Is this the right way?

Code Injection Security Vulnerability

Hello,

This line is vulnerable to Code Injection :

results = eval(cmd)

An attacker can execute arbitrary code on your server by accessing the url /map_query/table/empty/groupby/field/.count();import os;os.system('evil_cmd')

This will result in the variable cmd getting set to:
"df.groupby('groupby')['field'].count();import os;os.system('evil_cmd')".

Then, eval(cmd) will result in executing the evil_cmd payload and compromising your server. This is a serious security risk.

To fix this problem, I would advise you to sanitize the parameters table, filter, groupby, field, agg before passing them to eval (see the sketch below). An even better solution would be to not use eval at all, but this fix might be more complicated.
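A sketch of what a whitelisting approach could look like; the parameter names follow the URL pattern above, but the surrounding dframe_explorer code is assumed:

import re

ALLOWED_AGGS = {'sum', 'mean', 'median', 'count', 'min', 'max', 'std'}
IDENTIFIER = re.compile(r'^[A-Za-z_][A-Za-z0-9_]*$')

def safe_map_query(df, groupby, field, agg):
    # validate column names and the aggregation instead of eval'ing a string
    if not (IDENTIFIER.match(groupby) and IDENTIFIER.match(field)):
        raise ValueError("invalid column name")
    if agg not in ALLOWED_AGGS:
        raise ValueError("unsupported aggregation")
    return getattr(df.groupby(groupby)[field], agg)()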

I found the bug while testing DeepCode’s AI Code Review. The tool can help you automate the process of finding such (and many other types of) bugs. You can sign-up your repo (free for Open Source) to receive notifications whenever new bugs are detected. You can give it a try here.

Any feedback is more than welcome at [email protected].

Cheers, Victor.

Exception raised in developer if no feasible buildings

If feasibility for all buildings of a given form is n/a, developer.pick raises a ValueError exception.

Suggested bugfix in developer.py, line 64:

was:

        mu = self._max_form(f, "max_profit")
        indexes = [tuple(x) for x in mu.reset_index().values]

should be:

        if len(f) > 0:
            mu = self._max_form(f, "max_profit")
            indexes = [tuple(x) for x in mu.reset_index().values]
        else:
            indexes = []

No feedback if iteration does not reach exact result in utils.sampling.sample_rows

Function utils.sampling.sample_rows includes a loop which should converge to an exact pick of samples.
In some cases, this loop cannot converge.
Example:

import pandas as pd
import urbansim.utils.sampling
df = pd.DataFrame({'tot': [2,2,2]})
sample = urbansim.utils.sampling.sample_rows(total=3, data=df, replace=False, accounting_columns='tot')
sample.sum()  # = 2 and not = 3 as expected

Even if an exact pick is possible, the random permutation at line 51 causes a further problem. If the values needed for an exact pick are at the head of the randomized list sample_idx, they are not available in the loop to adjust for an inexact pick, and the loop cannot converge.
This aspect depends on the random seed, so it can make integration tests fail randomly.

Suggested solution:
there should be some feedback from the sample_rows function whenever convergence to an exact pick is not achieved. The caller may use this information if they wish. In this way a more robust test suite can be obtained.

UrbanSim using outdated version of Patsy

I discovered this error while trying to fit a SegmentedMNLDiscreteChoiceModel. Patsy throws an error when creating the dmatrix. Looking into it I found that the author of Patsy fixed this when upgrading Patsy from v0.4.0 to v0.4.1.

I upgraded Patsy and this corrected the error. I don't know if the latest UrbanSim release bundles Patsy 0.4.1, but it will need the newer version of the package to fit models.

add_table vs. @sim.table()

@jiffyclub One thing I've encountered is that I only have the option between add_table, which runs every time I import the file, and the table wrapper function.

So in the former case I have a file where all my tables are added to the sim - if I need one of them I have to read in all of them.

In the latter case these things run on-the-fly, and I can't set columns on a table function, which is probably appropriate since these functions are used for merges and other things like that.

What I was also hoping for was something in between: a function which gets executed the first time the table is requested, knows how to grab the table from wherever, and then calls add_table. That way I don't have to read in every table if I don't need them all. The results would then be cached like in add_table(), so this function is only called once - not like sim.table(), which is called every time. Or perhaps there's another way to enable this workflow...
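Something like this hypothetical helper is what I mean; this is not an existing feature, it just assumes the sim.table/sim.add_table API discussed above (store is a stand-in for an HDFStore or similar):

def lazy_table(name, loader):
    """Register a table function that loads the data on first access and
    then re-registers the result as a concrete table via add_table."""
    @sim.table(name)
    def _table():
        df = loader()            # e.g. read the table from an HDF5 store
        sim.add_table(name, df)  # cache it so the loader only runs once
        return df

# usage: nothing is read from disk until 'households' is first requested
lazy_table('households', lambda: store['households'])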

problem in segmented transition model and pandas

I'm having issues with a segmented tabular transition model using pandas 0.17.0. The issue seems to occur when calling df.query in the urbansim.models.util.apply_filter_query method. Sometimes when called, no results are returned even when the query is valid. The issue is intermittent, so different segments are affected across different runs.

Has anyone experienced anything like this or had issues with DataFrame.query? The same code and data works fine in pandas 0.13.1.

thanks,
scott

Vision Solver

Vision Solver First Pass Algorithm

This page will be used to spec out the vision solver for @mkreilly but presumably will be useful for all users of UrbanSim, so others may weigh in if they would like. After talking with Mike, we decided to make this the third task after #110 and #111, since the other two tasks will probably make "solving the vision" easier.

Purpose

The purpose of this module is to find the policy inputs that yield a desired distribution of development in a region, where policy inputs can include the possibility of fees and subsidies. The method here should be consistent with the related work in #110 and #111, which adjust prices based on the interaction of supply and demand and provide subsidies for development, respectively. The vision solver needs to work for both residential and non-residential outcomes, but I imagine that its use will be emphasized for development of residential units (rather than commercial floor space).

Implementation Details

In short, the vision solver takes a subarea of the region - something the size of a zone, a neighborhood, or a PDA (priority development area) - and a target for the number of units (and floorspace) that subarea should contain in a future year. There are a certain number of units (and floorspace) currently in the subarea, and the difference between the amounts is the target for development in the subarea. This module is being written primarily to understand development in PDAs in the Bay Area region, of which there are a couple hundred, with a few hundred parcels each.

It should be noted right away that there are a few outcomes that can result from this analysis:

  • If the zoning and feasibility are consistent with the target, the model will choose among buildings (presumably weighted based on profitability) to pick the buildings most likely to be built.

There are a number of other cases which bring about issues in the model (and are the purpose for this module):

  • The zoning might be inconsistent with the target, which would essentially create an infeasible solution. This should be flagged by UrbanSim, but must be corrected by the modeler (e.g. the zoning must be changed to accommodate the target)
  • The feasibility of development might be inconsistent with the target. This would take base year (observed), or simulated (for a future year), or simulated plus equilibrated versions of the prices, use them in feasibility, build all the feasible buildings, and still not meet the target. In this case, subsidies must be applied to meet the target. This process is rather simple since developments are already sorted based on profitability and can be chosen off the stack until the target is met, with the sum of the negative profits being the subsidy that is required (see the sketch after this list). (Cross-subsidizing unprofitable buildings with profitable ones is part of task #111, which will help us reach the vision; for this reason, task #111 is probably a dependency of this task.)
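A minimal sketch of choosing off the profitability-sorted stack until the target is met; the column names max_profit and net_units are assumptions about the feasibility output, not the actual schema:

def pick_until_target(feasible, target_units):
    ranked = feasible.sort_values('max_profit', ascending=False)
    picked = ranked[ranked.net_units.cumsum() <= target_units]
    # the sum of the negative profits among the picked buildings is the
    # subsidy required to meet the target
    subsidy = -picked.loc[picked.max_profit < 0, 'max_profit'].sum()
    return picked, subsidy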

Visualization of results

The output of this analysis is a set of buildings which can be visualized in UrbanCanvas. Not only can the planner now see what the rough amount of floor space required to meet the target actually looks like in 3D, but also a reasonable set of parcels that might actually be redeveloped will be the ones that are chosen by the feasibility analysis. This is the real power derived from tying together visualization and analysis in this way.

Doing the above in the context of a 30 year simulation

It is pretty straightforward to imagine a workflow which uses UrbanCanvas to set the target for a given PDA, uses the Python module to pick a reasonable set of parcels in order to meet the target, and returns the results to UrbanCanvas for visualization. The power of this approach to the problem is presumably that the user could be allowed to manually edit the results that were generated by the model, or perhaps zoning could be edited directly in UrbanCanvas so that an iterative workflow is possible.

In general though, the purpose of this module is to operate within the context of a 30 year forecast. After chatting with Mike, my best proposal on how to solve this would be to try to achieve 1/30th of the target for each year in the 30 year simulation - not that exactly that number of units would be built each year, as that would clearly vary based on the specific parcels being developed, but somehow the accounting is maintained so that on average 1/30th of the target is developed each year. If the targets account for the entirety of the control totals, presumably we will need dampening in other zones which are also profitable, either by charging fees or by having limits or by some other mechanism.

general use of injectables

@jiffyclub Matt - got a question for you. I've taken to using injectables as a way to store general state in the simulation. In this specific case I'm talking about the submarket_ratios (price shifters) that are used in supply and demand, but I use them for other data too - things that aren't appropriate as tables. First question, does this jive with your idea of UrbanSim - to store state as injectables? It's pretty handy, and works well.

If the answer to the first is yes, I have two follow up questions. First, I don't always inject the injectables - I mean sometimes I use them in a module rather than in a "sim.model" - in other words, I often use sim.get_injectable(). When I do this though, and the injectable is a function, the actual function gets returned. Is this something that should be fixed? I mean, I don't have autocall=False set or anything so I would sort of expect it to call the function and actually return the value - e.g. isn't this how get_table works?

Second, I sometimes check to see if state is already set and have to do it this way if key in sim.list_injectables(): - I wonder if it would be appropriate to pass an optional second parameter which is what gets returned if the key does not currently exist, similar to .get on a dictionary. So instead of needing to check if it exists or not, you just call sim.get_injectable(key, None) or something similar. Maybe this would require doing the same on the other "get" methods, or maybe not. Thoughts?
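In other words, something like this small helper, using sim.list_injectables() and sim.get_injectable() as described above:

def get_injectable_or(key, default=None):
    # mirrors dict.get: return the injectable if registered, else the default
    return sim.get_injectable(key) if key in sim.list_injectables() else default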

__len__

What do you think about adding __len__ to the dataframe wrapper?
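A one-line sketch of what I have in mind, assuming the wrapper keeps its underlying DataFrame in a local attribute (the attribute name is an assumption and may differ):

def __len__(self):
    return len(self.local)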

Change in predict() in statsmodels 0.8 no longer drops NaN values in returned Series

Statsmodels released 0.8.0 last month. Part of the changes are:

Backwards incompatible changes and deprecations

  • predict now returns a pandas Series if the exog argument is a DataFrame, including missing/NaN values

We use the statsmodels predict method in the regression module. Previously, any missing/NaN values passed to the exog argument would be removed from the returned Series. This would result in a mismatch between the lengths of the input data and the predicted data, raising the ModelEvaluationError here. Looks like with this change, the predicted data comes back the same length with NaN values included, so the assertion fails.

Places where a parcel model like urbansim_defaults would interact with this include:

  • Hedonic simulation
  • Location choice simulation
    In both of these examples, the urbansim_defaults code does a check for NaN values in the input DataFrame before going into the predict() method, so behavior should not change.

The easiest fix for this is probably just to extend the predict function in regression.py to drop rows with NaN values in the returned Series, only if it's a Series (seems like this change to statsmodels only applies to input DataFrames/output Series. NDarrays would be unchanged). The current code looks like this:

sim_data = model_fit.predict(df)

This is the proposed change that should keep the old behavior:

sim_data = model_fit.predict(df)
if isinstance(sim_data, pd.Series):
    sim_data.dropna(inplace=True)

I'll issue a PR with the above change if folks don't have other ideas, but wanted to raise the issue first in case there's something I haven't considered.

Remove zbox as dependency

Matt Davis's zbox package is no longer necessary, as the toolz package comes with a package called tlz that provides the same functionality.

We should remember to bump the required version of toolz to v0.8.1, as this is the first version that provides this functionality.
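The swap itself should be mechanical; assuming urbansim currently imports toolz through zbox, it would change roughly like this:

# before
from zbox import toolz

# after (toolz >= 0.8.1 ships the tlz shim, which picks cytoolz if available)
import tlz as toolz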

.columns returns error

When I call .columns on the dataframe wrapper it's giving me an error right now - I think the arrays need to be turned into a list or flattened or something.

usability issues with lcm

I'm getting to the point that I'm looking for fixes for problems I infrequently encounter where I think we need to deal with errors better. Basically I'm seeing a few issues that get uncovered all the way down in mnl that I've tracked down as (I think) having to do with nans in the design matrix. @jiffyclub since you worked on the patsy stuff maybe you know about this better than I do... Basically I can catch the error by adding an assertion below.

https://github.com/synthicity/urbansim/blob/new-simulation-testing/urbansim/models/lcm.py#L341

Do you know if design_matrix simply drops rows with nans?

Now there are two ways I've been able to trip this so far - the first is that I can forget to run the hedonic before the location choice model. This would give a whole column full of nans and (I think) remove all the rows.

The other way I've tripped the assertion is by passing building_type_ids that are nan - if I filter these it works ok. I'm not sure if this removes all the nans from the dataframe - I haven't checked this yet but need to.

So, does this sound right to you - having nans in the dataframe causes a crash in mnl? Do you think we should turn this assertion into an exception that is descriptive to this effect?

And then on my end I have to do something smart with the nans I guess. I think we might need an additional layer of security here. Like, can I add a list of model dependencies to a model which have to run first - e.g. hedonic before lcm - and then instead of the nan error I get an error that I haven't run the hedonic yet?

Or alternatively (or in addition) maybe we could describe the nans that are present in a dataframe - this many nans in this field, that many in that field and so on? Thoughts?
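For that last idea, a small sketch of a NaN report that could back a more descriptive exception (design_matrix here stands for the patsy output discussed above):

nan_counts = design_matrix.isnull().sum()
nan_counts = nan_counts[nan_counts > 0]
if len(nan_counts):
    raise ValueError(
        "design matrix contains NaNs:\n{}".format(nan_counts.to_string()))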

zone based san francisco "canonical" model

I think we need a canonical super simple San Francisco data only model. I've parsed the data down to just San Fran and it's only about 10MB zipped up. If we make it not dependent on urbanaccess it should be trivial to just download and run and it runs about 20 seconds per year or so. I think this is the best way to show people how to use UrbanSim and also to test a full run on a nightly basis or so. I will try and get this task done soon.

Use joblib for cacheing

joblib looks like it could be really useful for caching results of decorated functions in the simulation. The great thing about joblib is that it caches to disk, so you don't have to worry about memory usage, and bases the caching on the argument values, so it will return cached values if the function inputs have not changed, or re-run the function if the arguments have changed.
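A minimal sketch of what that could look like; the function and cache directory are arbitrary examples, not existing urbansim code:

import pandas as pd
from joblib import Memory

memory = Memory('./joblib_cache', verbose=0)   # cache lives on disk

@memory.cache
def load_parcels(path):
    # re-runs only when `path` changes; otherwise the pickled result is
    # read back from ./joblib_cache
    return pd.read_csv(path)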

Maintenance release: v3.2

This issue is to plan a maintenance release for the urbansim library.

The current release is v3.1.1 and the new one will be v3.2. I'd like to include as many pending bug fixes and compatibility updates as possible, without making any changes that would be backwards-incompatible.

As a heads-up, aspects of this library’s functionality have been replaced by other more targeted packages, which are where additional feature development is most likely to occur:

- task orchestration is now handled by Orca
- real estate development logic is now in Developer
- discrete choice logic is now in ChoiceModels
- reusable model step patterns are in UrbanSim Templates

Previously merged PRs

These PRs have been merged to the master branch subsequent to v3.1.1 and will be included in this release:

- PR #198 (support for exporting ordered dicts to yaml)
- PR #200 (bug fix in transition model error checking)
- PR #201 (temporarily pins Pandana at v0.3)
- PR #211 (warns if logit estimation may not have converged)

Compatibility updates

Additional updates that I think should be included in this release:

  • resolve deprecation warnings (issues #215, #216)
  • resolve dependency incompatibilities (issues #169, #214, #217)
  • streamline CI tests and include more recent Python versions
  • enable distribution on conda-forge (involves issue #203)

Before finalizing

  • update documentation and wiki
  • update version numbers

Anything else we should consider including?

Accounts System

Opening an issue for @mkreilly regarding the proposal for the accounts system described here.

Mike here's a start on what I'm thinking (don't hold me to specific syntax yet).

@accounts.method("percent of value")
def percent_of_value(buildings, tif_percent):
    return buildings.residential_sales_price * buildings.square_footage * tif_percent

# add other methods here for fees and incremental tax and fees

@accounts.source("tif")
def tif_source(buildings, parcels):
    parcels = parcels.query("TIF_ZONE == True")
    buildings = buildings.query("value > 30000 and type == 'MFD'")
    buildings = sim.merge([buildings, parcels])
    return buildings

@accounts.destination("tif")
def tif_destination(parcels):
    parcels = parcels.query("TIF_OUT_ZONE == True")
    parcels = parcels[parcels.allowed("MFD")]
    return parcels

# we run every flow yearly for now
@accounts.flow("tif")
def tif_inflow(source):
    return accounts.percent_of_value(source, .01)

# to turn it off
# @accounts.disable("tif")

# returns dataframe of all inflows and outflows, like a bank account
accounts.get_transactions("tif")

# return the balance of the account
accounts.get_balance("tif")

The two questions that I think are the biggest are these:

  1. When do we use the account? I imagine there's a max you can use per building or per unit really. Really things can be pretty complicated right... So let's say a parcel allows multi-family or mixed use, but mixed use is the only thing that could possibly get the subsidy. This means we need to integrate the subsidy pretty deep into the developer model so that the subsidy is taken into account when choosing between buildings types - is this right? So really, setting up the inflow is pretty easy - getting the outflow right looks harder.

  2. The next issue I wonder about is this. How many of the above accounts will there be? Is there likely to be one per city for some of these accounts so that we either need to find a way to wrap the above in a for loop for each city or to take the underlying geographies (cities) into account more explicitly inside the decorators?

yaml.load() deprecation warning: YAMLLoadWarning

Getting yaml.load() deprecation warning:

YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.

can be fixed with for example:

import yaml
from yaml import CLoader as Loader, CDumper as Dumper  # fall back to SafeLoader/SafeDumper if libyaml is unavailable

d = yaml.load(f, Loader=Loader)

Complications with [client]_urbansim and urbansim_defaults

I think we should start a discussion - though this isn't necessarily a bug or an urgent problem - about the relationship of client implementations and urbansim_defaults. I mean, right now everything is working as designed but I wonder if we can / should improve the design.

To summarize the problem, we have models/variables/tables defined in Python code in urbansim_defaults and then other models/variables/tables defined in bayarea_urbansim. This is so we can share the default definitions across clients, which makes code more heavily tested and high-level features get updated once for all clients.

The biggest problem is that you can lose track of what is defined where and have to click between the two repos to find the place something is defined. In my experience working with UrbanSim users, this is one of the biggest UrbanSim-related headaches people encounter.

And it's especially a problem when using something like autoimport in the Notebook because when you're trying to override the default with something of the same name and then edit the file from urbansim_defaults, you can actually redefine the default.

Anyway, we've discussed some solutions to this - like whether we should process the defaults into the same file or something, but maybe we can come up with better ideas.

Second, it might be nice to have a command on the sim framework that prints out what is currently defined for a table/variable/model like the website has - I guess it prints out the source code. I can imagine this might work with the ipython interact stuff too.

Examples don't use orca

From what I've gathered you've generalized urbansim.sim into orca. That's awesome! Unfortunately, most of the docs still refer to urbansim.sim, making things rather confusing for new users.

Report number of records in estimation

It would be very useful if the dcm estimation class reports how many records were used in the estimation. Especially in segmented choice models, when the segments might be smaller than estimation_sample_size, it would be extremely helpful to have that info.

It could go into the report_fit method. I would create a pull request but I don't know where the sample size is stored.

setting up urbansim with no unresolved dependencies is a good idea

This is in reference to PR #81, but it's not directly related so I wanted to keep the information as an issue for future reference. I just want to record a note that describes a couple of issues I had after making this PR.

Basically the first thing run does when saving tables is save the baseline data out to the runs directory. This is great for record keeping but means all of the tables/columns have to be designed in such a way that there are no failed dependencies on the baseline data. In fact, the website runs the same way (this was the land_price issue @gbenegas was seeing before). Basically the model system needs to be setup so you can run to_frame() on all tables without crashing.

For instance, I had a dependency where I was reindexing a couple of columns from the nodes dataset and yet the nodes dataset doesn't get generated until accessibility is run the first time (I stopped reading default nodes data from disk), which takes about 20 seconds. So if you want to read the parcel dataset you would have had to run a couple of models first.

It's definitely true that I can add nodes as a @sim.add_table function and then when I inject nodes it will first run the model and set the output to the nodes table which will resolve when reindexed to the parcel table. Somehow this still feels wrong to me, basically because it makes reading the parcel table take so long, and running the nodes data still feels like a model to me.

So now I just gracefully exit the two computed columns that reindex from the nodes dataset if it hasn't been computed yet, and everything works smoothly. This is definitely a matter of preference/convention, but I wanted to start a conversation on the issue for future reference.

Worth noting that the simulation can actually run fine with unresolved dependencies in the model system, but writing to file and using the website both have the potential to fail because they run to_frame on all tables directly on the baseline.

YAML or JSON for config files

I wanted to float the idea of using YAML for config files instead of JSON. The only real reason for choosing YAML over JSON is that YAML is easier for humans to type, involving fewer quotation marks and brackets. On the code side things are pretty much identical. Below are examples of the same config file in JSON and in YAML. Thoughts? (No rush at all to make a decision on this.)

JSON

{
    "model": "hedonicmodel",
    "output_table": "dset.buildings",
    "estimate_filters": [
        "units.unit_lot_size > 0",
        "units.year_built > 1000",
        "units.year_built < 2020",
        "units.unit_sqft > 100",
        "units.unit_sqft < 10000",
        "units.sale_price_flt > 30",
        "units.sale_price_flt < 1000"
    ],
    "NOTUSED": "dset.fetch_csv('nodes.csv',index_col='node_id')",
    "add_constant": true,
    "internalname": "units",
    "patsy": "I(year_built < 1940) + I(year_built > 2005) + np.log1p(unit_sqft) + np.log1p(unit_lot_size) + sum_residential_units + ave_unit_sqft + ave_lot_sqft + ave_income + poor + jobs + sfdu + renters",
    "merge": {
        "table": "dset.nodes",
        "right_index": true,
        "left_on": "_node_id"
    },
    "output_varname": "res_sales_price",
    "table": "dset.homesales",
    "table_sim": "dset.building_filter(residential=1)",
    "dep_var": "sale_price_flt",
    "dep_var_transform": "np.log",
    "output_transform": "np.exp"
}

YAML

---
NOTUSED: dset.fetch_csv('nodes.csv', index_col='node_id')
add_constant: true
dep_var: sale_price_flt
dep_var_transform: np.log
estimate_filters:
    - units.unit_lot_size > 0
    - units.year_built > 1000
    - units.year_built < 2020
    - units.unit_sqft > 100
    - units.unit_sqft < 10000
    - units.sale_price_flt > 30
    - units.sale_price_flt < 1000
internalname: units
merge:
    left_on: _node_id
    right_index: true
    table: dset.nodes
model: hedonicmodel
output_table: dset.buildings
output_transform: np.exp
output_varname: res_sales_price
patsy: I(year_built < 1940) + I(year_built > 2005) +
    np.log1p(unit_sqft) + np.log1p(unit_lot_size) +
    sum_residential_units + ave_unit_sqft + ave_lot_sqft
    + ave_income + poor + jobs + sfdu + renters
table: dset.homesales
table_sim: dset.building_filter(residential=1)

Error while loading parcels

import dataset
import urbansim.sim.simulation as sim
sim.get_table('parcels').to_frame()

is working fine, however the following

import models
import dataset
import urbansim.sim.simulation as sim
sim.get_table('parcels').to_frame()

gives
IOError: File ./data/nodes_prices.csv does not exist

dframe_explorer not rendering colors on map

When using the dataframe explorer on our zone system, everything is displaying properly except the actual colors in the zones. In other words: the OpenStreetMap tiles are there, the zone outlines are there, the legend updates when I adjust the query or the colors, and I can manually call the queries at a python prompt and get the proper values returned; the problem is that no colors render within the zones.

Here is what I am using to start the dframe_explorer:

d = {tbl: sim.get_table(tbl).to_frame() for tbl in ['buildings', 'jobs', 'households', 'zones']}

dframe_explorer.start(d, 
        center=[33.45, -112.075],
        zoom=11,
        shape_json='data/zones.json',
        precision=2)

Next Gen Dev Mdl

I want to open up an issue about the work that probably needs to be done soon on the next generation developer model. I'm going to outline how this might look from my perspective but I expect there will be feedback from multiple places. @jdoliveira and @cvanegas I'm looking at you ;)

OK, as a reminder, this is how the developer model works now. Basically we take in a set of parcels, with max FARs, max height limits, parking requirements, a standard set of costs, another set of prices, and probably a few other things. We first precompute the whole set of possible inputs on a limited set of input parcels. Specifically we test multiple FARs and figure out the "break even" price such that, if this price is exceeded in the marketplace, the development is profitable. This is really nice from a performance standpoint because we can precompute feasibility and then everything from that point on is a lookup for each parcel and the inputs associated with that parcel. This is VERY fast and so really useful for regional modeling, where we're testing 2M parcels for feasibility every year. Keep in mind each UrbanSim simulated year runs in about 6 minutes, and the feasibility model is already 1 minute of that time. The proposal I'm going to describe here is likely to be MUCH slower. I will describe how we might parallelize it and run in C, but I just don't think there's any getting around that it will be an order of magnitude slower than the current implementation, and the bottleneck of the simulation. As such, its primary use case will be subarea studies, although we can try it in UrbanSim too.

In general I'm thinking along these lines. In UrbanCanvas we have what we call development types, which in my mind are similar to "products" that a developer might describe. Here's a discussion of standard developer product types and also the need for alternatives. UrbanFootprint for instance uses about 90 building types. At any rate, we will also have some set of building types, which we call development types. The idea here is to integrate these development types into the inner loop of a feasibility calculation. So in our case, development types will be a product type, like "affordable townhome, wood construction" or "luxury condo, mid-rise." Therefore, our development types provide a number of implicit assumptions, which are detailed and very useful to both feasibility and visualization -

  • Most importantly, we can create the "buildable area" based on height limits and setbacks and odd-shaped parcels and figure out if we can actually fit the available amount of FAR on the parcel
  • There is an implicit assumption of materials which should help define a specific product type right from the RSMeans handbook so we can get the cost of specific development type
  • There is an implicit assumption of quality of the resulting building, which will define where on the price curve this building might exist (is it luxury or affordable?) and balance the prices accordingly
  • Other inputs might be tied to development types, like unit mix and parking requirements, etc. For instance, luxury condos might have a different unit mix and parking requirements than other developments
  • The visualization of the building is tied directly to all these things so that the visualization is at the same level of detail as the analysis

OK, so my proposal is that we do this the slow and accurate and easy to modify way now. I'm assuming that each parcel has a certain number of development types that we test on it. Like, we might test townhomes and mid-rise condos on certain parcels. Mid-rise condos probably have a range of densities, so we might test them at the max far allowed by zoning (assuming it's allowed by the geometry) as well as the inflection points (which are essentially the points just below heights at which construction costs increase). But there is a clear set of development types/FARs that we test, probably on the order of about 10-20 forms per parcel. The development types already give us space by use (like, the number of residential sqft and retail sqft and office sqft, etc) and we can translate this to a feasibility using a pencil out pro forma (exactly the same as we do with penciler). We go from gross to net square footage, have unit mix multipliers (to give us unit mix), prices per unit, costs per gross square footage, and parking requirements and size use per parking space and cost per sqft for different kinds of parking, etc (these were all in Penciler).

Now, having implemented this in Penciler (in Javascript) I can say it's quite a bit easier to code, maintain, and understand a pro forma written in this way than the current vectorized implementation written in UrbanSim/numpy (remember the current version is written for performance). Here, I think we know we have to avoid Python for its for loops, so I think we have to go to Cython, Numba, or similar. I don't have a great deal of experience with Cython so there's overhead in learning it, but I imagine it will be pretty easy to read once it's learned and written. In this way, we only really need to code the pencil out pro forma for one development type at a time (with for loops rather than vectors and if statements rather than multiplies), and then wrap it up for the 10-20 development types that are allowed per parcel. I'm guessing we'll need to call the CGA parser locally as a C library in the inner loop. I do not think it will be possible to call this as a service as it might take too long, but we can try it that way at first since we already have it compiled and running that way. Since it's in Cython, it will be as fast as C, and it can be parallelized with OPENMP for platforms that support that. Don't think this means fast though - if this is run a million times it will be very slow. Presumably this can also be integrated directly into UrbanCanvas (Cython compiles to native C code, though I don't understand the details yet).

This sort of approach will solve a number of problems for us, as numerous projects are asking for subarea high-detail pro forma calculations, with the ability to modify inputs and see the results at a neighborhood scale. Sooner or later we'll probably have to go this way and I wanted to start a discussion on how we can get it done.

Does max_dua account for resratio?

We are testing a pro forma model. We got it running without max_dua. Then we added a max_dua column. No more non-res or mixed-use construction! Looking into it, we had set max_dua = 0 for non-res parcels.

I expected that max_dua = 0 would lead to only non-res construction.

I observed that max_dua = 0 led to no construction.


My theory:

max_far_from_dua does not account for resratio, so max_dua = 0 caps the FAR at 0 for any building on those parcels, even ones with no residential component. Thus no new construction.

move models.utils into utils?

Does anyone object to moving models.utils.py into utils.model_utils.py (or something similar)?

I'm having an issue with circular dependencies because importing models.utils seems to also import everything from the models directory.
