Comments (30)
Some questions that might help:
- Which version of UrbanSim are you using?
- Can you be more specific than "few months"?
- Is the memory usage during estimation unusually high, or does it seem normal?
I honestly have no good ideas here because in theory there have been no changes to the way LCM estimation works. I really need to know which step(s) of the process are taking all the time, preferably with a profile dump.
from bayarea_urbansim.
I'd also want to know if it's actually sampling alternatives correctly, and the memory use could be a clue to that. It might be worth a test with a really simple model, with only a few alternatives, with only a few "choosers."
I wonder if the problem is the estimation sample size of 15000. The estimation_sample_size parameter controls how many choosers are used during estimation, so estimation is being attempted with 15000 choosers x 100 alts per chooser. That might slow things down. But I think that's how it has always been done?
Yeah I noticed that and tried it with more like 1500 and it was still slow.
Fletcher, did you notice this also? Do you or Sam have any idea what change might have triggered it or how long ago this started?
Paul
On Fri, Aug 7, 2015 at 12:08 PM, Fletcher Foti [email protected] wrote:
Usually something more like 100 would be used. Results don't change very much with larger sample sizes.
On Fri, Aug 7, 2015 at 12:48 PM, Fletcher Foti [email protected] wrote:
I haven't estimated the HLCM in a while, but I tried it out and it never finished estimating, so there seems to be a problem. Matt changed a few things in the HLCM a few months ago, which is the only thing I can think of.
This 15,000 or 1,500 is the sample size of choosers, not the sample size of alternatives. Right now the choosers are the actual households in the synthesized population, so they need to be sampled.
Thanks for looking into this. The codebase is the current synthicity bayarea_urbansim master. The timeframe would be since the last time the "hlcm_estimate" model was run in the copy of Estimation.ipynb that's in the repo. That actually looks like October 2014 -- the notebook file was changed in March 2015, but that cell wasn't updated.
I ran a test in sanfran_urbansim so I could profile the hlcm estimation. It has slowed down too, though it completes in 4-5 minutes with 3000 choosers x 4 segments. There the slowness is in the sampling of alternatives in mnl_interaction_dataset (see this changeset: UDST/urbansim@d17a5ce). np.random.choice is slow, and calling it many times adds up. I don't know if that's the exact same thing causing bayarea_urbansim to be so slow, but it's definitely contributing. (It'll be called (estimation_sample_size * number of segments) times, at 0.02218 seconds per call on my laptop.)
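For intuition about where that per-call cost comes from, here's a minimal timing sketch (the million-alternative population and 100-draw sample are illustrative assumptions, not the actual bayarea_urbansim numbers). NumPy's without-replacement draw permutes the whole population internally, so each call costs on the order of the population size, while a with-replacement draw only costs on the order of the sample size:

```python
import timeit
import numpy as np

n_alts = 1_000_000   # hypothetical universal choice set
k = 100              # alternatives sampled per chooser

# Sampling WITHOUT replacement shuffles all n_alts values internally,
# so each call is O(n_alts) regardless of how small k is.
t_without = timeit.timeit(
    lambda: np.random.choice(n_alts, size=k, replace=False), number=10) / 10

# Sampling WITH replacement just draws k random integers, so it is O(k).
t_with = timeit.timeit(
    lambda: np.random.choice(n_alts, size=k, replace=True), number=10) / 10

print(f"without replacement: {t_without:.5f} s/call")
print(f"with replacement:    {t_with:.6f} s/call")
```

Multiplied over thousands of calls (one per chooser), that per-call gap is consistent with the kind of slowdown described above.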
But if that's the only problem, I'd expect it to have finished in a few minutes when @fscottfoti tried setting estimation_sample_size to 1500, and in around 30 minutes with 15000.
It's possible I didn't wait long enough - I probably stopped at about 3 minutes. We should probably provide feedback to the user, like printing the log likelihood at each iteration or something. And if the sampling takes a long time, maybe some feedback during that as well.
But this same estimation used to take far less time to converge, right? If it is now taking 3 hours to estimate a reasonably sized model, no one will use it to experiment, and that leads to poorly specified models. Hopefully we can track down the origin of the slowdown and get back to fast performance for this.
On Fri, Aug 7, 2015 at 5:31 PM, Fletcher Foti [email protected] wrote:
Well I think LCMs were wrong before UDST/urbansim@d17a5ce, but maybe it was acceptably wrong? I will brainstorm ways to make that faster but nothing comes immediately to mind.
Interesting. Yeah, printing some feedback would be a helpful stop-gap.
Last night I left the estimation running -- as it's currently specified in bayarea_urbansim master, with 15,000 choosers and 100 alternatives -- and it finished after about 5 hours. Screenshot attached.
Do we know if this change was what caused estimation times to change a lot? I agree sampling without replacement is the preferred way to do this, though in practical terms it is not clear that it would change parameter estimates, given the extremely low probability that an alternative out of a huge universal choice set (say a million alternatives) would be sampled more than once in an estimation set consisting of, say, 100 alternatives. And if it did, on an extremely infrequent basis, would it affect the estimated coefficients? I doubt it. Did we ever test whether this was the case?
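That intuition is easy to check with a birthday-problem approximation (the one-million and 100 figures are Paul's hypotheticals, not the actual model's):

```python
import math

n_alts = 1_000_000  # hypothetical universal choice set
k = 100             # alternatives drawn per observation

# P(at least one duplicate among k draws with replacement from n
# equally likely alternatives) ~= 1 - exp(-k(k-1) / (2n))
p_dup = 1 - math.exp(-k * (k - 1) / (2 * n_alts))
print(f"P(any duplicate) = {p_dup:.4%}")
```

That works out to roughly 0.5%, i.e. about one estimation row in two hundred would contain a repeated alternative, and even then only one of its 100 alternatives is affected.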
On Fri, Aug 7, 2015 at 5:50 PM, Matt Davis [email protected] wrote:
You can get quite detailed logging output from UrbanSim. Put this before you run hlcm_estimate:
import logging
from urbansim.utils.logutil import log_to_stream, set_log_level
set_log_level(logging.DEBUG)
log_to_stream()
What would be really helpful for me is if you generated a profile I can look at for one of your 5 hour runs. You can do that in the Notebook by running hlcm_estimate with the prun magic:
%%prun -q -D hlcm_estimate.prof
orca.run(["hlcm_estimate"])
Then email me that hlcm_estimate.prof file.
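For anyone else who wants to inspect a .prof file like this themselves, the stdlib pstats module can summarize it. A self-contained sketch (the profiled function here is just a stand-in for the model step):

```python
import cProfile
import pstats

def work():
    # Stand-in workload; in practice you'd profile the model run itself.
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.runcall(work)            # profile a single call
pr.dump_stats("demo.prof")  # same file format that %%prun -D writes

stats = pstats.Stats("demo.prof")
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
```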
Hi @jiffyclub, this looks really helpful, thanks. When I run the latter block, it fails with:
KeyError: 'no step named hlcm_estimate'
But if I replace orca.run with sim.run it seems to go fine. Would that do the trick as well, or do I need to get it working through orca? I haven't used that before but just installed it using the conda command from here.
@smmaurer sim.run is fine, just means y'all haven't converted to orca yet.
@jiffyclub Great, here's the profile that was generated: https://github.com/ual/bayarea_urbansim/raw/estimation-test/estimation-test/hlcm_estimate.prof
Thanks, I can confirm that all the time is going to drawing alternative samples at https://github.com/UDST/urbansim/blob/acafcd9ce9f67a1d7924d512a32a83facf4904ea/urbansim/urbanchoice/interaction.py#L58.
Yup, changing that line to sample WITH replacement fixes it. A 30-minute model now estimates in 8 seconds. As Paul said, the replacement shouldn't make much difference in practical terms, so I'll submit this as a pull request. Thanks so much for figuring it out!
Sam, would you run the same check with and without replacement and compare the estimated parameters?
Paul
On Aug 10, 2015, at 7:24 PM, Sam Maurer [email protected] wrote:
If we want to go back to allowing duplicates I would revert UDST/urbansim@d17a5ce so there is only one call to np.random.choice. And either way, should we add a check to make sure the number of alternatives is much larger than the sample size? That seems like the only situation in which this kind of draw would be wise.
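Such a check could be as simple as the sketch below (the function name, message, and the 10x threshold are all made up here for illustration):

```python
def check_sample_size(n_alts, sample_size, factor=10):
    """Fail fast when with-replacement sampling would be a dubious choice,
    i.e. when the pool of alternatives isn't much larger than the sample."""
    if n_alts < factor * sample_size:
        raise ValueError(
            "number of alternatives ({}) should be at least {}x the "
            "sample size ({})".format(n_alts, factor, sample_size))

check_sample_size(1_000_000, 100)  # passes silently
```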
Another option I discovered is using the Python stdlib's random.sample, which is some 1000x faster than np.random.choice. The only problem with that is that it introduces a second source of randomness, so we/folks would have to seed two random number generators to get repeatable runs (assuming the NumPy and Python random number generators are separate).
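The two generators are indeed separate, so repeatable runs would need both seeded. A sketch of what that would look like:

```python
import random
import numpy as np

SEED = 42
random.seed(SEED)      # seeds the stdlib generator used by random.sample
np.random.seed(SEED)   # seeds NumPy's global generator used by np.random.choice

a = random.sample(range(1_000_000), 100)                 # stdlib draw
b = np.random.choice(1_000_000, size=100, replace=True)  # NumPy draw

# Re-seeding both generators reproduces both draws exactly.
random.seed(SEED)
np.random.seed(SEED)
assert a == random.sample(range(1_000_000), 100)
assert (b == np.random.choice(1_000_000, size=100, replace=True)).all()
```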
Seems like it would be worth it if that works.
I agree. If we have a fast way to do sampling without replacement, I would prefer that. But sampling with replacement looks like the lesser of two evils if it takes so long using np.random. Let's try random.sample and see how it works.
On Tue, Aug 11, 2015 at 9:34 AM, Fletcher Foti [email protected] wrote:
Also it seems like this is something the other UrbanSim would have encountered - if we need to we can dig into how they got random numbers.
Here's a comparison of estimation results with and without replacement. (Easiest way to compare is by opening the link in two different windows.)
https://github.com/ual/bayarea_urbansim/blob/estimation-test/estimation-test/Estimation-test.ipynb
The first run is WITHOUT replacement and the next two runs are both WITH replacement, to illustrate typical variation between runs of an identical model.
The main difference between WITH and WITHOUT is in the autoPeakTotal term -- not sure why. The other discrepancies are in line with the typical variation between identical runs.
From @waddell in slack: "Interesting. It does seem that the differences are generally comparable to the variation between runs. Not entirely sure about the reason autoPeakTotal might be more sensitive to this - but perhaps it has something to do with having less variability to begin with. Let’s go with this for now, and later try Matt’s suggestion of using stdlib's random.sample."
I've gone ahead and merged UDST/urbansim#148 to restore speedy behavior at the cost of allowing repeated sampling of alternatives. Note that I've also recently removed the simulation framework from UrbanSim (use Orca instead) so if you pull from the UrbanSim master you'll get those changes as well.
I think it is ok to leave the np.random.choice in ActivitySim.