arviz-devs / arviz
Exploratory analysis of Bayesian models with Python
Home Page: https://python.arviz.org
License: Apache License 2.0
Following #45, this evaluates whether the computation of the effective sample size needs to be corrected for circular variables.
It would be great to have a nice, generic version of this:
See mwaskom/seaborn#450 for discussion.
The suggested matplotlib recipes are wanting by comparison, I think (http://matplotlib.org/users/recipes.html).
Wanted to open an issue, since this is tricky, and I do not want to get too far away from the current implementation while doing this.
I have copied the current version I have below. Notice that the styling is slightly different: notably there is one subplot for each variable being plotted. Happy to hear any thoughts, otherwise I am going to keep working on getting all the options from the current version working here!
I think there's a bug in the Travis CI jobs regarding which Python version is used for testing.
On line 513 (https://travis-ci.org/arviz-devs/arviz/jobs/411966731#L513), Python 3.5 in a virtualenv is correctly used, but when Travis CI runs pytest it uses Python 3.6 in all jobs
(https://travis-ci.org/arviz-devs/arviz/jobs/411966731#L1430).
The same behavior can be seen here in the intended Python 3.4 pipeline:
https://travis-ci.org/arviz-devs/arviz/jobs/411966732#L1430
@ahartikainen I am working on adding quick start examples in #95, and would love to include pystan examples.
Do you have any examples of turning a pystan fit into a dataframe? I think xarray is a better answer long term (#4 (comment)), but for a first release, hacking a small function around dataframes seems doable.
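For a first pass, something along these lines might do (a rough sketch only, assuming the usual fit.extract() interface; the name fit_to_dataframe and the crude flattening of array parameters are my own, not a proposed API):

import numpy as np
import pandas as pd

def fit_to_dataframe(fit):
    """Flatten a pystan fit into a DataFrame: one column per scalar parameter, one row per draw."""
    columns = {}
    for name, values in fit.extract().items():   # OrderedDict of draws, draw axis first
        values = np.asarray(values)
        if values.ndim == 1:                      # scalar parameter
            columns[name] = values
        else:                                     # unfold array parameters into scalar columns
            flat = values.reshape(values.shape[0], -1)
            for i in range(flat.shape[1]):
                columns['{}_{}'.format(name, i)] = flat[:, i]
    return pd.DataFrame(columns)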
Currently mcmcplotlib uses gaussian_kde for KDE generation, and this can be quite slow with a large number of points.
There is an FFT-based fastkde function in the faststats package that gives comparable results when there is a large number of points.
faststats is MIT-licensed, so in theory the fastkde function could be copied/implemented into the mcmcplotlib source code, so nothing extra would need to be imported (the function only needs scipy).
faststats: https://github.com/mfouesneau/faststats/tree/master/
fastkde: https://github.com/mfouesneau/faststats/blob/master/faststats/fastkde.py
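For reference, a minimal FFT-based KDE has roughly this shape (my own sketch, not the faststats code): bin the data onto a grid, then smooth with a Gaussian kernel via scipy.signal.fftconvolve, using Scott's rule for the bandwidth.

import numpy as np
from scipy.signal import fftconvolve

def fast_kde(x, gridsize=256):
    """Crude grid-based Gaussian KDE smoothed by FFT convolution."""
    x = np.asarray(x, dtype=float)
    xmin, xmax = x.min(), x.max()
    bw = x.std(ddof=1) * len(x) ** (-0.2)                 # Scott's rule bandwidth
    counts, edges = np.histogram(x, bins=gridsize, range=(xmin, xmax))
    dx = edges[1] - edges[0]
    half_width = max(1, int(3 * bw / dx))                 # kernel support of ~3 bandwidths
    kx = np.arange(-half_width, half_width + 1) * dx
    kernel = np.exp(-0.5 * (kx / bw) ** 2)
    kernel /= kernel.sum()
    density = fftconvolve(counts, kernel, mode='same') / (len(x) * dx)
    grid = edges[:-1] + dx / 2
    return grid, density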
Should we include mc_error as part of summary? Is this still a useful measure?
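For context, the quantity in question is the Monte Carlo standard error. A batch-means version (roughly what pymc3 computes, to my understanding; treat this as a sketch, not the exact implementation) looks like:

import numpy as np

def mc_error(x, batches=5):
    """Monte Carlo standard error of a 1D chain via batch means."""
    x = np.asarray(x)
    if batches == 1:
        return x.std() / np.sqrt(len(x))
    batched = np.resize(x, (batches, x.size // batches))   # split the chain into batches
    means = batched.mean(axis=1)
    return means.std() / np.sqrt(batches)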
Here's the video from SciPy 2015. We are urged to use the following signature:
def my_plotter(ax, data, style):
return artists
https://youtu.be/PmDUx62cTTs?t=762
So there are two things:
p.s. It sounds like there are some changes coming in matplotlib that might make working with plots that have a lot of moving parts (e.g., lines + error bars) easier. https://www.youtube.com/watch?v=PmDUx62cTTs
There have been proposals to use xarray as a common language for pymc3, pystan, and pymc4. This library might be a good place to start, by implementing utility functions for translating pymc3's MultiTrace and pystan's OrderedDict into xarray objects, and then having all plotting functions work with xarrays.
This needs to be done for both the pymc3 and pystan converters; a rough sketch of the idea follows below.
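A minimal converter might look something like this (a sketch of the idea only; the dimension names and the (chain, draw, *shape) layout are assumptions, not a settled schema):

import numpy as np
import xarray as xr

def dict_to_xarray(draws):
    """Turn a dict of arrays shaped (chain, draw, *var_shape) into an xarray.Dataset."""
    data_vars = {}
    for name, values in draws.items():
        values = np.asarray(values)
        extra_dims = ['{}_dim_{}'.format(name, i) for i in range(values.ndim - 2)]
        data_vars[name] = (['chain', 'draw'] + extra_dims, values)
    return xr.Dataset(data_vars)

# e.g. dict_to_xarray({'mu': np.random.randn(4, 500),
#                      'theta': np.random.randn(4, 500, 8)})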
Per request in #154 (comment)
Hey all,
Open-ended issue here. I feel like the ROPE visualization right now would be better if it didn't hide the HPD. See the plot of mu below.
Proposals I have are:
I can provide mockups of any of these if people want.
Alternatively, if this behavior is desired, do nothing, sort of like accepting the null hypothesis, if you know what I mean :) (Sorry, I had to.)
If you add a script to examples/ that includes a plot, the docs will put it onto http://arviz-devs.github.io/arviz/. A few things to note: the az.*plot(...) call is used to title the plot, and a line like _thumb: .5, .5 positions the thumbnail. The second number does nothing; the first number shifts the thumbnail left/right or up/down, depending on which dimension is thinner (the other dimension takes up the full height or width).
I believe this should replace #71 -- we will still need notebooks for any interactive altair plots, but this is a pretty pleasant format. Note also that these serve as a sort of integration test.
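For concreteness, a hypothetical example script might look like this (the title, plotting call, and thumbnail numbers are illustrative, not prescriptive):

"""
KDE Plot
========

_thumb: .5, .5
"""
import numpy as np
import arviz as az

# any figure the script produces is rendered into the docs gallery
data = np.random.normal(size=1000)
az.kdeplot(data)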
Just so that we don't lose this code...
import numba
import numpy as np
from scipy import fftpack
import pymc3 as pm
import theano.tensor as tt


@numba.jit(nopython=True)
def mult_conj(a):
    """Multiply a with conj(a) and store in a.

    a is packed as [Re, Re, Im, Re, Im, ...],
    like the output from scipy.fftpack.rfft.
    """
    n = len(a)
    if n == 0:
        return
    a[0] = a[0] * a[0]
    for i in range(1, (n + 1) // 2):
        j, k = 2 * i - 1, 2 * i
        a[j] = a[j] * a[j] + a[k] * a[k]
        a[k] = 0
    if n > 1 and n % 2 == 0:
        a[n - 1] = a[n - 1] * a[n - 1]


@numba.jit
def autocov2(x):
    """Autocovariance of each row of a 2D array, computed via FFT."""
    assert len(x.shape) == 2
    y = x - np.mean(x, axis=-1, keepdims=True)
    n = y.shape[-1]
    shape = 2 * n - 1
    fft_len = fftpack.next_fast_len(shape)
    fx = fftpack.rfft(y, fft_len, overwrite_x=False)
    for i in range(x.shape[0]):
        mult_conj(fx[i, :])
    out = fftpack.irfft(fx, fft_len, overwrite_x=False)
    out = out[..., :n]
    out[:] /= np.arange(n, 0, -1)
    return out


@numba.jit(nopython=True)
def get_neff_inner(trace_value, acov):
    """Compute the effective sample size for a 2D array of (n_chains, n_samples)."""
    n_chains, n_samples = trace_value.shape
    assert acov.shape == (n_chains, n_samples)
    chain_mean = np.sum(trace_value, 1) / n_samples
    chain_var = acov[:, 0] * n_samples / (n_samples - 1.)
    acov_t = acov[:, 1] * n_samples / (n_samples - 1.)
    mean_var = np.mean(chain_var)
    var_plus = mean_var * (n_samples - 1.) / n_samples
    var_plus += np.var(chain_mean) * n_chains / (n_chains - 1.)

    rho_hat_t = np.zeros(n_samples)
    rho_hat_even = 1.
    rho_hat_t[0] = rho_hat_even
    rho_hat_odd = 1. - (mean_var - np.mean(acov_t)) / var_plus
    rho_hat_t[1] = rho_hat_odd
    max_t = 1
    t = 1
    acov_means = np.sum(acov, 0) / n_chains
    while t < (n_samples - 2) and (rho_hat_even + rho_hat_odd) >= 0.:
        rho_hat_even = 1. - (mean_var - acov_means[t + 1]) / var_plus
        rho_hat_odd = 1. - (mean_var - acov_means[t + 2]) / var_plus
        if (rho_hat_even + rho_hat_odd) >= 0:
            rho_hat_t[t + 1] = rho_hat_even
            rho_hat_t[t + 2] = rho_hat_odd
        max_t = t + 2
        t += 2

    # Geyer initial monotone sequence
    t = 3
    while t <= max_t - 2:
        if (rho_hat_t[t + 1] + rho_hat_t[t + 2]) > (rho_hat_t[t - 1] + rho_hat_t[t]):
            rho_hat_t[t + 1] = (rho_hat_t[t - 1] + rho_hat_t[t]) / 2.
            rho_hat_t[t + 2] = rho_hat_t[t + 1]
        t += 2

    ess = n_chains * n_samples  # was `nchain`, which is undefined in this function
    ess = ess / (-1. + 2. * np.sum(rho_hat_t))
    return ess


@numba.jit()
def get_neff2(x):
    """Compute the effective sample size for a 3D array of (nvals, n_chains, n_samples)."""
    neff = np.zeros(x.shape[0])
    nvals, nchain, n_samples = x.shape
    for i in range(len(x)):
        trace_value = x[i]
        nchain, n_samples = trace_value.shape
        acov = autocov2(trace_value)
        neff[i] = get_neff_inner(trace_value, acov)
    return neff


def effective_n(mtrace, varnames=None, include_transformed=False):
    def generate_neff(trace_values):
        """Trace values is an array of (*varshape, n_chains, n_samples)."""
        x = np.stack([val.T for val in trace_values], axis=-2)
        *varshape, n_chains, n_samples = x.shape
        x = np.ascontiguousarray(x.reshape((-1, n_chains, n_samples)))
        return get_neff2(x).reshape(varshape)

    if not isinstance(mtrace, pm.backends.base.MultiTrace):
        # Return neff for a plain array rather than a MultiTrace
        return generate_neff(mtrace)

    if mtrace.nchains < 2:
        raise ValueError(
            'Calculation of effective sample size requires multiple chains '
            'of the same length.')

    if varnames is None:
        varnames = pm.util.get_default_varnames(
            mtrace.varnames, include_transformed=include_transformed)

    n_eff = {}
    for var in varnames:
        n_eff[var] = generate_neff(mtrace.get_values(var, combine=False))
    return n_eff


with pm.Model() as model:
    a = pm.Normal('a', 0, 1)
    pm.Normal('b', sd=tt.exp(a), shape=1000)
    trace = pm.sample(10000)

%timeit effective_n(trace)
# 1.96 s ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pm.effective_n(trace)
# 25.7 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
On a fresh virtual environment, pip install -e . fails with
Obtaining file:///home/colin/projects/arviz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/home/colin/projects/arviz/setup.py", line 7, in <module>
        from matplotlib import get_configdir
    ModuleNotFoundError: No module named 'matplotlib'
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /home/colin/projects/arviz/
I wonder if we can do some matplotlib magic after installing the requirements, or maybe just on import?
What about adding a bokeh backend for plots?
From time to time this test fails.
Local imports are missing the leading dot (they should be relative imports).
After #180 this will probably be the last plot to get to parity with pymc3 plotting.
This issue is actually more of a question; it is quite possible that we can resolve it without actual code changes. arviz maintains autocorr code in stats.diagnostics._autocorr and in plots.autocorrplot (both are inherited from the pymc3 codebase). The autocorrelation is computed differently in the two places: plots.autocorrplot uses np.correlate(y, y, mode=2), whereas stats.diagnostics.autocorr uses fftconvolve(y, y[::-1]). Is there a reason why plots.autocorrplot does not simply use stats.diagnostics.autocorr under the hood?
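A quick sanity check (my own snippet, not arviz code) suggests the two formulations agree up to floating point error, since correlating y with itself in 'full' mode is the same as convolving y with its reversal:

import numpy as np
from scipy.signal import fftconvolve

y = np.random.normal(size=500)
y = y - y.mean()

direct = np.correlate(y, y, mode=2)            # mode=2 is 'full'
via_fft = fftconvolve(y, y[::-1], mode='full')

print(np.allclose(direct, via_fft))            # True, up to float error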
Currently values are rounded and then printed. A more formal/better way would be to use string formatting. This would not affect the location of the plotted lines.
Example from plot_post:
post_summary['mean'] = round(np.mean(sample), roundto)
...
plt.plot(0, label='mean = {:g}'.format(post_summary['mean']), alpha=0)
to
post_summary['mean'] = np.mean(sample)
...
plt.plot(0, label='mean = {:.{roundto}g}'.format(post_summary['mean'], roundto=roundto), alpha=0)
Simple problem: the pip install in the readme does not work at the moment. It seems to be a problem with the two relative import statements in __init__.py:
from .plots import *
from .stats import *
Could use an example of violintraceplot. Possibly @canyon289?
See http://arviz-devs.github.io/arviz/examples/index.html for other gallery examples, and see, for example, https://github.com/arviz-devs/arviz/blob/master/examples/traceplot.py for where one of them is generated. You can (probably?) build the docs locally with
cd doc/
make html
# on linux:
# google-chrome _build/html/index.html
# on OSX:
open _build/html/index.html
It would be nice to have a good general solution to autoscale the figure size (and fontsize, markersize, etc) as a function of the number of subplots. Our current implementation is not general enough and needs special care for each type of plot.
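Something along these lines is what I have in mind (a heuristic sketch only; the helper name and the constants are made up, not the current implementation):

import numpy as np

def scale_fig_size(num_plots, max_cols=2, base_width=8, row_height=2.5):
    """Heuristic figure/text scaling as a function of the number of subplots."""
    rows = int(np.ceil(num_plots / max_cols))
    figsize = (base_width, row_height * rows)
    textsize = max(8, 14 - rows)            # shrink labels as the grid grows
    linewidth = max(1.0, 2.5 - 0.25 * rows)
    return figsize, textsize, linewidth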
Hey, not sure if there's interest in this or whether this package is actively maintained? I really like the idea of centralizing MCMC plotting tools.
When doing a traceplot, it can help to link the kernel density/histogram to the trace if the histogram/kdeplot closes over the trace. If seaborn is used for the kernel density, flipping it is only a keyword argument. If this option would be useful in this package, I think it should be possible to use ax.fill_betweenx, as sketched below.
There are a lot of other views I've built that might be more useful generally for MCMC and that could take arbitrary sample chains.
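A rough illustration of the fill_betweenx idea (my own sketch, not a proposed API), with the density drawn sideways so it faces the trace panel:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

samples = np.random.normal(size=2000)
grid = np.linspace(samples.min(), samples.max(), 200)
density = gaussian_kde(samples)(grid)

fig, (ax_kde, ax_trace) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
ax_kde.fill_betweenx(grid, 0, density)             # density on x, value on y
ax_kde.invert_xaxis()                              # flip so the KDE faces the trace
ax_trace.plot(np.arange(len(samples)), samples)    # trace: draw index vs value
plt.show()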
I realize that this is undergoing development at the moment, but what is the intended method for constructing a simple traceplot for a PyStan fit object with 2-dimensional array variables? The following line seems to discard variables that are 2-dimensional. I get an error saying that pars is empty if I try to call traceplot directly on the PyStan fit object:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-101-28cd15193e9c> in <module>()
----> 1 arviz.traceplot(samples)
~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/plots/traceplot.py in traceplot(data, var_names, coords, figsize, textsize, lines, combined, kde_kwargs, trace_kwargs)
37 axes : matplotlib axes
38 """
---> 39 data = convert_to_xarray(data)
40
41 if coords is None:
~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in convert_to_xarray(obj, coords, dims, chains)
73 return obj
74 else:
---> 75 return get_converter(obj, coords=coords, dims=dims, chains=chains).to_xarray()
76
77
~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in get_converter(obj, coords, dims, chains)
37 return DictToXarray(obj, coords, dims, chains=chains)
38 elif obj.__class__.__name__ == 'StanFit4Model': # ugly, but doesn't make PyStan a requirement
---> 39 return PyStanToXarray(obj, coords, dims)
40 elif obj.__class__.__name__ == 'MultiTrace': # ugly, but doesn't make PyMC3 a requirement
41 return PyMC3ToXarray(obj, coords, dims)
~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in __init__(self, fit, coords, dims)
389 The coordinates are those passed in and ('chain', 'draw')
390 """
--> 391 super().__init__(fit, coords=coords, dims=dims)
392
393 def to_xarray(self):
~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in __init__(self, obj, coords, dims)
80 self.obj = obj
81 # pylint: disable=assignment-from-none
---> 82 self.varnames, self.coords, self.dims = self.default_varnames_coords_dims(obj, coords, dims)
83 self.verified, self.warning = self.verify_coords_dims()
84
~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/arviz/utils/xarray_utils.py in default_varnames_coords_dims(obj, coords, dims)
451 for varname in varnames:
452 if varname not in dims:
--> 453 vals = obj.extract(varname, permuted=False)[varname]
454 if len(vals.shape) == 1:
455 vals = np.expand_dims(vals, axis=1)
stanfit4template_f63d55caee4d725d95ec10b01b4500a0_1018734232036915966.pyx in stanfit4template_f63d55caee4d725d95ec10b01b4500a0_1018734232036915966.StanFit4Model.extract()
~/anaconda3/envs/stan-dev/lib/python3.6/site-packages/pystan/misc.py in _check_pars(allpars, pars)
805 def _check_pars(allpars, pars):
806 if len(pars) == 0:
--> 807 raise ValueError("No parameter specified (`pars` is empty).")
808 for par in pars:
809 if par not in allpars:
ValueError: No parameter specified (`pars` is empty).
Has the final shape of the expected xarray been settled? When calling traceplot, can you still specify variables using the names of vector/matrix parameters, or do you need to specify them with the flatnames (i.e. unfolded array variables, scalars with names like X_dims_1_2)?
Looks like each travis task deploys docs to GitHub. It still all goes smoothly, so no big deal, but it'd be nice to fix. Fixing this should be an if statement, possibly in bash, somewhere.
It would be nice to let as many of the plots work with numpy arrays as possible, and to have generic constructors that the pymc3 and pystan converters can use.
I'm working on this now.
Probably a little bit too early, but it is nice to have a logo. What do you think? @twiecki @ahartikainen
This logo got 3 💯 on PyMC3's Slack channel, and I did not vote! 😄
Is it intentional to have a different traceplot from pymc3? Each element of an RV with shape > 1 gets its own plot in arviz, while in pymc3 they all share the same plot.
Explicitly, with eight schools (I only captured the first few plots from arviz):
import numpy as np
import pymc3 as pm

J = 8
y = np.array([28, 8, -3, 7, -1, 1, 18, 12])
sigma = np.array([15, 10, 16, 11, 9, 11, 10, 18])

with pm.Model() as hierarchical:
    eta = pm.Normal('eta', 0, 1, shape=J)
    mu = pm.Normal('mu', 0, sd=1e6)
    tau = pm.HalfCauchy('tau', 5)
    theta = pm.Deterministic('theta', mu + tau*eta)
    obs = pm.Normal('obs', theta, sd=sigma, observed=y)
    trace_h = pm.sample(1000)
I think a gallery of available plots would make this library much easier to use, possibly using https://sphinx-gallery.readthedocs.io/en/latest/ . The goal would be at least one working example of each plot.
In order to get to feature parity with pymc3 plotting, we need to have a way to access sampler statistics (specifically, to access divergences), and ppc samples. @aseyboldt outlined a good way to think about the rest of the schema here.
At the same time, xarray supports groups, but it doesn't look like it does so natively (yet? see the discussion at pydata/xarray#1092).
I am proposing something like
import xarray as xr
import netCDF4 as nc

class Trace(object):
    def __init__(self, filename):
        self.filename = filename
        self.data = nc.Dataset(filename)
        self.groups = self.data.groups

    def __getattr__(self, name):
        if name in self.groups:
            return xr.open_dataset(self.filename, group=name)
        raise AttributeError("informative message")

    def __dir__(self):
        """Allows for tab completion on netCDF group names"""
        return super(Trace, self).__dir__() + list(self.groups.keys())
This is a pretty light wrapper around netCDF and xarray. Usage is something like
t = Trace('mytrace.nc')
t.posterior # this is an xarray.Dataset
t.posterior.mu.mean() # calculate the mean of a variable
I think this will have to change a little bit so that nested groups work fine. In particular, something like
t = Trace('mytrace.nc')
t.sampler_stats.divergences # should return an xarray.Dataset
t.sampler_stats # I think this would tend to return an empty xarray.Dataset
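For completeness, the file itself could be written group by group with xarray, since Dataset.to_netcdf accepts a group argument and an append mode (a sketch; the variable names just mirror the usage above):

import numpy as np
import xarray as xr

posterior = xr.Dataset({'mu': (('chain', 'draw'), np.random.randn(4, 500))})
sampler_stats = xr.Dataset({'divergences': (('chain', 'draw'),
                                            np.zeros((4, 500), dtype=bool))})

posterior.to_netcdf('mytrace.nc', group='posterior', mode='w')
sampler_stats.to_netcdf('mytrace.nc', group='sampler_stats', mode='a')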
Since DataFrames are simply indexed and labelled numpy arrays (from a memory and performance point of view, at least), why don't we use DataFrames as the interface between the tracing library and plotting?
E.g. the code might look like this:

import pymc3 as pm
import mcmcplotlib as mcp

with pm.Model() as model:
    ...
    trace = pm.sample(2000, ...)

trace_df = trace.as_df(varnames=['Intercept', 'x', 'sigma'])
mcp.plot_trace(trace_df, varnames=['x', 'sigma'])  # e.g. we are not interested in plotting the intercept

That way mcmcplotlib doesn't have to know about library-specific trace object(s), and we don't have to pass a list of numpy arrays plus a list of varnames for plotting.
I believe it is a safe assumption that most users of pymc, pystan and emcee have pandas installed.
I just wanted to make sure folks saw the very nice plots being generated (in R) in bayesplot:
https://github.com/stan-dev/bayesplot
Here's one: [example bayesplot figure]
I think it would be great to have that as a reference and quick overview for new users of what arviz can do.
This mostly means removing f-strings throughout. I may play around with the Travis build, too, to try to balance thoroughness and test time.
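For anyone picking this up, the conversion for the older Python versions being targeted looks like (a trivial illustration, with a made-up variable):

mean = 0.123

# Python 3.6+ only:
label = f"mean = {mean:.2f}"

# Python 3.5-compatible equivalent:
label = "mean = {:.2f}".format(mean)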
For an example, see figure 9 in https://arxiv.org/pdf/1709.01449.pdf
Reference usage in bayesplot: http://mc-stan.org/bayesplot/reference/PPC-loo.html
Related code in bayesplot: https://github.com/stan-dev/bayesplot/blob/master/R/ppc-loo.R
Computation of loo_pit:
https://github.com/stan-dev/rstantools/blob/fd84944454f8f0fe1ad9d456211f73a24feaa0ef/R/loo-functions.R#L64-L74
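Roughly, following the rstantools reference above, the computation amounts to an importance-weighted CDF evaluation (a sketch; the array shapes and function name are assumptions):

import numpy as np

def loo_pit(y, y_rep, log_weights):
    """LOO-PIT sketch: P(y_rep <= y) under PSIS importance weights.

    y: (n_obs,); y_rep and log_weights: (n_draws, n_obs).
    """
    weights = np.exp(log_weights - log_weights.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)
    return np.sum(weights * (y_rep <= y), axis=0)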
They should probably be marked with a system name and a 32/64 flag for floating point. As an alternative, maybe just a try/except, and rebuild the pickles if there's a problem? This would be easier and not very time-consuming (the pickles are mostly for Travis' benefit).
Even though everything will eventually work with netCDF files, this matters because we need to make sure pystan/pymc3 objects convert correctly.
There are a few places where we would like to use arviz.kdeplot if a variable is continuous, and a histogram if not (traceplots, for example).
This might involve implementing a histplot first and then having distplot just inspect the dtype, as sketched below.
(Happy if someone wants to change the names of any of these...)
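A self-contained sketch of the dispatch described above (the function name distplot and the details are placeholders, not the final arviz API):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def distplot(values, ax=None):
    """Histogram for integer (discrete) data, KDE for continuous data."""
    values = np.asarray(values)
    if ax is None:
        _, ax = plt.subplots()
    if values.dtype.kind == 'i':
        # discrete variable: one bar per integer value
        bins = np.arange(values.min(), values.max() + 2) - 0.5
        ax.hist(values, bins=bins)
    else:
        # continuous variable: kernel density estimate
        grid = np.linspace(values.min(), values.max(), 200)
        ax.plot(grid, gaussian_kde(values)(grid))
    return ax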
Hello everyone,
I just noticed that trace_to_dataframe may fail on entirely valid pymc3.MultiTrace objects. This is due to the following code: if type(trace).__name__ == "MultiTrace".
I get the intention; I believe one wants to avoid a dependency on pymc3.
However, if I import pymc3 as import pymc3 as pm and then use pm.backends.base.MultiTrace to construct a trace, I will get an error, as type(trace).__name__ will return "pm.backends.base.MultiTrace".
I suggest the following approach:
is_multitrace = False
try:
    from pymc3.backends.base import MultiTrace
except ImportError:
    pass  # this still avoids a hard dependency on pymc3
else:
    is_multitrace = isinstance(trace, MultiTrace)
This avoids a dependency and handles arbitrary ways to import MultiTrace.
I have code ready for this, if you agree with me on the way forward I'm willing to provide a pull request.
Best,
MFreidank
Hi,
I suggest that we refactor the interface to have a common data structure. I created this issue so we could start working on these.
In pystan we get an OrderedDict with
odict = fit.extract()
and with pymc3:
from collections import OrderedDict

odict = OrderedDict()
for name in trace.varnames:
    odict[name] = trace.get_values(name)
These have a common structure when considering only the data: the sample size is on the first axis.
Most of the plots do not need the chain number etc. (excluding forestplot and similar). I have tested this with the needed modifications, and it works with results from both libraries.
The other "problems" are logp (not a problem) and the possibility to extract priors and other information.
Both pystan and pymc3 can evaluate logp with a function call, so these should be handled.
Also, there are functions from pymc3 that need to be handled.
P.S. Some naming conventions to settle:
trace vs traces / MultiTrace
burn-in vs warm-up