probcomp / bdbcontrib
BayesDB contributions, including plotting, helper methods, and examples
Home Page: http://probcomp.csail.mit.edu/bayesdb
License: Apache License 2.0
It does appear to work, and may not actually require any code changes in bdbcontrib, just changes in the notebook examples.
If the examples require mpld3, then we should depend on it for pip install and include it in osx etc.
Feature request: include an option to caption a plot with the BQL query (e.g. --show-caption). For instance,
.heatmap 'ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM malawi' --last-sort
would have ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM malawi
as a caption. Long captions would be cut off with an ellipsis after 100 characters.
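A minimal sketch of the truncation rule described above. The helper name make_caption and its parameters are hypothetical, not part of bdbcontrib; only the 100-character limit comes from the request.

```python
def make_caption(bql, max_chars=100, ellipsis="..."):
    """Return the query as a one-line caption, truncated to max_chars."""
    # Collapse runs of whitespace so multi-line queries fit on one line.
    text = " ".join(bql.split())
    if len(text) <= max_chars:
        return text
    # Keep the first max_chars characters, then mark the cut.
    return text[:max_chars] + ellipsis


short_caption = make_caption(
    "ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM malawi")
long_query = "SELECT " + ", ".join("col%d" % i for i in range(50)) + " FROM t"
long_caption = make_caption(long_query)
```

A plotting command could pass the result to the figure title or to a text annotation below the axes, whichever the plotting code prefers.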
Are a dozen warnings necessary to indicate that there is no shortname for a column?
bayeslite> .ccstate states_cc 15
Warning: No shortname found for illiteracy. Using column name.
Warning: No shortname found for geo. Using column name.
Warning: No shortname found for lifeexp. Using column name.
Warning: No shortname found for livealone. Using column name.
Warning: No shortname found for murder. Using column name.
Warning: No shortname found for gdp. Using column name.
Warning: No shortname found for hsgrad. Using column name.
Warning: No shortname found for minority. Using column name.
Warning: No shortname found for rape. Using column name.
Warning: No shortname found for frost. Using column name.
Warning: No shortname found for assault. Using column name.
Warning: No shortname found for income. Using column name.
Warning: No shortname found for population. Using column name.
Warning: No shortname found for divorce. Using column name.
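One way to cut the noise, sketched under assumptions: collect every column without a shortname and report them in a single line. The function name, its arguments, and the message wording are hypothetical; the real lookup in bdbcontrib may be structured differently.

```python
import io
import sys


def warn_missing_shortnames(columns, shortnames, stream=sys.stderr):
    """Emit one warning listing every column that has no shortname."""
    missing = [c for c in columns if c not in shortnames]
    if missing:
        stream.write(
            "Warning: No shortname found for %d column(s): %s. "
            "Using column names.\n" % (len(missing), ", ".join(missing)))
    return missing


# Example with a made-up shortname table; output is captured for inspection.
buf = io.StringIO()
missing = warn_missing_shortnames(
    ["illiteracy", "geo", "gdp"], {"gdp": "GDP"}, stream=buf)
```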
.hook /absolute/path/to/contrib.py
is outdated after breaking up the files as suggested in (#2)
To reproduce:
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ bayeslite -m
No such file or directory /home/fsaad/Documents/pcp/bdbcontrib/bdbcontrib/contrib.py
Can we produce an empty data frame rather than an error?
Can you make font-size adjustable as a command line flag for all plot commands in bdbcontrib? This is a real issue and its absence (+ the absence of useful defaults or a "--screen" and "--pdf" mode) should be considered a bug in the plotting code. Otherwise there's a valley of death between "iterating yourself" and "showing it to anyone else".
Performing .show 'SELECT inferred_orbit_type, inferred_orbit_type_conf, class_of_orbit FROM inferred_orbit;' --colorby class_of_orbit --filename ../build/satellites.tmp/fig_1.png
Traceback (most recent call last):
File "/home/riastradh/bayesdb/master/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 57, in __call__
return self.func(*args)
File "../../hooks/hook_plots.py", line 163, in pairplot
show_full=args.show_full)
File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/__init__.py", line 70, in pairplot
return pairplot(*args, **kwargs)
File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 153, in pairplot
show_full=show_full)
File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 772, in _pairplot
bdb=bdb, generator_name=generator_name, colors=colors)
File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 408, in do_hist
bins = _safer_freedman_diaconis_bins(data_srs)
File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 362, in _safer_freedman_diaconis_bins
h = 2 * iqr(a) / (len(a) ** (1 / 3))
File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/seaborn/utils.py", line 347, in iqr
q3 = stats.scoreatpercentile(a, 75)
File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/scipy/stats/stats.py", line 1522, in scoreatpercentile
return _compute_qth_percentile(sorted, per, interpolation_method, axis)
File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/scipy/stats/stats.py", line 1565, in _compute_qth_percentile
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
TypeError: can't multiply sequence by non-int of type 'float'
The operator_owner field of satellites is large in this sense.
It may also help for the schema to accept the size of categorical as a parameter (but: what to do if that's wrong? treat it as an upper bound?)
It would probably also help for GUESS(*) to surface the sizes of the categoricals.
For example
.show 'SELECT Expected_Lifetime, dry_mass_kg, class_of_orbit, p_lifetime FROM predprob_life' --colorby Class_of_Orbit
works, but
.show 'SELECT Expected_Lifetime, dry_mass_kg, class_of_orbit, p_lifetime FROM predprob_life' --colorby class_of_orbit
raises a KeyError
from pandas.
It makes sense that if column selection is case-insensitive, the argument to --colorby should be case-insensitive as well.
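A small sketch of case-insensitive resolution, assuming the data frame keeps the table's original capitalization (which would explain why only Class_of_Orbit works). resolve_column is a hypothetical helper operating on a plain list of column names, so it does not depend on pandas.

```python
def resolve_column(columns, name):
    """Return the entry of `columns` that matches `name`, ignoring case."""
    matches = [c for c in columns if c.lower() == name.lower()]
    if not matches:
        raise KeyError("no column named %r; available: %s"
                       % (name, ", ".join(columns)))
    if len(matches) > 1:
        # Two columns differing only by case cannot be told apart.
        raise KeyError("column name %r is ambiguous: %s"
                       % (name, ", ".join(matches)))
    return matches[0]


columns = ["Expected_Lifetime", "dry_mass_kg", "Class_of_Orbit", "p_lifetime"]
colorby = resolve_column(columns, "class_of_orbit")
```

The resolved name can then be used to index the data frame, so a lookup that differs only in case no longer raises.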
the issue is in
There is a note in the documentation of plt.subplot (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot) which might explain the issue:
"Creating a new subplot with a position which is entirely inside a pre-existing axes will trigger the larger axes to be deleted"
Needs at least crud in setup.py, license, and testing.
.readtohtml saves figures by adding a --filename argument to the command. .chainplot only supports positional arguments.
When entering ".show" without arguments, the bayeslite shell freaks out and terminates.
to reproduce:
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ bayeslite -m
added command ".register_bql_math_functions"
added command ".readtohtml"
added command ".nullify"
added command ".heatmap"
added command ".show"
added command ".ccstate"
added command ".histogram"
added command ".bar"
added command ".chainplot"
Welcome to the Bayeslite shell.
Type `.help' for help.
bayeslite> .show
usage: bayeslite [-h] [-f FILENAME] [-g GENERATOR] [-s] [--no-contour] [-m]
[--colorby COLORBY]
bql [bql ...]
bayeslite: error: too few arguments
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ <--- bayeslite exits here --->
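The usage message above looks like argparse output. Assuming the hook parses its arguments with argparse, the exit is explained by the parser calling sys.exit on a usage error, which raises SystemExit in the shell's own process. A sketch of a guard follows; parse_or_none and the example parser are hypothetical, not the actual hook code.

```python
import argparse


def parse_or_none(parser, argv):
    """Parse argv; return None instead of exiting on a usage error."""
    try:
        return parser.parse_args(argv)
    except SystemExit:
        # argparse has already printed the usage message to stderr;
        # swallow the exit so the calling shell keeps running.
        return None


parser = argparse.ArgumentParser(prog=".show")
parser.add_argument("bql", nargs="+")
parser.add_argument("--colorby")

bad = parse_or_none(parser, [])  # usage error, but the process survives
good = parse_or_none(parser, ["SELECT * FROM t", "--colorby", "x"])
```

A hook using this pattern would return early when the result is None, leaving the user at the bayeslite prompt with the usage text on screen.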
The plugin model defined in https://github.com/mit-probabilistic-computing-project/bdbcontrib/blob/fsaad-foreign-predictor/src/foreign/sat_orbital_mech.py is only able to evaluate Kepler's Laws in one direction, computing period given apogee and perigee. The same model cannot be used to reverse the computation, and so improved results are delivered only for specific inferences which have available apogee and perigee values.
The foreign predictor composer currently takes O(|V| (|V| + |E|)) time to topologically sort its constituents, and then just says that there is a cycle if it finds one without saying anything about the cycle.
We can compute a topological sort and report where the cycle was, if we found one, in linear time using Tarjan's algorithm[1] to compute the strongly-connected components of a directed graph in topological order: if any components have size >1, those are cycles and we can report them in the error message.
[1] https://en.wikipedia.org/wiki/Tarjan's_strongly_connected_components_algorithm
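A sketch of the proposal, not the composer's code: Tarjan's algorithm over a graph given as a dict from node to successors. Tarjan's algorithm emits components in reverse topological order, so the result is reversed at the end. The function names and the dict representation are assumptions for illustration; the recursive form is fine for small graphs but would hit Python's recursion limit on very deep ones.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. `graph` maps node -> iterable of successors.

    Components are returned sinks-first (reverse topological order).
    """
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def topological_order(graph):
    """Return nodes with each node before its successors, or raise
    ValueError naming the nodes involved in each cycle."""
    components = strongly_connected_components(graph)
    # A cycle is a component of size > 1, or a node with a self-loop.
    cycles = [sorted(c) for c in components
              if len(c) > 1 or c[0] in graph.get(c[0], ())]
    if cycles:
        raise ValueError("cycle(s) among: %s" % cycles)
    return [c[0] for c in reversed(components)]


order = topological_order({"a": ["b", "c"], "b": ["c"], "c": []})
try:
    topological_order({"x": ["y"], "y": ["z"], "z": ["x"]})
    cycle_message = None
except ValueError as error:
    cycle_message = str(error)
```

Both the sort and the cycle report come out of a single O(|V| + |E|) pass.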
From @vkmvkmvkmvkm:
Needed to ask Feras why an obscure indexing error in barplot was happening. Answer was I needed 2 columns from SELECT output (which makes perfect sense in hindsight). Also iterating was difficult (needed to ask Taylor about the bdb.savepoint() business b/c of annoying "table already exists"...), and finally, can't collapse into single SIMULATE due to surface syntax limitations (this should be signposted in an error with a TBD: properly re-embed SQL)
_ = bdbcontrib.barplot(satellites_bdb, '''
SIMULATE purpose FROM satellites_cc
GIVEN country_of_operator = "China (PR)"
LIMIT 1000;
''')
<...> /Volumes/BayesDB/BayesDB-bayes0.1rc1-1-gd96fe22.app/Contents/MacOS/venv/lib/python2.7/site-packages/matplotlib/backends/backend_agg.pyc in __init__(self, width, height, dpi)
92 self.height = height
93 if debug: verbose.report('RendererAgg.__init__ width=%s, height=%s'%(width, height), 'debug-annoying')
---> 94 self._renderer = _RendererAgg(int(width), int(height), dpi, debug=False)
95 self._filter_renderers = []
96
ValueError: width and height must each be below 32768
https://github.com/mwaskom/seaborn/blob/master/seaborn/distributions.py line 27 has
h = 2 * iqr(a) / (len(a) ** (1 / 3))
And iqr (inter-quartile range) can be zero for skewed distributions that are otherwise plottable.
It should fall back to max(a) - min(a) but doesn't.
I don't think I can fix that from outside.
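A sketch of the fallback described above, written without seaborn or scipy so it can be applied on the bdbcontrib side. The function names are hypothetical and this is not the existing _safer_freedman_diaconis_bins; it uses the range max(a) - min(a) in place of the IQR whenever the IQR is zero.

```python
import math


def percentile(sorted_values, q):
    """Linear-interpolation percentile (0 <= q <= 100) of sorted data."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(math.floor(position))
    upper = int(math.ceil(position))
    fraction = position - lower
    return (sorted_values[lower] * (1 - fraction)
            + sorted_values[upper] * fraction)


def fd_bins_with_fallback(values):
    """Freedman-Diaconis bin count; uses the range when the IQR is zero."""
    data = sorted(float(v) for v in values)
    if len(data) < 2:
        return 1
    spread = data[-1] - data[0]
    if spread == 0:
        return 1  # all values identical: a single bin is all we can draw
    iqr = percentile(data, 75) - percentile(data, 25)
    scale = iqr if iqr > 0 else spread  # the fallback step
    width = 2.0 * scale / len(data) ** (1.0 / 3.0)
    return max(1, int(math.ceil(spread / width)))


skewed = [0] * 9 + [10]  # IQR is zero, but the data is still plottable
skewed_bins = fd_bins_with_fallback(skewed)
uniform_bins = fd_bins_with_fallback(range(100))
```

With the range as the scale, the bin count reduces to roughly half the cube root of the sample size, which is coarse but never divides by zero.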
In the actual data, the relationship is nearly deterministic in one direction:
select operator_owner, count(distinct country_of_operator) as ct
from satellites
group by operator_owner
order by ct desc
limit 1
says "Ministry of Defense, 3", but crosscat almost uniformly decides not to cluster them. @vkmvkmvkmvkm says this is likely due to operator_owner being a large (346 items) categorical, thus messing everything up.
bayeslite> .mihist
usage: bayeslite [-h] [-f FILENAME] [-n NUM_SAMPLES] [-b BINS]
generator col1 col2
bayeslite: error: too few arguments
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ < -- EXITED -->
Steps I foresee on the path:
EZMetamodel or something (in IBayesDBForeignPredictor{Factory,}
register_foreign_predictor to take a metamodel object
predict_confidence calls the predictor's simulate and logpdf
_weighted_sample calls the predictor's simulate and logpdf
create_generator calls
drop_generator calls
initialize_models to call sub-initializations
instantiate from the EZMetamodel?
drop_models to call sub-drops
According to @axch
This is an outstanding task from #58. Stuff we might want to automate
One file per set of related commands.
From @LuaC:
importing bdbcontrib takes a while?
...and manage it with vendor branches like we do for lemonade, plex, &c., in bayeslite.
The .bar command only displays the first half of the rows in the query. I reckon it's an xlim issue; the rightmost bar is halved when the number of rows in the output is odd.
[I 11:25:06.744 NotebookApp] Kernel started: b9215e58-990d-4371-8aaf-d763eb5e639f
/Applications/Bayeslite-bayes0.1+unknown.app/Contents/MacOS/venv/bin/python2.7: No module named ipykernel
However, my sys.path when run within the venv does not include anything from the app, so this might be an issue with my system configuration of virtualenv?
The composer uses generic simple Monte Carlo estimates (likelihood weighting) of various information theoretic quantities required to implement BQL. The advantage of this approach is that the composer can answer ad-hoc queries with arbitrary target and constrained nodes in the DAG without knowing the internals of its constituent GPMs. The downside is that some implementations are slow. This issue outlines key concerns on a method-by-method basis, with approximate complexity. There will have to be design decisions before releasing the code into the wild.
Currently the composer takes in an n_samples parameter to control the accuracy/time of each estimate. A future interface will make each query customizable through the API or BQL.
No major concerns.
No major concerns. One topological sort of the DAG is performed, using an adjacency list representation, roughly O(nm) ~ O(n^3) for a dense graph, but hardly ever a problem unless one has an unusually large number of FPs.
No major concerns. For large tables, dropping the internal crosscat metamodel has empirically been shown to take non-negligible time, which the composer cannot change.
Runs initialize for crosscat (can be slow for large datasets).
Runs create and serialize for each foreign predictor (scales with train time of FP), then inserts the binary into the sql database.
TODO.
No joint inference, just crosscat analysis.
Simple graph walk in the DAG, roughly O(E). Currently we don't cache intermediary results in the recursion; this might be necessary for a large number of columns.
Super expensive. For simulate n_samples, we need to invoke _weighted_samples roughly n_samples^2 times: the weighted sampler is approximate, and we need n_samples weighted samples to get one approximate sample from the posterior. We then invoke _joint_pdf four times.
Super expensive. We need to compute the partition function (likelihood of the evidence constraints). One possible solution is to kill the computation of the evidence (2x speedup) and only return unnormalized values for continuous values, since densities are mostly useful for comparison.
Note that there are no known algorithms for reusing the samples for QY and Y.
Might be expensive. For a child node, we need to impute all the missing parents, which for continuous values is typically slow. For predicting a column modeled by a foreign predictor, we need to invoke the simulate.
Expensive. Because the sampler is approximate, we need a large number of weighted samples for 1 approximate sample to return (empirically, 1 appx sample needs ~200 weighted samples).
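To make the cost concrete, here is a toy sketch of likelihood weighting followed by resampling. It is not the composer's implementation: the function names, the two-node toy model, and its probabilities are all made up for illustration. Each approximate posterior sample consumes n weighted samples, which is where the quadratic cost comes from.

```python
import math
import random


def weighted_samples(simulate_prior, logpdf_evidence, n, rng):
    """Draw n prior samples, each weighted by the evidence likelihood."""
    samples = [simulate_prior(rng) for _ in range(n)]
    weights = [math.exp(logpdf_evidence(s)) for s in samples]
    return samples, weights


def approximate_posterior_sample(simulate_prior, logpdf_evidence, n, rng):
    """Resample one value in proportion to its weight."""
    samples, weights = weighted_samples(
        simulate_prior, logpdf_evidence, n, rng)
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; evidence is too unlikely")
    threshold = rng.random() * total
    running = 0.0
    for sample, weight in zip(samples, weights):
        running += weight
        if running > threshold:
            return sample
    return samples[-1]


# Toy model (made up): parent x is 0 or 1 with equal prior probability;
# the observed child has likelihood 0.9 under x = 1 and 0.2 under x = 0.
def simulate_prior(rng):
    return 1 if rng.random() < 0.5 else 0


def logpdf_evidence(x):
    return math.log(0.9 if x == 1 else 0.2)


rng = random.Random(0)
draws = [approximate_posterior_sample(simulate_prior, logpdf_evidence, 200, rng)
         for _ in range(500)]
fraction_of_ones = sum(draws) / float(len(draws))
```

The exact posterior probability of x = 1 in this toy model is 0.45 / 0.55, about 0.82, so fraction_of_ones should land near that value; the 500 draws cost 100,000 prior simulations.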
Delegates to crosscat.
Delegates to column_value_probability. I have issues with the query, see comment in the code.
Deserialized FP binaries are cached in memory per-bdb session, rather than loaded from the database on-query demand. I do not anticipate this caching to cause any noticeable overhead.
Figure (pun intended) out a way to beautify the shape and size of plots returned by the API plotting utilities such that IPython notebook renders them nicely.
This may reflect my own relative lack of experience with databases relative to the expected user base, but including an example of e.g.:
q('drop table if exists satellite_purpose')
prior to creating a new table might save certain users some time in figuring out how to use the tool.
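A runnable illustration of why the drop helps, using the standard sqlite3 module as a stand-in for a bdb connection (the q helper from the notebook is not shown in this report, so it is not reproduced here). The table columns and the inserted row are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def rebuild_table(conn):
    # Dropping first makes the cell safe to re-run: without it, the
    # second CREATE TABLE fails because the table already exists.
    conn.execute("DROP TABLE IF EXISTS satellite_purpose")
    conn.execute("CREATE TABLE satellite_purpose (name TEXT, purpose TEXT)")
    conn.execute("INSERT INTO satellite_purpose VALUES (?, ?)",
                 ("demo", "Communications"))


rebuild_table(conn)
rebuild_table(conn)  # re-running does not raise
row_count = conn.execute(
    "SELECT COUNT(*) FROM satellite_purpose").fetchone()[0]
```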
@tibbetts began this on this branch: 20150910-tibbetts-sessions. Currently sessions are being stored into the database in table "bayesdb_session_entries".
Steps:
For some reason it now lists every available value with every color, instead of indicating which value corresponds to which color.
I think this happened when @gregory-marton and I were cleaning gen_collapsed_legend_from_dict, but we were rushed and/or didn't notice at the time.
bayeslite> .show 'ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE COLUMNS OF malawi'
Traceback (most recent call last):
File "/home/fsaad/pcp/bayeslite/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 57, in __call__
return self.func(*args)
File "/home/fsaad/Documents/pcp/bdbcontrib/hooks/hook_plots.py", line 163, in pairplot
show_full=args.show_full)
File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 154, in pairplot
show_full=show_full)
File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 757, in _pairplot
colors=colors)
File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 569, in do_pair_plot
ax = DO_PLOT_FUNC[hash(vartypes)](plot_df, vartypes, **kwargs)
File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 484, in do_violinplot
color='SteelBlue')
File "/usr/local/lib/python2.7/dist-packages/seaborn/distributions.py", line 336, in violinplot
vals, xlabel, ylabel, names = _box_reshape(vals, groupby, names, order)
File "/usr/local/lib/python2.7/dist-packages/seaborn/distributions.py", line 100, in _box_reshape
vals = [np.asarray(a, np.float) for a in vals]
File "/usr/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: zygosity
could not convert string to float: zygosity
so it's machine readable. This is a residual task from #58.
Example: The following query would result in a printed line without the "_ = " prefix:
_ = do_query(satellites_bdb, '''
CREATE TEMP TABLE inferred_orbit AS
INFER EXPLICIT
anticipated_lifetime, perigee_km, period_minutes, class_of_orbit,
PREDICT type_of_orbit AS inferred_orbit_type
CONFIDENCE inferred_orbit_type_conf
FROM satellites_cc
WHERE type_of_orbit IS NULL;
''')
We once had the ISS shown as weirdest by expected_lifetime; some time later we documented Sicron 1A as the weirdest; now it is not shown as weirdest.
(a) We need to determine how to assess the stability of phenomena for our demos.
(b) We need to find stable phenomena for our demos.
(c) We need to automatically test these in our demos.
For example:
bayeslite> .show 'SELECT whz, muac, agemonths, diarrhea, fever, cough, diarrhea, vomiting, site FROM malawi;'
Traceback (most recent call last):
File "/home/fsaad/pcp/bayeslite/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 50, in __call__
return self.func(*args)
File "/home/fsaad/Documents/pcp/bdbcontrib/bdbcontrib/contrib_plot.py", line 223, in pairplot
colorby=args.colorby, show_missing=args.show_missing)
File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 466, in pairplot
generator_name=generator_name)
File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 116, in get_bayesdb_col_type
return guess_column_type(df_column)
File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 88, in guess_column_type
pd_type = df_column.dtype
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2150, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'dtype'
'DataFrame' object has no attribute 'dtype'
We should at least make it fail gracefully, with an informative error message.
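The query above selects diarrhea twice. Assuming that is the trigger (indexing a pandas data frame by a repeated column name yields a DataFrame rather than a Series, which has no dtype attribute), a check like the following could run before plotting. check_unique_columns and its message are hypothetical.

```python
from collections import Counter


def check_unique_columns(columns):
    """Raise a readable error if any column name is repeated."""
    counts = Counter(name.lower() for name in columns)
    repeated = sorted(name for name, n in counts.items() if n > 1)
    if repeated:
        raise ValueError(
            "Cannot plot: column(s) selected more than once: %s. "
            "Remove the duplicates or rename them with AS."
            % ", ".join(repeated))


selected = ["whz", "muac", "agemonths", "diarrhea", "fever", "cough",
            "diarrhea", "vomiting", "site"]
try:
    check_unique_columns(selected)
    message = None
except ValueError as error:
    message = str(error)
```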
Here is a report of the bug in IPython. Something worth doing is removing the tight_layout=True setting.
I imagine that Sphinx's automodule honors the prefix underscore convention; failing that, we could use autofunction instead to cherry-pick documented functions.
What is our bdbcontrib documentation strategy in general?