probcomp / bdbcontrib

BayesDB contributions, including plotting, helper methods, and examples

Home Page: http://probcomp.csail.mit.edu/bayesdb

License: Apache License 2.0

Python 87.10% CSS 0.15% Shell 0.34% Makefile 0.43% Jupyter Notebook 11.98%

bdbcontrib's People

jayelm jostheim asilversempirical vineyk24 alxempirical vishalbelsare

bdbcontrib's Issues

Make plots prettier in notebooks using mpld3

It does appear to work, and may not actually require any code changes in bdbcontrib, just changes in the notebook examples.

http://mpld3.github.io/

If the examples require mpld3, then we should depend on it for pip install and include it in osx etc.

Introduce caption in plot commands of the BQL query.

Feature request: include an option to caption a plot with the BQL query (i.e. --show-caption). For instance

.heatmap 'ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM malawi' --last-sort

would have ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM malawi as a caption. Long captions would have ellipses after 100 chars.
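A minimal sketch of the truncation rule described above, assuming the caption is simply the query text collapsed onto one line. The helper name `make_caption` and its parameters are hypothetical and not part of bdbcontrib.

```python
def make_caption(bql, max_chars=100, ellipsis="..."):
    """Return the query as a one-line caption, truncated after max_chars.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    # Collapse newlines and runs of whitespace so multi-line queries fit.
    text = " ".join(bql.split())
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + ellipsis


print(make_caption("ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM malawi"))
```

The returned string could then be handed to whatever title or figure-text call the plotting code already uses.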

Many warnings when no shortnames are used

Are a dozen warnings necessary to indicate that there is no shortname for a column?

bayeslite> .ccstate states_cc 15
Warning: No shortname found for illiteracy. Using column name.
Warning: No shortname found for geo. Using column name.
Warning: No shortname found for lifeexp. Using column name.
Warning: No shortname found for livealone. Using column name.
Warning: No shortname found for murder. Using column name.
Warning: No shortname found for gdp. Using column name.
Warning: No shortname found for hsgrad. Using column name.
Warning: No shortname found for minority. Using column name.
Warning: No shortname found for rape. Using column name.
Warning: No shortname found for frost. Using column name.
Warning: No shortname found for assault. Using column name.
Warning: No shortname found for income. Using column name.
Warning: No shortname found for population. Using column name.
Warning: No shortname found for divorce. Using column name.
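One possible way to cut the noise is to collect the columns without a shortname and emit a single warning. This is only a sketch: `resolve_shortnames` and the `shortnames` dict are hypothetical stand-ins for however bdbcontrib actually looks shortnames up.

```python
import warnings


def resolve_shortnames(columns, shortnames):
    """Map each column to its shortname, falling back to the column name.

    Hypothetical helper: emits one warning listing every column that has
    no shortname instead of one warning per column.
    """
    missing = [c for c in columns if c not in shortnames]
    if missing:
        warnings.warn(
            "No shortname found for %d column(s): %s. Using column names."
            % (len(missing), ", ".join(missing)))
    return [shortnames.get(c, c) for c in columns]
```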

README should be updated since "contrib.py" no longer exists

.hook /absolute/path/to/contrib.py

is outdated after breaking up the files as suggested in (#2)

To reproduce:
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ bayeslite -m
No such file or directory /home/fsaad/Documents/pcp/bdbcontrib/bdbcontrib/contrib.py

cursor_to_df should handle empty cursors

Can we produce an empty data frame rather than an error?

Make font size for all plotting commands adjustable

@vkmvkmvkmvkm

Can you make font-size adjustable as a command line flag for all plot commands in bdbcontrib? This is a real issue and its absence (+ the absence of useful defaults or a "--screen" and "--pdf" mode) should be considered a bug in the plotting code. Otherwise there's a valley of death between "iterating yourself" and "showing it to anyone else".

freedman diaconis binning is still broken (maybe just for categoricals?)

Performing .show 'SELECT inferred_orbit_type, inferred_orbit_type_conf, class_of_orbit FROM inferred_orbit;' --colorby class_of_orbit --filename ../build/satellites.tmp/fig_1.png
Traceback (most recent call last):
  File "/home/riastradh/bayesdb/master/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 57, in __call__
    return self.func(*args)
  File "../../hooks/hook_plots.py", line 163, in pairplot
    show_full=args.show_full)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/__init__.py", line 70, in pairplot
    return pairplot(*args, **kwargs)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 153, in pairplot
    show_full=show_full)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 772, in _pairplot
    bdb=bdb, generator_name=generator_name, colors=colors)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 408, in do_hist
    bins = _safer_freedman_diaconis_bins(data_srs)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 362, in _safer_freedman_diaconis_bins
    h = 2 * iqr(a) / (len(a) ** (1 / 3))
  File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/seaborn/utils.py", line 347, in iqr
    q3 = stats.scoreatpercentile(a, 75)
  File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/scipy/stats/stats.py", line 1522, in scoreatpercentile
    return _compute_qth_percentile(sorted, per, interpolation_method, axis)
  File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/scipy/stats/stats.py", line 1565, in _compute_qth_percentile
    return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
TypeError: can't multiply sequence by non-int of type 'float'

Ignore: toy issue for better understanding github issues.

lick gremio

Our demos should not have large categoricals, because those mess us up

The operator_owner field of satellites is large in this sense.

It may also help for the schema to accept the size of categorical as a parameter (but: what to do if that's wrong? treat it as an upper bound?)

It would probably also help for GUESS(*) to surface the sizes of the categoricals.

.show --colorby argument should not be case-sensitive

For example

.show 'SELECT Expected_Lifetime, dry_mass_kg, class_of_orbit, p_lifetime FROM predprob_life' --colorby Class_of_Orbit

works, but

.show 'SELECT Expected_Lifetime, dry_mass_kg, class_of_orbit, p_lifetime FROM predprob_life' --colorby class_of_orbit

raises a KeyError from pandas.

It makes sense that if column selection is case-insensitive, that the argument to --colorby should be case-insensitive as well.
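A sketch of how the `--colorby` value might be matched against the result columns without regard to case before it is used as a pandas key. `resolve_column` is a hypothetical helper, and the column list in the usage line is made up for illustration.

```python
def resolve_column(name, columns):
    """Return the entry of `columns` that equals `name` ignoring case.

    Hypothetical helper. Raises KeyError with a readable message when
    there is no match, or when several columns differ only by case.
    """
    matches = [c for c in columns if c.lower() == name.lower()]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError("no column named %r among %r" % (name, list(columns)))
    raise KeyError("ambiguous column name %r: matches %r" % (name, matches))


# Illustrative column names only.
columns = ["Expected_Lifetime", "dry_mass_kg", "Class_of_Orbit", "p_lifetime"]
print(resolve_column("class_of_orbit", columns))
```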

accessing an axes with plt.subplot(plt_grid[a,b]) clobbers the axis

the issue is in

https://github.com/mit-probabilistic-computing-project/bdbcontrib/blob/54be59a6a6b3cc2ade929eeb0f1b4c1adfd7fc44/bdbcontrib/plotutils.py#L492

There is some note on the documentation of plt.subplot (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot) which might explain the issue
"Creating a new subplot with a position which is entirely inside a pre-existing axes will trigger the larger axes to be deleted"

Package for pip install

Needs at least crud in setup.py, license, and testing.

.chainplot not compatible with .readtohtml

.readtohtml saves figures by adding a --filename argument to the command. .chainplot only supports positional arguments.

.show without arguments kills the bayeslite shell

When entering ".show" without arguments, the bayeslite shell freaks out and terminates.

to reproduce:

fsaad@fsaad-xps:/Documents/pcp/crime/src$ bayeslite -m
added command ".register_bql_math_functions"
added command ".readtohtml"
added command ".nullify"
added command ".heatmap"
added command ".show"
added command ".ccstate"
added command ".histogram"
added command ".bar"
added command ".chainplot"
Welcome to the Bayeslite shell.
Type `.help' for help.
bayeslite> .show
usage: bayeslite [-h] [-f FILENAME] [-g GENERATOR] [-s] [--no-contour] [-m]
[--colorby COLORBY]
bql [bql ...]
bayeslite: error: too few arguments
fsaad@fsaad-xps:/Documents/pcp/crime/src$ <--- bayeslite exits here --->
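The exit is consistent with how `argparse` behaves by default: `ArgumentParser.error()` prints the usage and then exits the process. A sketch of one way around that is to override `error()` so it raises an exception the hook can catch. The class and function names below are hypothetical, and the parser only loosely mirrors the usage line above.

```python
import argparse


class ShellArgumentError(Exception):
    """Raised instead of exiting when a dot-command gets bad arguments."""


class NonExitingParser(argparse.ArgumentParser):
    def error(self, message):
        # The stock implementation prints usage and exits the process;
        # raising keeps the surrounding shell alive.
        usage = self.format_usage().rstrip()
        raise ShellArgumentError(
            "%s\n%s: error: %s" % (usage, self.prog, message))


def parse_show_args(argv):
    """Parse arguments for a hypothetical .show hook; return None on error."""
    parser = NonExitingParser(prog=".show")
    parser.add_argument("bql", nargs="+")
    parser.add_argument("-f", "--filename")
    parser.add_argument("--colorby")
    try:
        return parser.parse_args(argv)
    except ShellArgumentError as e:
        print(e)
        return None


parse_show_args([])  # prints usage and the error, then returns None
```

Note that `-h` still exits through `parser.exit()`, so a hook that wants to survive `-h` would also need to catch `SystemExit` around `parse_args`.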

Session capture UI nits

A big, bold WARNING feels overkill for a non-session-breaking error,
especially since the query that printed it was the one in which I
fixed it.
Not telling me the object whose send_session_data method to call is
somewhat underhelpful, but I don't know a good way to fix that

Fill in LICENSE.txt

https://github.com/probcomp/bdbcontrib/blob/master/LICENSE.txt#L190

Orbital mechanics answers are asymmetrically accurate with plugin model

The plugin model defined in https://github.com/mit-probabilistic-computing-project/bdbcontrib/blob/fsaad-foreign-predictor/src/foreign/sat_orbital_mech.py is only able to evaluate Kepler's Laws in one direction, computing period given apogee and perigee. The same model cannot be used to reverse the computation, and so improved results are delivered only for specific inferences which have available apogee and perigee values.
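To illustrate the asymmetry, here is a sketch of Kepler's third law for an Earth orbit. It is not the code in sat_orbital_mech.py; the constants are assumed values, and apogee and perigee are assumed to be altitudes in km above the Earth's surface.

```python
import math

MU_EARTH = 398600.4418  # km^3 / s^2, assumed standard gravitational parameter
R_EARTH = 6378.137      # km, assumed equatorial radius


def period_minutes(apogee_km, perigee_km):
    """Forward direction: orbital period from apogee and perigee altitudes."""
    a = R_EARTH + (apogee_km + perigee_km) / 2.0  # semi-major axis
    return 2 * math.pi * math.sqrt(a ** 3 / MU_EARTH) / 60.0


def apogee_plus_perigee_km(period_min):
    """Reverse direction: the period only determines apogee + perigee."""
    t = period_min * 60.0
    a = (MU_EARTH * (t / (2 * math.pi)) ** 2) ** (1.0 / 3.0)
    return 2 * (a - R_EARTH)


# Two different orbits with the same semi-major axis share one period.
print(period_minutes(400, 400), period_minutes(500, 300))
```

The reverse function recovers only the sum of apogee and perigee, so a reverse inference would need one of the two (or some other constraint) in addition to the period.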

.ccstate should support names for each row

when using .ccstate to view the crosscat views, each row in the data set appears as an integer index on the y-axis. It would be useful to show a name for each row.

report cycle in foreign predictor composer in linear time

The foreign predictor composer currently takes O(|V| (|V| + |E|)) time to topologically sort its constituents, and then just says that there is a cycle if it finds one without saying anything about the cycle.

We can compute a topological sort and report where the cycle was, if we found one, in linear time using Tarjan's algorithm[1] to compute the strongly-connected components of a directed graph in topological order: if any components have size >1, those are cycles and we can report them in the error message.

[1] https://en.wikipedia.org/wiki/Tarjan's_strongly_connected_components_algorithm
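A sketch of the idea, assuming the DAG is given as a dict mapping each node to the nodes that depend on it. The function names are hypothetical and this is not the composer's code. Tarjan's algorithm emits components sinks-first, so reversing the list gives a parents-first order. A single node with an edge to itself is also reported as a cycle.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; `graph` maps node -> iterable of successors.

    Returns the components in reverse topological order (sinks first).
    """
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def topological_order_or_cycles(graph):
    """Return a parents-first order, or raise ValueError naming the cycles."""
    components = strongly_connected_components(graph)
    cycles = [c for c in components
              if len(c) > 1 or c[0] in graph.get(c[0], ())]
    if cycles:
        raise ValueError("cycle(s) among: %r" % (cycles,))
    return [c[0] for c in reversed(components)]
```

The recursive `visit` is fine for a handful of foreign predictors; a very deep graph would need an explicit stack to stay under Python's recursion limit.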

Indexing error in barplot when not plotting two columns

From @vkmvkmvkmvkm:

Needed to ask Feras about why obscure indexing error in barplot was happening. Answer was I needed 2 columns from SELECT output (which makes perfect sense in hindsight). Also iterating was difficult --- needed to ask Taylor about the bdb.savepoint() business b/c of annoying "table already exists"... --- and finally, can't collapse into single SIMULATE due to surface syntax limitations (this should be signposted in an error with a TBD: properly re-embed SQL)

"_ = bdbcontrib.barplot(satellites_bdb, '''
SIMULATE purpose FROM satellites_cc
GIVEN country_of_operator = ""China (PR)""
LIMIT 1000;
''')"
"<...> /Volumes/BayesDB/BayesDB-bayes0.1rc1-1-gd96fe22.app/Contents/MacOS/venv/lib/python2.7/site-packages/matplotlib/backends/backend_agg.pyc in init(self, width, height, dpi)
92 self.height = height
93 if debug: verbose.report('RendererAgg.init width=%s, height=%s'%(width, height), 'debug-annoying')
---> 94 self._renderer = _RendererAgg(int(width), int(height), dpi, debug=False)
95 self._filter_renderers = []
96

ValueError: width and height must each be below 32768
"

turn the main code in plotutils.py into an automatic test

seaborn div0s on plotting super-skewed distributions.

https://github.com/mwaskom/seaborn/blob/master/seaborn/distributions.py line 27 has
h = 2 * iqr(a) / (len(a) ** (1 / 3))
And iqr (inter-quartile range) can be zero for skewed distributions that are otherwise plottable.
It should fall back to max(a) - min(a) but doesn't.
I don't think I can fix that from outside.
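If the bin count were computed on our side before calling into seaborn, the fallback could look like the sketch below. It is a pure-Python illustration, not seaborn's or bdbcontrib's implementation; the function names are hypothetical. It falls back to max - min when the IQR is zero and rejects non-numeric data with a clear error.

```python
import math


def _quantile(sorted_vals, q):
    """Linear-interpolation quantile of a sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def safer_freedman_diaconis_bins(values):
    """Number of histogram bins by the Freedman-Diaconis rule, with fallbacks."""
    try:
        data = sorted(float(v) for v in values)
    except (TypeError, ValueError):
        raise TypeError("Freedman-Diaconis binning needs numeric data")
    if len(data) < 2:
        return 1
    spread = data[-1] - data[0]
    if spread == 0:
        return 1  # all values identical: a single bin
    iqr = _quantile(data, 0.75) - _quantile(data, 0.25)
    basis = iqr if iqr > 0 else spread  # fall back to max - min
    # Float exponent: under Python 2, 1 / 3 is integer division and gives 0.
    h = 2 * basis / (len(data) ** (1.0 / 3.0))
    return max(1, int(math.ceil(spread / h)))


print(safer_freedman_diaconis_bins([0] * 98 + [1, 100]))
```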

operator_owner is independent of country_of_operator?

In the actual data, the relationship is nearly deterministic in one direction:

select operator_owner, count(distinct country_of_operator) as ct
from satellites
group by operator_owner
order by ct desc
limit 1

says "Ministry of Defense, 3", but crosscat almost uniformly decides not to cluster them. @vkmvkmvkmvkm says this is likely due to operator_owner being a large (346 items) categorical, thus messing everything up.

.mihist without arguments exits bayeslite shell

bayeslite> .mihist
usage: bayeslite [-h] [-f FILENAME] [-n NUM_SAMPLES] [-b BINS]
generator col1 col2
bayeslite: error: too few arguments
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ < -- EXITED -->

Expand foreign predictors to be full foreign (conditional) metamodels (if desired)

Steps I foresee on the path:

Teach the predictors to implement the metamodel interface
- Define a combinator named EZMetamodel or something (in
  bdbcontrib) that implements the metamodel interface by delegating
  to an object meeting the IBayesDBForeignPredictor{Factory,}
  interfaces.
Make the composer interact with predictors through the metamodel
interface
- Tweak register_foreign_predictor to take a metamodel object
  rather than a builder
- predict_confidence calls the predictor's simulate and logpdf
  methods directly
- _weighted_sample calls the predictor's simulate and logpdf
  methods directly
Incrementally generalize the composer to composing arbitrary
(conditional) GPMs
- Add predictor-specific create_generator calls
- Add predictor-specific drop_generator calls
- Generalize initialize_models to call sub-initializations
  - Will this spam the bayesdb generator table with conditional
    generators?
  - Will I need to call instantiate from the EZMetamodel?
- Generalize drop_models to call sub-drops
- Generalize to ensembling the FPs (and give them an option to
  declare that they are deterministic enough not to benefit from it)
  (or a user-facing option to indicate how big an ensemble to make)
- Generalize the composer to handle multi-row queries
- Generalize the composer to handle FPs with multiple target columns
  - Treat the set of targets as a supernode in the dag, in several
    places.
  - Could add more hair for permitting the user to request
    simulation of only some targets and not others.
- Generalize to accept inference program arguments, both for
  selecting how to solve its own inference problems (e.g., whether
  to use likelihood weighting for situations where a predictor's
  output is constrained, and with how many trials), and for passing
  down to the FPs (if they accept inference program arguments
  themselves).
If desired, expand the FP interface to cover more of the ground that
the GPM interface covers. Where, then, to draw the distinction?
- Permit stochastic FPs
- Permit per-row latents, which would make a distinction between
  simulating from observed and unobserved rows
- Permit multi-row queries
- Permit multiple target columns
  - The interface to predictor simulations changes to return a list of
    lists (all the targets jointly), but the existing ones all require
    one target and return just the one value.
  - Same in logpdf

Outstanding issues in composer

According to @axch:

Um, include the casefold in the presence check at composer.py:178,
if that's what you meant.
Line 207: What are locls, fcols, pcols, and fpreds? Can we make the
names more descriptive?
- Might not be a bad idea to define a container class for the
  "hodgepodge", named something like "Configuration" or "Schema".
- Actually, a namedtuple will probably do
What's with the comment on line 215? Is that an example value?
If so, call it that and fix it (underscore)
Line 230: Why is this a bql query instead of a call to the metamodel
API?
- If it remains a BQL query, should quote the elements properly
Line 305: Quote the cc_name
Line 314: Why is this a BQL query instead of a call to the metamodel
API?
- Is there any robust way to pass the set of modelnos through, to
  maintain the intended 1-1 mapping between composer model numbers
  and underlying crosscat model numbers?
Line 514: Is this actually correct? Looks like a type error:
checking whether a tuple is in "ignore", even though that
accumulates just the columns. Would manifest as failing to ignore
things it should ignore.
- If you are not sure, this would be a good thing to poke with a
  test.
- We also know how to build the ignore table better, per Taylor's
  review of predictive_probability.

Set up satisfactory production-size Jenkins builds of the satellites example

This is an outstanding task from #58. Stuff we might want to automate

Building a big-enough set of satellites bdbs for distribution and/or stability testing
Building stability assessments
Serving the resulting plots from Jenkins (how?) or from a Jenkins-writable location
Automatically checking for egg-on-face from releasing the release .bdb file
- The particular query run that the notebook has in it
- Other query runs that users will get by changing random seeds
- What is the definition of egg-on-face?
Automatically checking for stability of our claims made in the notebook under building other release .bdb files we might reasonably build
- What is the definition of acceptable probability of getting a non-conforming build?

contrib.py should be broken into multiple files

One file per set of related commands.

Importing bdbcontrib takes a while

From @LuaC:

importing bdbcontrib takes a while?

`--colorby` argument for `.show` should work with stattypes other than NUMERICAL

move src/demo/ed25519 to external/ed25519/dist

...and manage it with vendor branches like we do for lemonade, plex, &c., in bayeslite.

.bar chops off half of x-axis

The .bar command only displays the first half of the rows in the query. I reckon it's an xlim issue; the rightmost bar is halved when the number of rows in the output is odd.

osx app: no module named ipykernel

[I 11:25:06.744 NotebookApp] Kernel started: b9215e58-990d-4371-8aaf-d763eb5e639f
/Applications/Bayeslite-bayes0.1+unknown.app/Contents/MacOS/venv/bin/python2.7: No module named ipykernel

However, my sys.path when run within the venv does not include anything from the app, so this might be an issue with my system configuration of virtualenv?

Performance issues and considerations for composer

The composer uses generic simple Monte Carlo estimates (likelihood weighting) of various information theoretic quantities required to implement BQL. The advantage of this approach is that the composer can answer ad-hoc queries with arbitrary target and constrained nodes in the DAG without knowing the internals of its constituent GPMs. The downside is that some implementations are slow. This issue outlines key concerns on a method-by-method basis, with approximate complexity. There will have to be design decisions before releasing the code into the wild.

Currently the composer takes in a n_samples parameter to control the accuracy/time of each estimate. Future interface will make each query customizable through API or BQL.

register

No major concerns.

create_generator

No major concerns. One topological sort of the DAG is performed, using an adjacency list representation, at roughly O(nm) ~ O(n^3) for a dense graph, but this is hardly ever a problem unless one has an unusually large number of FPs.

drop_generator

No major concerns. For large tables, dropping the internal crosscat metamodel has empirically been shown to take non-negligible time, which the composer cannot change.

initialize_models

Runs initialize for crosscat (can be slow for large datasets).
Runs create and serialize for each foreign predictor (scales with train time of FP), then inserts the binary into the SQL database.

drop_models

TODO.

analyze_models

No joint inference, just crosscat analysis.

column_dependence_probability

Simple graph walk in the DAG, roughly O(E). Currently we don't cache intermediary results in the recursion -- might be necessary for a large number of columns.

conditional_mutual_information

Super expensive. To simulate n_samples, we need to invoke _weighted_samples roughly n_samples^2 times: the weighted sampler is approximate, and we need n_samples weighted samples to get one approximate sample from the posterior. We then invoke _joint_pdf four times.

_joint_logpdf

Super expensive. We need to compute the partition function (likelihood of the evidence constraints). One possible solution is to kill the computation of the evidence (2x speedup) and only return unnormalized values for continuous values, since densities are mostly useful for comparison.

Note that there are no known algorithms for reusing the samples for QY and Y.

predict_confidence

Might be expensive. For a child node, we need to impute all the missing parents, which for continuous values is typically slow. For predicting a column modeled by a foreign predictor, we need to invoke simulate.

simulate

Expensive. Because the sampler is approximate, we need a large number of weighted samples for 1 approximate sample to return (empirically, 1 appx sample needs ~200 weighted samples).
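For readers unfamiliar with why one returned sample costs many weighted ones, here is a generic sketch of likelihood weighting followed by resampling. It is not the composer's implementation: `simulate_prior` and `log_likelihood` are hypothetical callables, and the default of 200 just echoes the empirical figure above.

```python
import math
import random


def approximate_posterior_sample(simulate_prior, log_likelihood,
                                 n_weighted=200, rng=random):
    """Return one approximate posterior sample.

    Draws n_weighted samples from the prior, weights each by the
    likelihood of the constraints, and resamples one in proportion to
    its weight.
    """
    particles = [simulate_prior(rng) for _ in range(n_weighted)]
    log_w = [log_likelihood(p) for p in particles]
    top = max(log_w)
    if top == float("-inf"):
        raise ValueError("all weighted samples have zero likelihood")
    # Subtract the max before exponentiating to avoid underflow.
    weights = [math.exp(lw - top) for lw in log_w]
    return rng.choices(particles, weights=weights, k=1)[0]


# Toy model: a fair coin, constrained so that only heads (1) is possible.
rng = random.Random(0)
sample = approximate_posterior_sample(
    lambda r: r.choice([0, 1]),
    lambda x: 0.0 if x == 1 else float("-inf"),
    rng=rng)
print(sample)
```

Every returned sample costs `n_weighted` prior simulations and likelihood evaluations, which is where the expense comes from.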

row_similarity

Delegates to crosscat.

row_column_predictive_probability

Delegates to column_value_probability. I have issues with the query, see comment in the code.

(de)serializing foreign predictor binaries

Deserialized FP binaries are cached in memory per-bdb session, rather than loaded from the database on-query demand. I do not anticipate this caching to cause any noticeable overhead.

Plots shown in IPython notebook are not resizable

Figure (pun intended) out a way to beautify the shape and size of plots returned by the api plotting utilities such that IPython notebook renders them nicely.

Consider adding a "drop table if exists X" example to the notebook

This may reflect my own relative lack of experience with databases relative to the expected user base, but including an example of e.g.:

q('drop table if exists satellite_purpose')

prior to creating a new table might save certain users some time in figuring out how to use the tool.
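A sketch of the pattern using plain `sqlite3` rather than the notebook's `q` helper or BQL, so the table schema here is made up. It shows that re-running the same cell no longer trips over "table already exists".

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def create_purpose_table(conn):
    # Dropping first makes the cell safe to re-run.
    conn.execute("DROP TABLE IF EXISTS satellite_purpose")
    conn.execute("CREATE TABLE satellite_purpose (purpose TEXT)")


create_purpose_table(conn)
create_purpose_table(conn)  # second run succeeds

try:
    # Without the DROP, the same CREATE fails on the second run.
    conn.execute("CREATE TABLE satellite_purpose (purpose TEXT)")
except sqlite3.OperationalError as e:
    print("without the drop:", e)
```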

Save BQL and SQL traces and infrastructure to send on port

@tibbetts began this on this branch: 20150910-tibbetts-sessions. Currently sessions are being stored into the database in table "bayesdb_session_entries".

Steps:

bdbcontrib function for printing all of the sessions to screen.
modify function to send out on port
server running on csail to accept it

Legend of pairplot --colorby unhelpful

For some reason it now lists every available value with every color, instead of indicating which value corresponds to which color.

I think this happened when @gregory-marton and I were cleaning gen_collapsed_legend_from_dict but we were rushed and/or didn't notice at the time.

Strange ValueError -- I think commas are causing do_violinplot to crash

bayeslite> .show 'ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE COLUMNS OF malawi'
Traceback (most recent call last):
  File "/home/fsaad/pcp/bayeslite/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 57, in __call__
    return self.func(*args)
  File "/home/fsaad/Documents/pcp/bdbcontrib/hooks/hook_plots.py", line 163, in pairplot
    show_full=args.show_full)
  File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 154, in pairplot
    show_full=show_full)
  File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 757, in _pairplot
    colors=colors)
  File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 569, in do_pair_plot
    ax = DO_PLOT_FUNC[hash(vartypes)](plot_df, vartypes, **kwargs)
  File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 484, in do_violinplot
    color='SteelBlue')
  File "/usr/local/lib/python2.7/dist-packages/seaborn/distributions.py", line 336, in violinplot
    vals, xlabel, ylabel, names = _box_reshape(vals, groupby, names, order)
  File "/usr/local/lib/python2.7/dist-packages/seaborn/distributions.py", line 100, in _box_reshape
    vals = [np.asarray(a, np.float) for a in vals]
  File "/usr/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: zygosity
could not convert string to float: zygosity

Convert the metadata format of examples/satellites/build_bdb.py to json

so it's machine readable. This is a residual task from #58.
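A sketch of what writing the metadata as JSON could look like. The field names are hypothetical, since the current format of build_bdb.py's metadata is not shown here. `indent` and `sort_keys` keep the output readable on screen while staying machine readable.

```python
import json


def format_metadata(metadata):
    """Serialize build metadata as stable, human-readable JSON."""
    return json.dumps(metadata, indent=2, sort_keys=True)


# Hypothetical fields, for illustration only.
metadata = {"seed": 0, "num_models": 64, "num_iterations": 1500}
text = format_metadata(metadata)
print(text)
```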

do_query with CREATE TABLE causes a cursor reference to be printed to screen

Example: The following query would result in a printed line without the "_ = " prefix:

_ = do_query(satellites_bdb, '''
CREATE TEMP TABLE inferred_orbit AS
INFER EXPLICIT
anticipated_lifetime, perigee_km, period_minutes, class_of_orbit,
PREDICT type_of_orbit AS inferred_orbit_type
CONFIDENCE inferred_orbit_type_conf
FROM satellites_cc
WHERE type_of_orbit IS NULL;
''')

non-grody solution and automatic tests for seaborn .empty / dwimnonzero bug

.prettypony command to just draw some plot to make sure plotting is working

automatically test stability of phenomena documented in examples

We once had the ISS shown as weirdest by expected_lifetime; some time later we documented Sicron 1A as the weirdest; now it is not shown as weirdest.

(a) We need to determine how to assess the stability of phenomena for our demos.
(b) We need to find stable phenomena for our demos.
(c) We need to automatically test these in our demos.

make PDFs of example analyses

Complete integration of stability checking

Should move the reusable modules to bdbcontrib/src and document
them for the users also
Should parameterize the scripts and make Jenkins builds
- Parameterization: main function that takes programmatic
  parameters, another function that chews up commandline arguments,
  if main check that invokes it.
- A smoke test build that acts as a crash-level integration test of
  all this stuff, possibly from within check.sh depending on how
  slow it is
- A full build that produces reusable .bdb artifacts, certainly
  outside check.sh
  - Would be good if Jenkins would also serve the plots
Convert the metadata format (record_metadata) into something machine
readable, like json
- Is there a standard human-readable version of json for printing to
  screen?
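The parameterization item above could be sketched as follows: one function taking programmatic parameters, one that parses the command line, and an `if __name__` check that ties them together. All names, flags, and defaults are hypothetical, and the body of `build_bdbs` is a placeholder.

```python
import argparse


def build_bdbs(num_models, num_iterations, seed=0):
    """Programmatic entry point, callable from tests or Jenkins jobs.

    Placeholder body: a real version would build and analyze the .bdb.
    """
    return {"num_models": num_models,
            "num_iterations": num_iterations,
            "seed": seed}


def parse_args(argv=None):
    """Turn command-line arguments into parameters for build_bdbs."""
    parser = argparse.ArgumentParser(description="Build satellites bdbs")
    parser.add_argument("--num-models", type=int, default=4)
    parser.add_argument("--num-iterations", type=int, default=10)
    parser.add_argument("--seed", type=int, default=0)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    return build_bdbs(args.num_models, args.num_iterations, seed=args.seed)


if __name__ == "__main__":
    main()
```

A smoke test can call `build_bdbs` directly with tiny parameters, while the full Jenkins build goes through the command line.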

.show breaks when a column appears twice

For example:

bayeslite> .show 'SELECT whz, muac, agemonths, diarrhea, fever, cough, diarrhea, vomiting, site FROM malawi;'
Traceback (most recent call last):
  File "/home/fsaad/pcp/bayeslite/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 50, in __call__
    return self.func(*args)
  File "/home/fsaad/Documents/pcp/bdbcontrib/bdbcontrib/contrib_plot.py", line 223, in pairplot
    colorby=args.colorby, show_missing=args.show_missing)
  File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 466, in pairplot
    generator_name=generator_name)
  File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 116, in get_bayesdb_col_type
    return guess_column_type(df_column)
  File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 88, in guess_column_type
    pd_type = df_column.dtype
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2150, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'dtype'
'DataFrame' object has no attribute 'dtype'

We should at least make it fail gracefully, with an informative error message.
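When a label matches more than one column, pandas returns a DataFrame rather than a Series for that label, which is consistent with the AttributeError above. A sketch of an up-front check that fails with a readable message instead; `check_unique_columns` is a hypothetical helper.

```python
from collections import Counter


def check_unique_columns(column_names):
    """Raise ValueError naming any column that was selected more than once."""
    counts = Counter(column_names)
    duplicates = sorted(name for name, k in counts.items() if k > 1)
    if duplicates:
        raise ValueError(
            "Cannot plot: column(s) selected more than once: %s. "
            "Select each column once, or alias the repeats with AS."
            % ", ".join(duplicates))


columns = ["whz", "muac", "agemonths", "diarrhea", "fever", "cough",
           "diarrhea", "vomiting", "site"]
try:
    check_unique_columns(columns)
except ValueError as e:
    print(e)
```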