probcomp / bdbcontrib

BayesDB contributions, including plotting, helper methods, and examples

Home Page: http://probcomp.csail.mit.edu/bayesdb

License: Apache License 2.0

Python 87.10% CSS 0.15% Shell 0.34% Makefile 0.43% Jupyter Notebook 11.98%

bdbcontrib's People

Contributors

alxempirical, axch, baxtereaves, gregory-marton, jayelm, riastradh-probcomp, tibbetts

bdbcontrib's Issues

Introduce caption in plot commands of the BQL query.

Feature request: include an option to caption a plot with the BQL query (e.g. --show-caption). For instance

.heatmap 'ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM malawi' --last-sort

would have ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM malawi as a caption. Long captions would have ellipses after 100 chars.
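
A minimal sketch of the truncation rule, assuming a helper named make_caption (hypothetical, not part of bdbcontrib) that the plot hooks would call when --show-caption is given:

```python
def make_caption(bql, max_chars=100):
    """Return the query text, shortened with an ellipsis if it is too long.

    make_caption and max_chars are illustrative names, not bdbcontrib API.
    """
    # Collapse newlines and repeated spaces so multi-line queries fit on one line.
    caption = " ".join(bql.split())
    if len(caption) <= max_chars:
        return caption
    return caption[:max_chars] + "..."
```

The returned string could then be handed to whatever title or figure-text call the plotting code already uses.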

Many warnings when no shortnames are used

Are a dozen warnings necessary to indicate that there is no shortname for a column?

bayeslite> .ccstate states_cc 15
Warning: No shortname found for illiteracy. Using column name.
Warning: No shortname found for geo. Using column name.
Warning: No shortname found for lifeexp. Using column name.
Warning: No shortname found for livealone. Using column name.
Warning: No shortname found for murder. Using column name.
Warning: No shortname found for gdp. Using column name.
Warning: No shortname found for hsgrad. Using column name.
Warning: No shortname found for minority. Using column name.
Warning: No shortname found for rape. Using column name.
Warning: No shortname found for frost. Using column name.
Warning: No shortname found for assault. Using column name.
Warning: No shortname found for income. Using column name.
Warning: No shortname found for population. Using column name.
Warning: No shortname found for divorce. Using column name.
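
One possible way to collapse these into a single line is sketched below. The function name and the idea that shortnames arrive as a dict are assumptions for illustration; the real lookup in bdbcontrib may be shaped differently.

```python
def shortname_warning(columns, shortnames):
    """Build one warning listing every column that lacks a shortname.

    columns: iterable of column names; shortnames: dict of column -> shortname.
    Both argument shapes are assumptions made for this sketch.
    """
    missing = [c for c in columns if c not in shortnames]
    if not missing:
        return None  # nothing to warn about
    return ("Warning: No shortname found for %d column(s): %s. "
            "Using column names." % (len(missing), ", ".join(missing)))
```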

README should be updated since "contrib.py" no longer exists

.hook /absolute/path/to/contrib.py

is outdated after breaking up the files as suggested in (#2)

To reproduce:
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ bayeslite -m
No such file or directory /home/fsaad/Documents/pcp/bdbcontrib/bdbcontrib/contrib.py

Make font size for all plotting commands adjustable

@vkmvkmvkmvkm

Can you make font-size adjustable as a command line flag for all plot commands in bdbcontrib? This is a real issue and its absence (+ the absence of useful defaults or a "--screen" and "--pdf" mode) should be considered a bug in the plotting code. Otherwise there's a valley of death between "iterating yourself" and "showing it to anyone else".

freedman diaconis binning is still broken (maybe just for categoricals?)

Performing .show 'SELECT inferred_orbit_type, inferred_orbit_type_conf, class_of_orbit FROM inferred_orbit;' --colorby class_of_orbit --filename ../build/satellites.tmp/fig_1.png
Traceback (most recent call last):
  File "/home/riastradh/bayesdb/master/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 57, in __call__
    return self.func(*args)
  File "../../hooks/hook_plots.py", line 163, in pairplot
    show_full=args.show_full)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/__init__.py", line 70, in pairplot
    return pairplot(*args, **kwargs)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 153, in pairplot
    show_full=show_full)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 772, in _pairplot
    bdb=bdb, generator_name=generator_name, colors=colors)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 408, in do_hist
    bins = _safer_freedman_diaconis_bins(data_srs)
  File "/home/riastradh/bayesdb/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 362, in _safer_freedman_diaconis_bins
    h = 2 * iqr(a) / (len(a) ** (1 / 3))
  File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/seaborn/utils.py", line 347, in iqr
    q3 = stats.scoreatpercentile(a, 75)
  File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/scipy/stats/stats.py", line 1522, in scoreatpercentile
    return _compute_qth_percentile(sorted, per, interpolation_method, axis)
  File "/tmp/riastradh/20150915/venv/local/lib/python2.7/site-packages/scipy/stats/stats.py", line 1565, in _compute_qth_percentile
    return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
TypeError: can't multiply sequence by non-int of type 'float'
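
The TypeError suggests the percentile code was handed strings, which fits the categorical guess. Separately, 1 / 3 evaluates to 0 under Python 2 integer division unless the module imports division from __future__ (I have not checked whether plot_utils.py does). Below is a standard-library sketch of a guard that handles both; safer_fd_bins, max_bins, and the square-root fallback are my own illustrative choices, not what _safer_freedman_diaconis_bins actually does.

```python
import math


def _quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted, non-empty list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def safer_fd_bins(values, max_bins=50):
    """Freedman-Diaconis bin count, or None if the data are not numeric."""
    try:
        data = [float(v) for v in values if v is not None]
    except (TypeError, ValueError):
        return None  # categorical data: the caller should not draw a histogram
    data = sorted(v for v in data if not math.isnan(v))
    n = len(data)
    if n < 2:
        return 1
    iqr = _quantile(data, 0.75) - _quantile(data, 0.25)
    # Float exponent on purpose: 1 / 3 is 0 under Python 2 integer division.
    width = 2.0 * iqr / (n ** (1.0 / 3.0))
    spread = data[-1] - data[0]
    if width <= 0 or spread <= 0:
        # Degenerate IQR: fall back to a square-root rule (an assumption).
        return min(max_bins, max(1, int(math.sqrt(n))))
    return min(max_bins, max(1, int(math.ceil(spread / width))))
```

A caller that gets None back could route the column to a bar chart or count plot instead.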

Our demos should not have large categoricals, because those mess us up

The operator_owner field of satellites is large in this sense.

It may also help for the schema to accept the size of categorical as a parameter (but: what to do if that's wrong? treat it as an upper bound?)

It would probably also help for GUESS(*) to surface the sizes of the categoricals.

.show --colorby argument should not be case-sensitive

For example

.show 'SELECT Expected_Lifetime, dry_mass_kg, class_of_orbit, p_lifetime FROM predprob_life' --colorby Class_of_Orbit

works, but

.show 'SELECT Expected_Lifetime, dry_mass_kg, class_of_orbit, p_lifetime FROM predprob_life' --colorby class_of_orbit

raises a KeyError from pandas.

It makes sense that if column selection is case-insensitive, that the argument to --colorby should be case-insensitive as well.
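
A sketch of how the argument could be matched against the DataFrame's column labels before indexing. resolve_column is a hypothetical helper; it only needs the list of column names, so it does not depend on pandas.

```python
def resolve_column(requested, columns):
    """Return the actual column label matching `requested`, ignoring case.

    Raises ValueError with a readable message when there is no match or
    when several labels differ only by case.
    """
    wanted = requested.lower()
    matches = [c for c in columns if c.lower() == wanted]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("--colorby column %r not found among: %s"
                         % (requested, ", ".join(columns)))
    raise ValueError("--colorby column %r is ambiguous: %s"
                     % (requested, ", ".join(matches)))
```

For example, resolve_column('class_of_orbit', df.columns) would hand back 'Class_of_Orbit', which can then be used to index the frame safely.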

accessing an axes with plt.subplot(plt_grid[a,b]) clobbers the axis

the issue is in

https://github.com/mit-probabilistic-computing-project/bdbcontrib/blob/54be59a6a6b3cc2ade929eeb0f1b4c1adfd7fc44/bdbcontrib/plotutils.py#L492

There is some note on the documentation of plt.subplot (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot) which might explain the issue
"Creating a new subplot with a position which is entirely inside a pre-existing axes will trigger the larger axes to be deleted"

.show without arguments kills the bayeslite shell

When entering ".show" without arguments, the bayeslite shell freaks out and terminates.

to reproduce:

fsaad@fsaad-xps:/Documents/pcp/crime/src$ bayeslite -m
added command ".register_bql_math_functions"
added command ".readtohtml"
added command ".nullify"
added command ".heatmap"
added command ".show"
added command ".ccstate"
added command ".histogram"
added command ".bar"
added command ".chainplot"
Welcome to the Bayeslite shell.
Type `.help' for help.
bayeslite> .show
usage: bayeslite [-h] [-f FILENAME] [-g GENERATOR] [-s] [--no-contour] [-m]
[--colorby COLORBY]
bql [bql ...]
bayeslite: error: too few arguments
fsaad@fsaad-xps:/Documents/pcp/crime/src$ <--- bayeslite exits here --->
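
The usage message looks like argparse output, and argparse calls sys.exit() when parsing fails, which would take the whole shell down with it. A minimal sketch of a guard follows, assuming the hooks build an argparse.ArgumentParser; parse_hook_args and build_show_parser are hypothetical names, and the parser below only mirrors a few of the flags shown above.

```python
import argparse


def build_show_parser():
    """Illustrative stand-in for the .show parser; not the real definition."""
    parser = argparse.ArgumentParser(prog='.show')
    parser.add_argument('bql', type=str, nargs='+')
    parser.add_argument('-f', '--filename', type=str, default=None)
    parser.add_argument('--colorby', type=str, default=None)
    return parser


def parse_hook_args(parser, argv):
    """Parse argv, returning None instead of exiting on bad input."""
    try:
        return parser.parse_args(argv)
    except SystemExit:
        # argparse has already printed the usage/error text (or help for -h);
        # swallow the exit so the interactive shell keeps running.
        return None
```

Each hook would then return early when it gets None back. The same guard should cover any other dot command that parses its arguments this way.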

Session capture UI nits

  • A big, bold WARNING feels overkill for a non-session-breaking error,
    especially since the query that printed it was the one in which I
    fixed it.
  • Not telling me the object whose send_session_data method to call is
    somewhat underhelpful, but I don't know a good way to fix that

report cycle in foreign predictor composer in linear time

The foreign predictor composer currently takes O(|V| (|V| + |E|)) time to topologically sort its constituents, and then just says that there is a cycle if it finds one without saying anything about the cycle.

We can compute a topological sort and report where the cycle was, if we found one, in linear time using Tarjan's algorithm[1] to compute the strongly-connected components of a directed graph in topological order: if any components have size >1, those are cycles and we can report them in the error message.

[1] https://en.wikipedia.org/wiki/Tarjan's_strongly_connected_components_algorithm
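
A sketch of what this could look like, assuming the composer can hand over a dict mapping each node to the nodes it depends on (the function name and input shape are illustrative, not the composer's current data structures). Tarjan's algorithm emits each strongly connected component only after everything it depends on, so with edges pointing at dependencies the emission order is already a valid evaluation order.

```python
def topo_sort_or_cycles(deps):
    """deps maps each node to the nodes it depends on.

    Returns (order, cycles). If the graph is acyclic, order lists nodes with
    dependencies first and cycles is empty. Otherwise order is None and
    cycles lists every strongly connected component that contains a cycle.
    """
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    nodes = set(deps)
    for targets in deps.values():
        nodes.update(targets)
    for v in sorted(nodes):
        if v not in index:
            visit(v)

    # A component is cyclic if it has several nodes or a self-loop.
    cycles = [c for c in components
              if len(c) > 1 or c[0] in deps.get(c[0], ())]
    order = None if cycles else [c[0] for c in components]
    return order, cycles
```

The recursion is fine for the handful of predictors we expect; an explicit stack would be needed for very deep graphs.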

Indexing error in barplot when not plotting two columns

From @vkmvkmvkmvkm:

Needed to ask Feras about why obscure indexing error in barplot was happening. Answer was I needed 2 columns from SELECT output (which makes perfect sense in hindsight). Also iterating was difficult --- needed to ask Taylor about the bdb.savepoint() business b/c of annoying "table already exists"... --- and finally, can't collapse into single SIMULATE due to surface syntax limitations (this should be signposted in an error with a TBD: properly re-embed SQL)

_ = bdbcontrib.barplot(satellites_bdb, '''
SIMULATE purpose FROM satellites_cc
GIVEN country_of_operator = "China (PR)"
LIMIT 1000;
''')

<...> /Volumes/BayesDB/BayesDB-bayes0.1rc1-1-gd96fe22.app/Contents/MacOS/venv/lib/python2.7/site-packages/matplotlib/backends/backend_agg.pyc in __init__(self, width, height, dpi)
     92         self.height = height
     93         if debug: verbose.report('RendererAgg.__init__ width=%s, height=%s'%(width, height), 'debug-annoying')
---> 94         self._renderer = _RendererAgg(int(width), int(height), dpi, debug=False)
     95         self._filter_renderers = []
     96

ValueError: width and height must each be below 32768

operator_owner is independent of country_of_operator?

In the actual data, the relationship is nearly deterministic in one direction:

select operator_owner, count(distinct country_of_operator) as ct
from satellites
group by operator_owner
order by ct desc
limit 1

says "Ministry of Defense, 3", but crosscat almost uniformly decides not to cluster them. @vkmvkmvkmvkm says this is likely due to operator_owner being a large (346 items) categorical, thus messing everything up.

.mihist without arguments exits bayeslite shell

bayeslite> .mihist
usage: bayeslite [-h] [-f FILENAME] [-n NUM_SAMPLES] [-b BINS]
generator col1 col2
bayeslite: error: too few arguments
fsaad@fsaad-xps:~/Documents/pcp/crime/src$ < -- EXITED -->

Expand foreign predictors to be full foreign (conditional) metamodels (if desired)

Steps I foresee on the path:

  • Teach the predictors to implement the metamodel interface
    • Define a combinator named EZMetamodel or something (in
      bdbcontrib) that implements the metamodel interface by delegating
      to an object meeting the IBayesDBForeignPredictor{Factory,}
      interfaces.
  • Make the composer interact with predictors through the metamodel
    interface
    • Tweak register_foreign_predictor to take a metamodel object
      rather than a builder
    • predict_confidence calls the predictor's simulate and logpdf
      methods directly
    • _weighted_sample calls the predictor's simulate and logpdf
      methods directly
  • Incrementally generalize the composer to composing arbitrary
    (conditional) GPMs
    • Add predictor-specific create_generator calls
    • Add predictor-specific drop_generator calls
    • Generalize initialize_models to call sub-initializations
      • Will this spam the bayesdb generator table with conditional
        generators?
      • Will I need to call instantiate from the EZMetamodel?
    • Generalize drop_models to call sub-drops
    • Generalize to ensembling the FPs (and give them an option to
      declare that they are deterministic enough not to benefit from it)
      (or a user-facing option to indicate how big an ensemble to make)
    • Generalize the composer to handle multi-row queries
    • Generalize the composer to handle FPs with multiple target columns
      • Treat the set of targets as a supernode in the dag, in several
        places.
      • Could add more hair for permitting the user to request
        simulation of only some targets and not others.
    • Generalize to accept inference program arguments, both for
      selecting how to solve its own inference problems (e.g., whether
      to use likelihood weighting for situations where a predictor's
      output is constrained, and with how many trials), and for passing
      down to the FPs (if they accept inference program arguments
      themselves).
  • If desired, expand the FP interface to cover more of the ground that
    the GPM interface covers. Where, then, to draw the distinction?
    • Permit stochastic FPs
    • Permit per-row latents, which would make a distinction between
      simulating from observed and unobserved rows
    • Permit multi-row queries
    • Permit multiple target columns
      • The interface to predictor simulations changes to return a list of
        lists (all the targets jointly), but the existing ones all require
        one target and return just the one value.
      • Same in logpdf

Outstanding issues in composer

According to @axch:

  • Um, include the casefold in the presence check at composer.py:178,
    if that's what you meant.
  • Line 207: What are locls, fcols, pcols, and fpreds? Can we make the
    names more descriptive?
    • Might not be a bad idea to define a container class for the
      "hodgepodge", named something like "Configuration" or "Schema".
    • Actually, a namedtuple will probably do
  • What's with the comment on line 215? Is that an example value?
    If so, call it that and fix it (underscore)
  • Line 230: Why is this a bql query instead of a call to the metamodel
    API?
    • If it remains a BQL query, should quote the elements properly
  • Line 305: Quote the cc_name
  • Line 314: Why is this a BQL query instead of a call to the metamodel
    API?
    • Is there any robust way to pass the set of modelnos through, to
      maintain the intended 1-1 mapping between composer model numbers
      and underlying crosscat model numbers?
  • Line 514: Is this actually correct? Looks like a type error:
    checking whether a tuple is in "ignore", even though that
    accumulates just the columns. Would manifest as failing to ignore
    things it should ignore.
    • If you are not sure, this would be a good thing to poke with a
      test.
    • We also know how to build the ignore table better, per Taylor's
      review of predictive_probability.

Set up satisfactory production-size Jenkins builds of the satellites example

This is an outstanding task from #58. Stuff we might want to automate

  • Building a big-enough set of satellites bdbs for distribution and/or stability testing
  • Building stability assessments
  • Serving the resulting plots from Jenkins (how?) or from a Jenkins-writable location
  • Automatically checking for egg-on-face from releasing the release .bdb file
    • The particular query run that the notebook has in it
    • Other query runs that users will get by changing random seeds
    • What is the definition of egg-on-face?
  • Automatically checking for stability of our claims made in the notebook under building other release .bdb files we might reasonably build
    • What is the definition of acceptable probability of getting a non-conforming build?

.bar chops off half of x-axis

The .bar command only displays the first half of the rows in the query. I reckon it's an xlim issue; the rightmost bar is halved when the number of rows in the output is odd.

osx app: no module named ipykernel

[I 11:25:06.744 NotebookApp] Kernel started: b9215e58-990d-4371-8aaf-d763eb5e639f
/Applications/Bayeslite-bayes0.1+unknown.app/Contents/MacOS/venv/bin/python2.7: No module named ipykernel

However, my sys.path when run within the venv does not include anything from the app, so this might be an issue with my system configuration of virtualenv?

Performance issues and considerations for composer

The composer uses generic simple Monte Carlo estimates (likelihood weighting) of various information theoretic quantities required to implement BQL. The advantage of this approach is that the composer can answer ad-hoc queries with arbitrary target and constrained nodes in the DAG without knowing the internals of its constituent GPMs. The downside is that some implementations are slow. This issue outlines key concerns on a method-by-method basis, with approximate complexity. There will have to be design decisions before releasing the code into the wild.

Currently the composer takes in a n_samples parameter to control the accuracy/time of each estimate. Future interface will make each query customizable through API or BQL.

register

No major concerns.

create_generator

No major concerns. One topological sort of the DAG is performed using an adjacency list representation, roughly O(nm) ~ O(n^3) for a dense graph, but this is hardly ever a problem unless one has an unusually large number of FPs.

drop_generator

No major concerns. For large tables, dropping the internal crosscat metamodel has empirically been shown to take non-negligible time, which the composer cannot change.

initialize_models

Runs initialize for crosscat (can be slow for large datasets).
Runs create and serialize for each foreign predictor (scales with train time of FP), then inserts the binary into the SQL database.

drop_models

TODO.

analyze_models

No joint inference, just crosscat analysis.

column_dependence_probability

Simple graph walk in the DAG, roughly O(E). Currently we don't cache intermediary results in the recursion -- might be necessary for large number of columns.

conditional_mutual_information

Super expensive. To simulate n_samples draws, we need to invoke _weighted_sample roughly n_samples^2 times: the weighted sampler is approximate, and we need n_samples weighted samples to get one approximate sample from the posterior. We then invoke _joint_logpdf four times.

_joint_logpdf

Super expensive. We need to compute the partition function (likelihood of the evidence constraints). One possible solution is to kill the computation of the evidence (2x speedup) and only return unnormalized values for continuous values, since densities are mostly useful for comparison.

Note that there are no known algorithms for reusing the samples for QY and Y.

predict_confidence

Might be expensive. For a child node, we need to impute all the missing parents, which for continuous values is typically slow. For predicting a column modeled by a foreign predictor, we need to invoke its simulate method.

simulate

Expensive. Because the sampler is approximate, we need a large number of weighted samples for 1 approximate sample to return (empirically, 1 appx sample needs ~200 weighted samples).
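
To make the cost concrete, here is a toy sketch of likelihood weighting on a two-node model that I made up for illustration: parent X ~ Normal(0, 1), child Y | X ~ Normal(X, 1), with Y observed. None of these functions are the composer's; only the default of 200 weighted samples per approximate sample comes from the figure above.

```python
import math
import random


def normal_logpdf(x, mean, sd):
    """Log density of Normal(mean, sd) at x."""
    return (-0.5 * math.log(2 * math.pi * sd * sd)
            - (x - mean) ** 2 / (2 * sd * sd))


def weighted_sample(y_obs, n_weighted, rng):
    """One approximate draw of X given Y = y_obs via likelihood weighting."""
    # Propose from the prior, weight each proposal by the evidence likelihood.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_weighted)]
    logws = [normal_logpdf(y_obs, x, 1.0) for x in xs]
    top = max(logws)
    ws = [math.exp(lw - top) for lw in logws]  # subtract max for stability
    # Resample a single proposal in proportion to its weight.
    u = rng.random() * sum(ws)
    acc = 0.0
    for x, w in zip(xs, ws):
        acc += w
        if u <= acc:
            return x
    return xs[-1]


def simulate(y_obs, n_samples, n_weighted=200, seed=0):
    """n_samples approximate posterior draws; costs n_samples * n_weighted."""
    rng = random.Random(seed)
    return [weighted_sample(y_obs, n_weighted, rng) for _ in range(n_samples)]
```

In this toy model the exact posterior mean of X is y_obs / 2, so the average of simulate(1.0, 500) should land near 0.5. The point is the cost: every returned sample burns n_weighted prior draws and likelihood evaluations.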

row_similarity

Delegates to crosscat.

row_column_predictive_probability

Delegates to column_value_probability. I have issues with the query, see comment in the code.

(de)serializing foreign predictor binaries

Deserialized FP binaries are cached in memory per-bdb session, rather than loaded from the database on-query demand. I do not anticipate this caching to cause any noticeable overhead.

Consider adding a "drop table if exists X" example to the notebook

This may reflect my own relative lack of experience with databases relative to the expected user base, but including an example of e.g.:

q('drop table if exists satellite_purpose')

prior to creating a new table might save certain users some time in figuring out how to use the tool.
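
To illustrate why it helps when re-running cells, here is a self-contained sketch against plain SQLite using Python's sqlite3 module. The q helper below is a stand-in for the notebook's q, and the single purpose column is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def q(sql):
    """Stand-in for the notebook helper: run a statement, return all rows."""
    return conn.execute(sql).fetchall()


def rebuild():
    # Without the drop, the second call would fail with
    # "table satellite_purpose already exists".
    q('drop table if exists satellite_purpose')
    q('create table satellite_purpose (purpose text)')
    q("insert into satellite_purpose values ('Communications')")


rebuild()
rebuild()  # safe to re-run, as when re-executing a notebook cell
row_count = q('select count(*) from satellite_purpose')[0][0]
```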

Legend of pairplot --colorby unhelpful

For some reason it now lists every available value with every color, instead of indicating which value corresponds to which color.

I think this happened when @gregory-marton and I were cleaning gen_collapsed_legend_from_dict but we were rushed and/or didn't notice at the time.

Strange ValueError -- I think commas are causing do_violinplot to crash

bayeslite> .show 'ESTIMATE DEPENDENCE PROBABILITY FROM PAIRWISE COLUMNS OF malawi'
Traceback (most recent call last):
  File "/home/fsaad/pcp/bayeslite/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 57, in __call__
    return self.func(*args)
  File "/home/fsaad/Documents/pcp/bdbcontrib/hooks/hook_plots.py", line 163, in pairplot
    show_full=args.show_full)
  File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 154, in pairplot
    show_full=show_full)
  File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 757, in _pairplot
    colors=colors)
  File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 569, in do_pair_plot
    ax = DO_PLOT_FUNC[hash(vartypes)](plot_df, vartypes, **kwargs)
  File "/home/fsaad/pcp/bdbcontrib/build/lib.linux-x86_64-2.7/bdbcontrib/plot_utils.py", line 484, in do_violinplot
    color='SteelBlue')
  File "/usr/local/lib/python2.7/dist-packages/seaborn/distributions.py", line 336, in violinplot
    vals, xlabel, ylabel, names = _box_reshape(vals, groupby, names, order)
  File "/usr/local/lib/python2.7/dist-packages/seaborn/distributions.py", line 100, in _box_reshape
    vals = [np.asarray(a, np.float) for a in vals]
  File "/usr/lib/python2.7/dist-packages/numpy/core/numeric.py", line 460, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: zygosity
could not convert string to float: zygosity

do_query with CREATE TABLE causes a cursor reference to be printed to screen

Example: The following query would result in a printed line without the "_ = " prefix:

_ = do_query(satellites_bdb, '''
CREATE TEMP TABLE inferred_orbit AS
INFER EXPLICIT
anticipated_lifetime, perigee_km, period_minutes, class_of_orbit,
PREDICT type_of_orbit AS inferred_orbit_type
CONFIDENCE inferred_orbit_type_conf
FROM satellites_cc
WHERE type_of_orbit IS NULL;
''')

automatically test stability of phenomena documented in examples

We once had the ISS shown as weirdest by expected_lifetime; some time later we documented Sicron 1A as the weirdest; now it is not shown as weirdest.

(a) We need to determine how to assess the stability of phenomena for our demos.
(b) We need to find stable phenomena for our demos.
(c) We need to automatically test these in our demos.

Complete integration of stability checking

  • Should move the reusable modules to bdbcontrib/src and document
    them for the users also
  • Should parameterize the scripts and make Jenkins builds
    • Parameterization: main function that takes programmatic
      parameters, another function that chews up commandline arguments,
      if main check that invokes it.
    • A smoke test build that acts as a crash-level integration test of
      all this stuff, possibly from within check.sh depending on how
      slow it is
    • A full build that produces reusable .bdb artifacts, certainly
      outside check.sh
      • Would be good if Jenkins would also serve the plots
  • Convert the metadata format (record_metadata) into something machine
    readable, like json
    • Is there a standard human-readable version of json for printing to
      screen?
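
On the last question: as far as I know there is no separate standard format, but Python's json module can pretty-print with indentation and sorted keys, which stays machine readable. A sketch, with made-up metadata fields since I have not checked what record_metadata currently writes:

```python
import json


def format_metadata(metadata):
    """Serialize metadata as indented JSON with keys in a stable order."""
    return json.dumps(metadata, indent=2, sort_keys=True)


# Field names below are hypothetical placeholders.
example = {"seed": 0, "num_models": 64, "git_revision": "unknown"}
text = format_metadata(example)
```

The same text can be printed to screen and written to disk, and json.loads reads it back unchanged.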

.show breaks when a column appears twice

For example:

bayeslite> .show 'SELECT whz, muac, agemonths, diarrhea, fever, cough, diarrhea, vomiting, site FROM malawi;'
Traceback (most recent call last):
  File "/home/fsaad/pcp/bayeslite/build/lib.linux-x86_64-2.7/bayeslite/shell/hook.py", line 50, in __call__
    return self.func(*args)
  File "/home/fsaad/Documents/pcp/bdbcontrib/bdbcontrib/contrib_plot.py", line 223, in pairplot
    colorby=args.colorby, show_missing=args.show_missing)
  File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 466, in pairplot
    generator_name=generator_name)
  File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 116, in get_bayesdb_col_type
    return guess_column_type(df_column)
  File "/home/fsaad/pcp/bdbcontrib/bdbcontrib/plotutils.py", line 88, in guess_column_type
    pd_type = df_column.dtype
  File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2150, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'dtype'
'DataFrame' object has no attribute 'dtype'

We should at least make it fail gracefully, with an informative error message.
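
With duplicate column labels, indexing a pandas DataFrame by that label returns a DataFrame rather than a Series, which is why .dtype is missing. A sketch of an up-front check follows; check_unique_columns is a hypothetical helper that would run on the result's column names before any plotting.

```python
from collections import Counter


def check_unique_columns(column_names):
    """Raise ValueError naming any column that was selected more than once.

    Names are compared case-insensitively, on the assumption that column
    selection itself ignores case.
    """
    counts = Counter(name.lower() for name in column_names)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        raise ValueError(
            "Cannot plot: column(s) selected more than once: %s. "
            "Remove the repeats or rename them with AS."
            % ", ".join(duplicates))
```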
