coreycole / hypdb

Python 82.85% C 15.77% C++ 1.38%

hypdb's Introduction

HypDB

The core HypDB module lives in the HypDB/ directory. A web UI demo that demonstrates the capabilities of HypDB lives in the demo/ directory.

PyPI

Our package is published on PyPI here.

Paper

Our paper (published in VLDB 2018 in Rio de Janeiro) can be found here.

Contributing

We follow angular-style commit message guidelines.

To write code to solve an issue, branch off from master and give the branch a unique, descriptive name. We may open a PR at any stage of solving an issue, but should request code review only once it is ready to merge back into master. We can close the respective issue once the PR has been merged.

hypdb's Issues

feat(client/server): user can specify top k fine-grained explanations

We can calculate the top 10 on the backend, show the top 3 on the frontend by default, and let the user change k to show or hide up to 10 on the frontend.
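A minimal sketch of that split, assuming each explanation is a dict with a numeric `score` key (a hypothetical shape, not necessarily what HypDB returns). The server keeps a capped top 10 and a second helper slices out the k the user asked for:

```python
import heapq

MAX_K = 10      # cap computed on the backend
DEFAULT_K = 3   # number shown by default on the frontend


def top_explanations(explanations, limit=MAX_K):
    """Return up to `limit` explanations, highest score first."""
    return heapq.nlargest(limit, explanations, key=lambda e: e["score"])


def visible_explanations(ranked, k=DEFAULT_K):
    """Clamp the user-chosen k to [0, MAX_K] and slice the ranked list."""
    k = max(0, min(k, MAX_K))
    return ranked[:k]
```

Because the backend always sends the full capped list, changing k on the frontend would not need another request.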

feat(server): return query for further group-by most responsible

right now we have the bar chart for this, but not the sql

feat(server): output JSON for bar charts

In order to display information such as the average treatment effect, split across the grouping attribute, we need to output JSON that can be passed to ngx-charts. You can see an example of what that looks like in ZaliQL here in html and here in data
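As a rough sketch of the target shape: ngx-charts grouped bar charts take a list of `{name, series}` objects, where each series entry is a `{name, value}` pair. The input layout below (a mapping from group to per-treatment averages) and the function name are assumptions for illustration, not the existing server code.

```python
import json


def to_ngx_multi_series(averages):
    """Convert {group: {treatment_level: value}} into ngx-charts multi-series data."""
    return [
        {
            "name": str(group),
            "series": [
                {"name": str(level), "value": float(value)}
                for level, value in levels.items()
            ],
        }
        for group, levels in averages.items()
    ]


# Made-up numbers, only to show the call.
payload = json.dumps(to_ngx_multi_series({"AA": {"JFK": 12.5, "SFO": 8.0}}))
```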

bug: capitalization of where clause matters

We might need to normalize everything to lowercase, starting at the upload phases
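One possible way to do this at upload time, sketched with the standard `csv` module. It lowercases every cell, header included, which is an assumption about how aggressive the normalization should be; the where clause would then need the same treatment before comparison.

```python
import csv
import io


def normalize_csv(text):
    """Strip and lowercase every cell (header row included) of a CSV document."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([cell.strip().lower() for cell in row])
    return out.getvalue()
```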

get_respon can return outcome/treatment as the most responsible

breaks the naive_groupby for the second bar chart
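A small guard along these lines might be enough, assuming the responsibility scores are available as a mapping from attribute name to score (a hypothetical shape, the real return value of get_respon may differ):

```python
def most_responsible(scores, treatment, outcome):
    """Pick the highest-scoring attribute, ignoring the treatment and outcome columns."""
    candidates = {
        attr: score
        for attr, score in scores.items()
        if attr not in (treatment, outcome)
    }
    if not candidates:
        return None  # nothing left to group by
    return max(candidates, key=candidates.get)
```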

feat(server): output graph JSON for covariate discovery algorithm

In order to display the results of HypDB's covariate discovery algorithm, we need to output metadata for the function call that we can easily pass to the graph visualization library we decided on in #1

feat(frontend): remove treatment / outcome from labels in bar chart

Just leave the actual column name

tools(hypdb): add automated tests

This will be a big one, we should divide and conquer @pzli3

bug(frontend): user can upload invalid / non csv files

feat(server): numpy numbers are not json serializable

cast numbers to int/float before the json.dumps
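Instead of casting at every call site, a `default` hook for `json.dumps` could do the conversion in one place. The sketch below relies on the fact that numpy scalars and arrays both expose `tolist()`, and uses duck typing so it does not import numpy itself; the helper names are made up.

```python
import json


def to_builtin(obj):
    """`default` hook for json.dumps: unwrap objects that expose numpy's tolist()."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


def dumps(payload):
    """Serialize `payload`, converting numpy-like values to plain Python first."""
    return json.dumps(payload, default=to_builtin)
```

`json.dumps` only calls the hook for values it cannot serialize on its own, so plain ints, floats and strings are unaffected.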

chore: setup project scaffolding

We need to set up the basic scaffolding for the project and get the client ready to talk back and forth with the server.

refactor(hypdb): clean out unused code

feat(client): display bar charts

After completing #4, we will need to display these bar charts.

chore(hypdb): research PyPI package constraints

We need to figure out what exactly we need to do to release a package on PyPI and what constraints we need to operate in.

feat(client): page structure

As per babak:

We should have the rewrite query answer right below the naive query answer graph. Then the explanation. Then graph and rewritten query come at the bottom of the page.

I think by "rewrite query" he means the further group by most responsible. I'm not sure what he means by the explanation: the coarse & fine-grained for the most responsible? The graph needs to be the last thing on the page with nothing next to it (it's really tricky to position the graph).

It seems to me like the page can be laid out as follows:

row 1 - naive query, put naive query answer to right on same row
row 2 - further group by query, put query answer to right on same row
row 3 - total effect query, put query answer to right on same row
row 4 - direct effect query, put query answer to right on same row
row 5 - coarse and fine-grained analysis
row 6 - causal dag

can you clarify/confirm my proposed page order @bsalimi ?

feat(bias): make grouping attribute optional

Right now, it seems that the grouping attribute is required because there is always a pandas call to do the group by. If there is no grouping attribute, we should modify the bias query to just compare all treated to all control.
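To illustrate the intended behaviour without committing to the pandas code path, here is a plain-Python sketch. Rows are assumed to be dicts keyed by column name, and the function name is hypothetical.

```python
from collections import defaultdict


def average_outcome(rows, treatment, outcome, grouping=None):
    """Average `outcome` per treatment level, optionally split by `grouping`.

    With grouping=None there is one bucket per treatment level,
    i.e. all treated rows are compared to all control rows.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        if grouping is None:
            key = (row[treatment],)
        else:
            key = (row[grouping], row[treatment])
        totals[key][0] += float(row[outcome])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```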

bug(client): fine grained attribute data not shown sometimes

When running lungcancer.csv, smoking -> lung_cancer the fine grained results aren't showing sometimes. It looks like a frontend issue because the json seems correct (see attached image)

feat(server): bar chart answer for direct effect

feat(client): upload subset of csv columns

refactor(hypdb): add type hints

We should implement PEP 561

refactor(hypdb): move core hypdb code into separate top level folder

I think we should avoid creating two separate repositories for testing and issue tracking reasons.

We might want the structure to be:

demo
- client
- server
hypdb
PyPI configuration files

The hypdb directory would be where all the core code would be, but the PyPI configuration files would all be top level. When we make the project and prepare it for release on PyPI, we would omit the demo code and make the release as small as we can. I don't think it needs to be in a separate repository to do this.

bug(server): 2d bar chart ugly when responsible has many levels

legend is also not readable in this case

adult data, sex -> y

docs: digest the jupyter notebook and add comments

Definitely make note of inputs and outputs to different functions. What order they will all be called in and different stages/groups that we can call them in.

I'm assuming the basic average differences can come first so the user can see the results of their query before HypDB comes in and explains it away.

chore(client): add dependencies for dagre-d3

feat(server/client): change dag library

This will most likely be on the server and send an image to the client. Need to research which one to use.

This will replace the current dag library on the frontend.

bug(server): generated queries need line length limit

In order to have enough room to display bar charts on the same line as their respective queries, especially for data with multi-level treatment, we need to limit the maximum length of query lines.
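A simple way to enforce such a limit is the standard `textwrap` module. This is only a sketch: the 40-character width is an assumed value, and wrapping at spaces could split a quoted string literal that itself contains spaces.

```python
import textwrap

MAX_QUERY_WIDTH = 40  # assumed limit; tune to the page layout


def wrap_query(sql, width=MAX_QUERY_WIDTH):
    """Wrap each line of a generated query at word boundaries, indenting continuations."""
    wrapped = []
    for line in sql.splitlines():
        pieces = textwrap.wrap(
            line,
            width=width,
            subsequent_indent="    ",
            break_long_words=False,
        )
        wrapped.extend(pieces or [""])  # keep empty lines as they are
    return "\n".join(wrapped)
```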

See attached image

feat(client): copy UW color scheme over from ZaliQL

feat(server): generated query and bar chart answer for total effect

tools(hypdb): add travis CI

Once we have linting, type checking and automated tests all set up, we should run these in Travis. This will be helpful in the future when people would like to contribute to the project.

Blocked by #57 #58 #59

tools(hypdb): add flake8 and fix all issues

feat(client): query validation

After the user inputs a query, we should have client-side validation that will throw an error if the input query is invalid. For the first iteration of this feature, we don't have to worry about making the error messages very custom/helpful.

Blocked by #5

feat(client/server): upload csv file

An uploaded file should be saved on the server in some directory that is in the .gitignore

It should convert to JSON upon upload. It should expect the first line to be the column headers. It should confirm the column headers with the user before uploading.
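The conversion step could look roughly like this, using only the standard library. The helper names are made up, and all values stay strings because `csv` does no type inference.

```python
import csv
import io
import json


def read_headers(text):
    """Return the column names from the first line, to confirm with the user."""
    return next(csv.reader(io.StringIO(text)), [])


def csv_to_json(text):
    """Convert CSV text to a JSON array of row objects keyed by the header names."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(text))]
    return json.dumps(rows)
```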

feat(server): resource for returning csv data as json

This resource should return a json array representing the rows in the csv that was uploaded by the user. This will be used for #5 and #6

chore: setup docker for demo

For people to easily try out the demo without needing to install numerous dependencies, we will use docker (just like ZaliQL). This doesn't seem to be on the critical path quite yet, so I'm leaving it out of project v1.

feat(server): integrate covariate discovery algorithm into web server

The covariate discovery algorithm should be broken off into directory server/lib but there should be a resource in server/resources that is responsible for calling those covariate discovery functions given some JSON inputs through the web service.

bug(server): bar chart data incorrect

Right now, the ATE only returns the bar chart data for the most fine-grained chart. If I query with grouping attribute carrier I get an array of bar chart data to display. This should really be a 2D array of length 1, where the 0th element is the array I'm getting.

If I query with grouping attributes carrier and origin I should get a 2D array of length 2, where the 0th element is the array I got with only carrier as the grouping attribute, and the 1st element is the more fine-grained ATE (the array we currently get when calling with 2 grouping attributes).

In summary, we need to return the ATE for each grouping attribute starting with the first. This will always be a 2D array with length >= 1
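The requested shape can be sketched as one chart per prefix of the grouping attributes, coarsest first. The code below computes plain per-group averages of the outcome for each treatment level as a stand-in for the real ATE computation; the row format, the function name and the keys of each bar are assumptions.

```python
from collections import defaultdict


def averages_by_prefix(rows, treatment, outcome, groupings):
    """Return a 2D list: entry i holds the bars for groupings[:i + 1]."""
    charts = []
    for depth in range(1, len(groupings) + 1):
        prefix = groupings[:depth]
        totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
        for row in rows:
            key = tuple(row[g] for g in prefix) + (row[treatment],)
            totals[key][0] += float(row[outcome])
            totals[key][1] += 1
        charts.append([
            {"group": list(key[:-1]), "treatment": key[-1], "value": s / n}
            for key, (s, n) in totals.items()
        ])
    return charts
```

With this layout, querying with one grouping attribute gives a list of length 1, and adding a second attribute appends the finer-grained chart without changing the first.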

SELECT avg(departure_delay)
FROM flights
WHERE airport='JFK'
GROUP BY airline

We need to further decide how exactly we want users to enter in something like this. To keep it simple, I'm thinking

drop-down menu for the outcome column (departure_delay)
drop-down menu for the csv file
text entry for the WHERE clause with an arbitrary number of AND and OR
drop-down menu for the grouping_attribute column (airline)
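Given those four inputs, the query text could be assembled along these lines. This is a sketch for building the displayed text only: the function name is hypothetical and nothing here escapes or validates the where clause.

```python
def build_query(outcome, table, grouping, where=""):
    """Assemble the query text from the form inputs; omit WHERE when it is empty."""
    lines = [f"SELECT avg({outcome})", f"FROM {table}"]
    if where.strip():
        lines.append(f"WHERE {where.strip()}")
    lines.append(f"GROUP BY {grouping}")
    return "\n".join(lines)
```

Calling `build_query("departure_delay", "flights", "airline", "airport='JFK'")` reproduces the four-line query above, and leaving `where` empty drops the WHERE line entirely.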

bug(frontend): query misc.

query text not updating after selecting outcome
query not clearing after changing file
where clause shows up in query text even when no condition is provided

bug(client): single quotes in where clause cause double quoting from where parser

we should prevent the user from writing single quotes in their where clause