mlresearchatosram / cause2e

The cause2e package provides tools for performing an end-to-end causal analysis of your data. Developed by Daniel Grünbaum (@dg46).

Home Page: https://gitlab.com/causal-inference/working-group/-/wikis/home

License: MIT License

Python 99.43% Makefile 0.26% Batchfile 0.31%

causal-analysis dowhy domain-knowledge tetrad python causal-reasoning causal-inference causal-models causality

cause2e's Issues

Sensitivity Analysis

Currently we only get point estimates of the causal effects, assuming that our causal graph is flawless. If the graph is wrong, we have no idea how bad our estimation results can get.

It would lend additional credibility to our analyses if we could specify multiple candidate graphs (e.g. because we are unsure whether one edge is present), estimate the causal effects under each graph, and report the range of estimates, similar to a confidence interval, instead of the current point estimates. A visual representation could be added to the automated PDF report.

I have already implemented a prototype; it just needs to be refactored and integrated properly.
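To make the idea concrete, here is a minimal sketch (not the prototype mentioned above). It assumes a toy simulated dataset, a hand-rolled linear estimator instead of cause2e's estimators, and a hard-coded adjustment set per candidate graph; in a real analysis the adjustment set would be derived from each graph. All function names are hypothetical.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after regressing it on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def linear_effect(data, treatment, outcome, adjust_for=None):
    """Linear effect of treatment on outcome, optionally adjusting for one covariate."""
    t, y = data[treatment], data[outcome]
    if adjust_for is not None:
        z = data[adjust_for]
        # Partial the covariate out of both variables (Frisch-Waugh-Lovell).
        t, y = residuals(z, t), residuals(z, y)
    return slope(t, y)


def effect_range(data, treatment, outcome, adjustment_per_graph):
    """Estimate the same effect under every candidate graph and report the spread."""
    estimates = {
        graph: linear_effect(data, treatment, outcome, adjust_for)
        for graph, adjust_for in adjustment_per_graph.items()
    }
    return estimates, min(estimates.values()), max(estimates.values())


# Toy data: z confounds x and y, and the true effect of x on y is 2.
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(2000)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [2 * xi + 3 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
data = {"z": z, "x": x, "y": y}

# We are unsure about the edge z -> x. If it exists, z must be adjusted for;
# if it does not, the empty adjustment set is valid under that graph.
adjustment_per_graph = {"with z -> x": "z", "without z -> x": None}
estimates, low, high = effect_range(data, "x", "y", adjustment_per_graph)
print(estimates, low, high)
```

The width of the interval between `low` and `high` shows how sensitive the estimate is to the uncertain edge, which is the quantity the report could visualise.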

PySpark dependency

Cause2e is a lightweight package except for the dependency on PySpark. Can we make this optional, given that most users will not really need it?
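One common way to do this is to import PySpark lazily, only inside the code paths that need it. The sketch below is an assumption about how that could look; `optional_import` and `read_with_spark` are hypothetical names, not part of cause2e.

```python
import importlib


def optional_import(module_name, feature):
    """Import an optional dependency only when a feature actually needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            "Install it separately to use this feature."
        ) from err


def read_with_spark(path):
    """Hypothetical Spark-based reader; PySpark is only imported when this is called."""
    pyspark_sql = optional_import("pyspark.sql", "Reading data with Spark")
    spark = pyspark_sql.SparkSession.builder.getOrCreate()
    return spark.read.csv(path, header=True, inferSchema=True)
```

With this pattern, importing the package never touches PySpark, and users who call a Spark-only function without PySpark installed get an error message that tells them what to install.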

Batch methods for non-linear estimation

The functionality for estimating multiple causal effects at once and triggering cause2e's reporting capabilities is currently only implemented for linear estimation methods. At least for the ATE, it should be realistic to expand the functionality to some non-linear methods.

Bug in edge analysis

If no remaining allowed edges exist, the result manager throws an error when saving the edge analysis to PNG. This might also happen for other boundary cases. I will fix it by adding a check for empty information.

Improve automated reporting

The automated PDF report is a helpful summary of the causal analysis. However, it has not been updated since the original proof of concept and could be made more visually appealing and easier to understand.

Add short descriptions to every page (e.g. explain the color coding in the graphs).
Check if we can group all heatmaps on one page, all full tables on one page etc. for an easier overview.
Polish axis labels and figure titles with respect to font size and text placement.
Check if additional information should be added to the report.

Exception handling

Currently, we have only minimal handling of exceptions. Given that causal analyses with cause2e have a mostly fixed structure, it would be helpful for the users to receive clearer feedback, at least whenever their input for the main analysis methods cannot be processed as desired.

Steps:

identify frequent sources of errors
write custom exception types for them to avoid confusion with other errors
handle each exception in a way that enables the users to adapt their input accordingly (or inform them whenever the desired functionality is not yet implemented)
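The steps above could be sketched roughly as follows. The exception classes, the `check_batch_method` helper and the whitelist of supported methods are all hypothetical and only illustrate the pattern.

```python
class Cause2eError(Exception):
    """Base class so users can catch all package-specific errors at once."""


class UnknownVariableError(Cause2eError):
    """A variable name in the user input is not a column of the data."""


class MethodNotSupportedError(Cause2eError):
    """The requested functionality is not implemented (yet)."""


# Illustrative whitelist; the real set of supported methods may differ.
LINEAR_METHODS = {"backdoor.linear_regression"}


def check_batch_method(method_name):
    """Fail early with an actionable message if the method is unsupported."""
    if method_name not in LINEAR_METHODS:
        raise MethodNotSupportedError(
            f"Batch estimation does not support '{method_name}'. "
            f"Supported methods: {sorted(LINEAR_METHODS)}."
        )


try:
    check_batch_method("backdoor.propensity_score_matching")
except Cause2eError as err:
    print(f"Please adapt your input: {err}")
```

A shared base class keeps package errors distinguishable from unrelated errors, while the subclasses let callers react to specific problems.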

Check variable names when passing domain knowledge

I have had recurring issues caused by typos in the variable names when passing domain knowledge.

These errors are hard to find manually after you've made them, but it is trivial to detect them in an automated way: For every variable that is used when passing domain knowledge, check if it is actually the name of a data column; otherwise, raise an error.

This check should be part of the "set_knowledge" method of the learner.
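A minimal sketch of such a check, assuming that the domain knowledge is passed as a collection of `(source, destination)` name pairs. The function name and the input format are assumptions for illustration, not the actual signature of `set_knowledge`.

```python
import difflib


def check_knowledge_variables(columns, edges):
    """Raise an error if an edge uses a name that is not a data column."""
    known = set(columns)
    used = {name for edge in edges for name in edge}
    unknown = sorted(used - known)
    if not unknown:
        return
    hints = []
    for name in unknown:
        # Suggest the closest column name to make typos easy to spot.
        close = difflib.get_close_matches(name, sorted(known), n=1)
        suggestion = f" (did you mean '{close[0]}'?)" if close else ""
        hints.append(f"'{name}'{suggestion}")
    raise ValueError("Unknown variables in domain knowledge: " + ", ".join(hints))


columns = ["smoking", "tar", "cancer"]
check_knowledge_variables(columns, {("smoking", "tar"), ("tar", "cancer")})  # passes
```

Calling it with a misspelled name such as `("smokng", "tar")` raises a `ValueError` that names the offending variable.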

size of causality matrix

Windows installation

The continuous integration pipeline works on an Ubuntu image, but switching to a Windows image results in errors (the first is Java-related, followed by a multithreading issue in pytest when the unit tests are run). I am using cause2e on a Windows machine for my own analyses, so I know that it can work. I will fix the pipeline so that CI ensures that cause2e is in a usable state for both Linux and Windows users.

Test coverage

Test coverage should be increased to at least 80% (currently about 40%). The full end-to-end analysis from reading data to generating the summary pdf of the analysis should be covered in the tests.

Replace py-causal by causal-learn?

Causal-learn "is a python package for causal discovery that implements both classical and state-of-the-art causal discovery algorithms, which is a Python translation and extension of Tetrad."

If it provides the same functionality as the (Java wrapper) py-causal library, but with the algorithms actually implemented in Python, this would free cause2e of any Java dependencies. I need to check out the package in more detail to see if it is that easy.

Allow Java VM restart

Currently, the Java VM cannot be restarted after it is shut down.
This limitation can be avoided if we run all tasks that require the VM in a separate process using the multiprocessing module: LeeKamentsky/python-javabridge#88
This would also allow us to run many causal discovery procedures in parallel. A downside is that the start of a new process adds additional overhead that will affect the runtime of an analysis.
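The process isolation could look roughly like the sketch below. It does not start a JVM; `func` stands in for any task that would start the VM, run a causal discovery and shut the VM down again. The helper name `run_in_fresh_process` is hypothetical.

```python
import multiprocessing as mp
import os


def run_in_fresh_process(func, *args, **kwargs):
    """Run func in a new worker process and return its result.

    State that cannot be reset within a process (such as a shut-down JVM)
    disappears together with the worker. func, its arguments and its
    result must be picklable.
    """
    with mp.Pool(processes=1) as pool:
        return pool.apply(func, args, kwargs)


if __name__ == "__main__":
    # Each call gets its own process, so each call could start its own VM.
    child_pid = run_in_fresh_process(os.getpid)
    print(child_pid != os.getpid())
```

The `if __name__ == "__main__"` guard matters on Windows, where new processes are spawned by re-importing the main module. Increasing `processes` and using `pool.map` would be the natural route to parallel discovery runs, at the cost of the start-up overhead mentioned above.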

Continuous integration failing

Some tests do not pass, likely because of breaking changes in pandas 2.0.0.
The warning says that we are passing a set as an indexer to a DataFrame, which is no longer supported: pandas-dev/pandas#42825
Try fixing it by requiring an older pandas version in requirements.txt.
If this does not work, either manually convert the sets to lists in our code or patch Loc as suggested here: pandas-dev/pandas#42825 (comment)
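For the manual conversion, a small helper along these lines might be enough. The name `as_list_indexer` is hypothetical, and the sketch assumes that the labels in a set are mutually comparable (e.g. all strings) so that they can be sorted.

```python
def as_list_indexer(key):
    """Convert set indexers to a sorted list; pass everything else through."""
    if isinstance(key, (set, frozenset)):
        # Sets have no defined order, so sort for a reproducible column order.
        return sorted(key)
    return key


variables = {"tar", "smoking"}
print(as_list_indexer(variables))
```

Every place that currently indexes with a set could then use `df[as_list_indexer(variables)]` or `df.loc[:, as_list_indexer(variables)]` instead.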