emjun / tisane

Specification language for generating Generalized Linear Models (with or without mixed effects) from conceptual models

License: Apache License 2.0

Python 80.55% Jupyter Notebook 19.45% CSS 0.01%

statistics statistical-analysis linear-regression linear-models generalized-linear-mixed-models generalized-linear-models domain-specific-language statistical-validity


tisane's Issues

Declare classes as abstract classes

Use Python 3 pattern:

from abc import ABCMeta, abstractmethod

class ClassName(object, metaclass=ABCMeta):
....
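A minimal sketch of the pattern in use. `AbstractVariable` appears elsewhere in these issues, but the `describe` method, the `Numeric` subclass body, and the `can_instantiate` helper are assumptions made for illustration, not Tisane's actual class hierarchy:

```python
from abc import ABCMeta, abstractmethod


class AbstractVariable(metaclass=ABCMeta):
    """Illustrative base class; Python refuses to instantiate it directly."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def describe(self) -> str:
        """Hypothetical method that every concrete subclass must implement."""


class Numeric(AbstractVariable):
    def describe(self) -> str:
        return f"numeric variable {self.name}"


def can_instantiate(cls) -> bool:
    """Return True if cls('x') succeeds, False if it raises TypeError."""
    try:
        cls("x")
    except TypeError:
        return False
    return True
```

Instantiating the abstract base raises `TypeError`, which is the guarantee this issue asks for.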

Knowledge Base Query Results

We need to make sure that atoms in the Knowledge Base come back as unspecified when they are necessary, and that we can verify them somehow --> useful logical formulation
Currently, the results depend on the SAT result. It may be possible, however, to get SAT with missing atoms (see the note above). We need to make sure this doesn't happen. If it does, we need to parse/interpret the result without relying on SAT.

There is one primary directive in Tisane. The directive instructs Tisane to synthesize a statistical model based on a set of variables and their conceptual and data measurement relationships, as expressed in Tisane (#20).

End-users must construct a study design to pass to the directive:

design = ts.Design(dv=math, ivs=[hw, race, mean_ses], data='nes88.csv') 

ts.synthesize_statistical_model(design)

Idea: An alternative may be to use the design data structure internally but only require the end-user to provide a set of variables:

ts.synthesize_statistical_model(dv=math, ivs=[hw, race, mean_ses], data='nes88.csv')
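One way the alternative could look, sketched with a stand-in `Design` dataclass rather than the real `ts.Design`. The keyword-only signature and the returned value are assumptions for illustration; the point is only that both call styles normalize to the same internal structure:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Design:
    """Stand-in for ts.Design with only the fields used in this issue."""
    dv: str
    ivs: List[str]
    data: Optional[str] = None


def synthesize_statistical_model(
    design: Optional[Design] = None, *, dv=None, ivs=None, data=None
) -> Design:
    """Accept either a Design or the raw variables and normalize to a Design."""
    if design is None:
        if dv is None or not ivs:
            raise ValueError("provide either a design or both dv and ivs")
        design = Design(dv=dv, ivs=list(ivs), data=data)
    # Model synthesis would start here; this sketch returns the normalized design.
    return design
```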

This directive emphasizes one conceptual difference between Tisane and other statistical modeling tools:

Unlike other tools that require end-users to fully specify their statistical models mathematically, Tisane figures out the effects structure of the statistical model from conceptual and data measurement relationships, asking the user for input when the modeling choice is ambiguous. Tisane requires end-users to specify the dependent and independent variables and any relevant conceptual and measurement relationships.

In an evaluation, it would be great to learn how intuitive/easy it is to use Tisane's API: Is expressing a set of relationships easier than expressing a statistical model mathematically?

[RFC] Causes vs. Associates_with

Tisane currently provides two types of conceptual relationships: causes and associates_with. This doc covers when and how to use these verbs.

If a user provides associates_with, we walk them through possible association patterns to identify the underlying causal relationships. In other words, associates_with indicates a need for disambiguation to compile to a series of causes statements.

To do this well, we need to resolve two competing interests: causal accuracy and usability. Prioritizing causal accuracy, the system should help an analyst distinguish and choose among an exhaustive list of possible causal situations. However, doing so may be unusable because the task of differentiating among numerous possible causal situations may be unrealistic for analysts unfamiliar with causality. These concerns do not seem insurmountable.

With an infinite number of hidden variables, there are an infinite number of possible causal relationships. We could restrict the number of hidden variables an analyst considers. This decision compromises causal accuracy for usability. If we had a justifiable cap on hidden variables, it may be worthwhile to take this approach.

Another perspective: If the goal is to translate each associates_with into a set of causes, why provide associates_with at all?

The primary reason I wanted to provide both was because of the following:

Analysts are sometimes unsure about the causal edges in their conceptual models. This uncertainty can be due to their own lack of knowledge or because the relationships are hypothesized but not known and now the analysts want to see if data supports the hypothesized relationships.
There may be a lack of definitive evidence in a domain about some causal edges and paths (that may involve multiple variables).

In all these cases, it seems important to acknowledge what is known, what is hypothesized/the focus of inquiry, and what is asserted for the scope of the analysis. (accurate documentation, transparency)

In the current version of Tisane, analysts can express any relationships they might know or are probing into using causes. If analysts do not want to assert any causal relationships due to a perceived lack of evidence in their field, they should use associates_with. Whenever possible, analysts should use causes instead of associates_with.

Tisane's model inference process makes arguably less useful covariate selection recommendations based on associates_with relationships. Tisane looks for variables that have associates_with relationships with both one of the IVs and the DV. Tisane suggests these variables as covariates with caution, including a warning in the Tisane GUI and a tooltip explaining to analysts that associates_with edges may have additional causal confounders that are not specified or detectable with the current specification.
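The associates_with rule described above can be sketched over a plain dictionary. The function name and the data layout are assumptions for illustration; Tisane's own graph representation is not shown in this issue:

```python
from typing import Dict, List, Set


def shared_associates(
    associations: Dict[str, Set[str]], ivs: List[str], dv: str
) -> Set[str]:
    """Return variables associated with the DV and with at least one IV."""
    # Treat associations as symmetric: build an undirected neighbor map.
    neighbors: Dict[str, Set[str]] = {}
    for var, others in associations.items():
        for other in others:
            neighbors.setdefault(var, set()).add(other)
            neighbors.setdefault(other, set()).add(var)

    dv_neighbors = neighbors.get(dv, set())
    candidates: Set[str] = set()
    for iv in ivs:
        candidates |= neighbors.get(iv, set()) & dv_neighbors
    # The IVs and the DV themselves are never suggested as covariates.
    return candidates - set(ivs) - {dv}
```

For example, if `ses` is associated with both `hw` and `math` while `race` is associated only with `math`, a query with `ivs=["hw"]` and `dv="math"` suggests only `ses`.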

For the causes relationships, Tisane uses the disjunctive criteria, developed for settings where researchers may be uncertain about their causal models, to recommend possible confounders as covariates.
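A rough sketch of a disjunctive-cause style selection: keep every variable that causes an IV, the DV, or both, and drop anything downstream of an IV so that mediators are not treated as confounders. This is a textbook reading under stated assumptions (a dictionary from each cause to its effects, and the post-exposure exclusion), not necessarily what Tisane implements:

```python
from typing import Dict, List, Set


def descendants(causes: Dict[str, Set[str]], start: str) -> Set[str]:
    """All variables reachable from `start` by following causes edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in causes.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ancestors(causes: Dict[str, Set[str]], node: str) -> Set[str]:
    """All variables that directly or indirectly cause `node`."""
    return {cause for cause in causes if node in descendants(causes, cause)}


def disjunctive_covariates(
    causes: Dict[str, Set[str]], ivs: List[str], dv: str
) -> Set[str]:
    """Causes of any IV or of the DV, excluding variables downstream of an IV."""
    candidates: Set[str] = set()
    for target in [*ivs, dv]:
        candidates |= ancestors(causes, target)
    post_exposure: Set[str] = set()
    for iv in ivs:
        post_exposure |= descendants(causes, iv)
    return candidates - post_exposure - set(ivs) - {dv}
```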

We assume that the set of IVs an end-user provides in their query are the ones they are most interested in and want to treat as exposures.

What happens if the initial choice of variables could lead to confusion in interpretation of results?
We currently treat each IV as a separate exposure and combine all confounders into one model. In some cases, this may lead to interpretation confusion. For example, if the model includes two variables on the same causal path, one of the variables may appear to have no effect on the outcome even if it does (due to d-separation). We currently expect analysts to be aware of their variable selection choices and to interpret their results accurately in light of them. In their input queries, analysts should include only the variables they care about most.

Moving forward

I would like to see the following (working list, no priority given yet):

Tisane: Separate out use cases and provide language constructs for each: lack of knowledge vs. hypothesized causal edges vs. lack of definitive evidence in the domain.
- Language design: Remove associates_with, require only causes
- Provide a "gallery" or "library" of canonical graph shapes/statements they could adapt.
- Allow for inclusion of hidden variables?
- Generate multiple linear models to verify the input DAG/mechanism validation.
- Enforce variable selection that guarantees accurate inference, not just DAG/mechanism validation.
A question I keep coming back to: How usable is causal modeling to non-experts and how can we make it more usable to them?
- Study/find out what makes stating "causes" statements difficult for researchers and how to constructively support their skepticism rather than allowing them to avoid formalizing their knowledge.

Implementation changes:

[BIG] Thomas R. had some hesitation about the theoretical soundness of the disjunctive criteria. He did not expand much, but I hope to meet with him early in the winter quarter to discuss.
I could re-implement Tisane in R so that it uses DAGitty under the hood. Alternatively, I would have to see how to use DAGitty under the hood in Python.
In R, I've never created a widget/plug in, but I can look into how to do that.
Both R and Python versions could use a code-only interface, not having to rely on the GUI.

Follow-up work/Paper ideas:

Eval and improve conceptual modeling language
Eval Tisane vs. R

Weird error for Effects

In EffectSet class, it's possible to have self.interaction = InteractionEffect(effect=None) . This should be self.interaction = None.

I don't think having effect = None makes sense.

Changing this requires updating the has_<main/interaction/mixed>_effects() functions.
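A sketch of the proposed invariant, using simplified stand-ins for `EffectSet` and `InteractionEffect`. The constructor signatures and the `has_interaction_effects` name (one instance of the `has_<...>_effects()` family mentioned above) are assumptions for illustration:

```python
from typing import Optional


class InteractionEffect:
    """Stand-in effect wrapper that rejects an empty effect outright."""

    def __init__(self, effect):
        if effect is None:
            raise ValueError("use interaction=None instead of effect=None")
        self.effect = effect


class EffectSet:
    """Stand-in effect set where absence is represented by None."""

    def __init__(self, interaction: Optional[InteractionEffect] = None):
        self.interaction = interaction

    def has_interaction_effects(self) -> bool:
        # With the invariant above, a None check is sufficient.
        return self.interaction is not None
```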

🐛 Unnecessarily ask for end-user input [Design -> StatisticalModel]

When an end-user does not include a variable as a main effect or interaction effect (not at all in the final statistical model), do not ask about data transformations!

Test:

no main effect
no interaction effect
no mixed effect?

Graph created from a Design is empty

For example, if we take the following code:

    def test_more_complex(self):
        student = ts.Unit(
            "Student", attributes=[]
        )  # object type, specify data types through object type
        race = student.nominal("Race", cardinality=5, exactly=1)  # proper OOP
        ses = student.numeric("SES")
        test_score = student.nominal("Test score")
        tutoring = student.nominal("treatment")
        race.associates_with(test_score)
        student.associates_with(test_score)
        race.moderate(ses, on=test_score)
        design = ts.Design(dv=test_score, ivs=[race, ses])
        gr = design.graph
        print(gr.get_nodes())
        self.assertTrue(gr.has_variable(test_score))

the print will print an empty list, and the assertion will fail. This seems to be because graph.py requires relationships from tisane.og_variable instead of from tisane.variable, and tisane.design calls tisane.graph.Graph.add_relationship to add edges:

# from tisane/graph.py
    def add_relationship(
        self, relationship: Union[Has, Treatment, Nest, Associate, Cause, Moderate]
    ):
        if isinstance(relationship, Has):
            identifier = relationship.variable
            measure = relationship.measure
            repetitions = relationship.repetitions
            self.has(identifier, measure, relationship, repetitions)
        elif isinstance(relationship, Treatment):
            identifier = relationship.unit
            treatment = relationship.treatment
            repetitions = relationship.num_assignments
            self.treat(unit=identifier, treatment=treatment, treatment_obj=relationship)
        # ...

The types for relationship are imported from tisane.og_variable, which means that none of the relationships are added as edges.
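The failure mode can be reproduced without Tisane: two classes with the same name defined in different modules are unrelated types, so `isinstance` silently returns False and no branch fires. The module names below only mimic `tisane.og_variable` and `tisane.variable`; the modules are built in memory for illustration:

```python
import types


def make_module(name: str) -> types.ModuleType:
    """Create an in-memory module that defines its own `Has` class."""
    module = types.ModuleType(name)
    exec("class Has:\n    pass\n", module.__dict__)
    return module


og_variable = make_module("og_variable")
variable = make_module("variable")

# The design builds relationships from `variable`...
relationship = variable.Has()

# ...but the graph checks against the class imported from `og_variable`.
matches_og = isinstance(relationship, og_variable.Has)
matches_new = isinstance(relationship, variable.Has)
```

Here `matches_og` is False and `matches_new` is True, which mirrors why none of the `isinstance` branches in `add_relationship` add an edge.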

Update API to focus on variable relationships

The most recent revision attempts to make variable relationships clearer and obvious from the syntax. A nice consequence of this revision is that the conceptual differences between Tisane and existing software tools are more apparent.

Variables

An end-user expresses variables according to their data type. If the end-user later provides the data, the variable names should be the column names. For nominal or ordinal data, end-users also must specify the cardinality of variables if they do not intend to provide data. If end-users provide data, cardinality information is not required. In this case, Tisane will calculate and populate these fields internally.

Variables are observed values of a measure. Variables can be measures of interest, as in dependent and independent variables. Variables can also be id numbers that act as keys to a dataframe (e.g., participant id).

import tisane as ts

# Example 1: 
hw = ts.Numeric('Homework') # 'Homework' is the column name
race = ts.Nominal('Race', cardinality=5) # there are 5 groups/options for the variable race
math = ts.Numeric('MathAchievement') 
mean_ses = ts.Numeric('Mean_SES')
student = ts.Nominal('student id', cardinality=100) # IDs for the 100 students included in this study
school = ts.Nominal('school', cardinality=10) # IDs for schools, 10 students/school

# Example 2: 
leaf_length = ts.Numeric('length')
fertilizer = ts.Nominal('fertilizer condition', cardinality=2)
season = ts.Nominal('season', cardinality=4)
plant = ts.Nominal('plant id') 
bed = ts.Nominal('plant bed')

An end-user expresses relationships between variables that are related to domain theory (conceptual models) and data measurements.

Conceptual Relationships

There are two types of conceptual relationships: cause and associates_with

# Example 1
hw.cause(math) # Hours spent on homework causes math achievement. 
race.associates_with(math) # Math scores and race are associated with each other. 

# Example 2
fertilizer.cause(leaf_length) # Fertilizer causes leaf growth

Definitions:

cause: The LHS variable causes the RHS variable. The RHS variable cannot also cause the LHS variable.
associates_with: The LHS and RHS variables are associated/related in some way that is not causal.

Tisane provides aliases for both: cause and causes are interchangeable, as are associate_with and associates_with.

Data measurement relationships

There are three types of data measurement relationships: (1) measurement attribution, (2) treatment for experiments, and (3) data hierarchies.

Measurement attribution

# Example 1: 
student.has(hw)
student.has(race)
student.has(math)
school.has(mean_ses)

# Example 2: 
plant.has(leaf_length)

Definition:

has distinguishes "levels" of observations by attributing variables to each level. In Example 1, there are two levels: student and school. Each student has a value for homework, race, and math. Each school has a value for mean_ses.

Idea: Create a separate Data type for "ID" and enforce that only variables of type "ID" can have other variables.

Treatment

End-users can express experimental treatments/manipulations.

# Example 2: 
fertilizer.treats(bed)

Only Example 2 is an experiment. Each bed is treated with a fertilizer. In other words, fertilizer is a bed-level manipulation.

Definition:

treats expresses the explicit/intentional manipulation of variables in an experiment. X.treats(Y) is internally equivalent to Y.has(X), which means that each Y has an observation for X.

Idea: Check that the LHS variable of treats has a causal relationship (in the graph) with the DV? And keep treats and has different from one another.

Data hierarchies

Data can be clustered or nested. Tisane provides support for expressing two possible sources of clustering: (1) repeated measures and (2) nested relationships.

# Example 1 
student.nest_under(school) # Students belong to a school. Students within a school might also cluster more than between schools. 

# Example 2 
plant.nest_under(bed) # Plants belong in plant beds. 
plant.repeats(measure=leaf_length, repetitions=season) # Repeatedly measure the same plant once per season

Definitions:

nest_under nests one variable under another.
repeats means the LHS variable provides multiple values of the measure. Each value is enumerated/indexed by the repetitions variable (e.g., season). If a plant provides multiple measures per season, another column for indexing each measure is required.

Moderation on Nominal Not Working

import tisane as ts
import pandas as pd
import os


FILE_NAME = "schools.csv"

dir = os.path.dirname(__file__)
df = pd.read_csv(os.path.join(dir, FILE_NAME))


# Initialize Units
school = ts.Unit("schid", cardinality=10)
student = ts.Unit("stuid", cardinality=96)


homework = student.ordinal("homework", order=[0, 1, 2, 3, 4, 5, 6, 7])

# school variables
school_size = school.ordinal("scsize", order=[2,3,4,6])
school_region = school.nominal("region")
school_type = school.ordinal("sctype", order=[1, 4])
public = school.nominal("public")

# Define relationships
public.causes(homework)
public.moderates(school_region, on=homework)
school_size.associates_with(homework)
school_type.associates_with(homework)

design = ts.Design(dv=school_size, ivs=[school_region]).assign_data(df)

ts.infer_statistical_model_from_design(design=design)

When update concepts, not reflected in ConceptGraph

This was a bug.

Seems that issue was that we were deepcopying graphs and nodes when updating the graphs, so the references to objects in the test cases were "older" objects that we updated but were not in the ConceptGraphs. I removed deepcopies in the latest commit.
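A small reproduction of the effect described above, with stand-in `Concept` and `ConceptGraph` classes (their attributes and method names are assumptions for illustration). Updating the caller's object after a deep copy leaves the graph's copy unchanged:

```python
import copy


class Concept:
    def __init__(self, name: str):
        self.name = name
        self.variable = None


class ConceptGraph:
    def __init__(self):
        self.nodes = []

    def add_by_copy(self, concept: Concept) -> None:
        # The graph now holds a separate object from the caller's.
        self.nodes.append(copy.deepcopy(concept))

    def add_by_reference(self, concept: Concept) -> None:
        # The graph and the caller share one object.
        self.nodes.append(concept)


concept = Concept("SES")
copied = ConceptGraph()
copied.add_by_copy(concept)
shared = ConceptGraph()
shared.add_by_reference(concept)

# Update the caller's object after it was added to both graphs.
concept.variable = "numeric"
```

After the update, `copied.nodes[0].variable` is still None while `shared.nodes[0].variable` is `"numeric"`.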

Automatic data schema detection

Rather than requiring end-users to specify the categories in a nominal variable, for example, can we detect the categories and then ask an end-user to verify or at least assume that data is clean and then not require interaction?

This seems particularly helpful for an interactive system...
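A minimal sketch of category detection, assuming rows shaped like the dictionaries that `csv.DictReader` yields. The function name and the cleaning rules (trim whitespace, skip blanks) are assumptions; a real implementation would still need a verification step with the end-user:

```python
from typing import Dict, Iterable, List


def detect_categories(rows: Iterable[Dict[str, str]], column: str) -> List[str]:
    """Return the distinct non-empty values of `column`, in first-seen order."""
    seen: Dict[str, None] = {}
    for row in rows:
        value = row.get(column)
        if value is None or value.strip() == "":
            continue  # treat missing or blank cells as absent, not as a category
        seen.setdefault(value.strip(), None)
    return list(seen)
```

The length of the returned list would give the cardinality that end-users otherwise have to supply by hand.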

Better type annotations and type checking

function parameter types, return types, etc.

Broken links in PyPi

The links for Jump to see a tutorial here and see some examples here appear broken on https://pypi.org/project/tisane/.

Add info about Dot graph generation to GRAPH_VIS.md

Change code generation module

The docs are inconsistent

API_OVERVIEW.md contains a "Has" relationship which is no longer supported.

Knowledge Base Class

Move some functions (e.g., get_concept_constraints and get_effect_set_constraints) into a helper file. These are not essential to KnowledgeBase class although they are helpful for using the KnowledgeBase

Interactions to support/incorporate

If the user declares concepts but not variables (especially data types), ask for input rather than erroring out. --> Some kind of hook for checking and then asking for additional input.
Ex. When querying the knowledge base, check that a Variable is declared before verifying/solving. If the Variable is not declared, fall back to interactivity.
Elicit weights for the edges (causes, correlations) that the user wants to include in the final statistical model and then maximize these?
Presenting differences between models (the set of effects for each model may differ): visual pattern matching (make similarities easy to look over and differences more prominent; something from the NLP/error analysis literature?)

Milestones for interaction:
Low fidelity pass for interactivity: command line interaction (pdb)

Implementation ideas:

Could we use the listener pattern somehow such that missing info is logged somewhere, updates to info are logged somewhere, etc.?
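One possible shape for this, sketched as a tiny publish/subscribe log. The `EventLog` class and the event names are hypothetical; nothing here reflects existing Tisane code:

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventLog:
    """Routes named events (e.g., missing info) to registered listeners."""

    def __init__(self):
        self._listeners: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, event: str, listener: Callable[[str], None]) -> None:
        self._listeners[event].append(listener)

    def publish(self, event: str, detail: str) -> None:
        for listener in self._listeners[event]:
            listener(detail)


missing: List[str] = []
log = EventLog()
log.subscribe("missing_variable", missing.append)
log.publish("missing_variable", "SES has no declared data type")
log.publish("updated_variable", "no listener registered, so nothing happens")
```

An interactive front end could subscribe to the missing-info event and prompt the user, while a batch run could subscribe a listener that raises instead.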

Collecting Assertions

Right now, collecting and formatting assertions from properties of variables and effects sets is blended together into one step (see Tisane.collect_assertions and helper methods)

Seems to me that the right/better design would be:
collect assertions -> dict [in Tisane class where going to be controlling modeling?]
format assertions -> can pick ASP vs. SMT (hypothetically) although we are focusing on ASP [helper method]
query with assertions -> knowledge base [KnowledgeBase is just a wrapper around the ASP solving process]
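The three steps can be sketched as separate functions. All names and the dictionary layout are assumptions for illustration; the `solve` argument stands in for whatever the KnowledgeBase wraps, and the facts use the usual ASP form `predicate(atom).`:

```python
from typing import Callable, Dict, List


def collect_assertions(variables: Dict[str, str]) -> Dict[str, List[str]]:
    """Step 1: gather properties into a solver-agnostic dict (dtype -> names)."""
    assertions: Dict[str, List[str]] = {}
    for name, dtype in variables.items():
        assertions.setdefault(dtype, []).append(name)
    return assertions


def format_as_asp(assertions: Dict[str, List[str]]) -> List[str]:
    """Step 2: render the dict as ASP facts; an SMT formatter could sit beside it."""
    facts = []
    for predicate in sorted(assertions):
        for name in sorted(assertions[predicate]):
            facts.append(f"{predicate}({name.lower()}).")
    return facts


def query(facts: List[str], solve: Callable[[str], object]) -> object:
    """Step 3: hand the formatted program to the solver wrapper."""
    return solve("\n".join(facts))
```

Because only the second step knows about ASP syntax, swapping the target formalism would not touch collection or querying.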

Pro:

makes code more modular and therefore more extensible

Con:

more work to do 😛
update, add new tests

Related Issues: Issue #9, #13

Tests to add

Concepts: getting variable names

In general, create tests for each class, make sure include tests for each function

Fixes to graph visualization in tikz + dot!

causes and associates edges should be the same style (solid) and different from has (dotted), nests (dashed)
dependent variables should be filled in with "light-ish" grey (i.e., grey!30) -->
def get_causes_associates_tikz_graph(self) --> def get_causes_associates_tikz_graph(self, dv: AbstractVariable)

Algorithm for (Interactive) solving for valid statistical models

Similar to "sign posting" programming method

Start modeling --> returns intermediate (file, assertions collected so far)

collect all assertions from variables, effect set

Finish modeling --> takes start modeling output as input and incrementally calls solver; if no error, return final stdout/call return StatisticalModels

query and collect assertions that need to know in order to pick valid statistical model

Inferring candidate main effects

As of Dec. 03, relying primarily on the conceptual graph to suggest main effects. Could consider additions or alternatives, such as:

checking for type of edge (causal, correlational) in conceptual graph
looking at the dataset
may depend on "task" for explanation or prediction

Checking for data properties/assertions

DSL:

provide some sort of documentation to show which properties can apply.

Interface:

fade out the invalid assertions -- E.g., if variable is nominal, cannot be normally distributed.
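A sketch of how the interface could decide which assertions to fade out. The property names in the table are placeholders chosen for illustration, not a list of properties Tisane supports:

```python
from typing import Dict, Iterable, List, Set

# Hypothetical table of which properties make sense for each data type.
VALID_PROPERTIES: Dict[str, Set[str]] = {
    "nominal": {"cardinality"},
    "ordinal": {"cardinality", "order"},
    "numeric": {"normally_distributed", "homoscedastic"},
}


def invalid_assertions(dtype: str, asserted: Iterable[str]) -> List[str]:
    """Return the asserted properties that do not apply to `dtype`."""
    return sorted(set(asserted) - VALID_PROPERTIES[dtype])
```

The same table could double as the DSL documentation of which properties apply to which variable types.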

Tolerance for statistical tests that check for data properties

Ideas:

Use some p-value range
Use a threshold for test statistic values (lower and upper bound)
Some kind of global optimization for tolerance that is allowed in model as a whole?
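The p-value range idea could look like the following sketch, which assumes a test whose null hypothesis is that the property holds (as in common normality tests). The default `alpha` and `margin` values are arbitrary placeholders, and the three-way outcome is an assumption about how ambiguity might be surfaced to the user:

```python
def check_property(p_value: float, alpha: float = 0.05, margin: float = 0.02) -> str:
    """Classify a test result as 'holds', 'violated', or 'ask_user'."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value > alpha + margin:
        return "holds"      # no evidence against the property
    if p_value < alpha - margin:
        return "violated"   # clear evidence against the property
    return "ask_user"       # inside the tolerance band: defer to the analyst
```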

RFC on strategy/design doc for Tisane R

Goal: Create an R version of Tisane

Considerations:

Keep user-facing API as R-idiomatic as possible
Reduce duplicate maintenance efforts, so that changes to Tisane in Python will also improve/update the Tisane R package.

Pipeline at 10,000 ft

R API (user input code) --> Python script --> JSON --> Tisane GUI --> R Code (output statistical modeling code)

Note: The new part is R API --> Python script. The rest is already how Tisane (Python implementation) works.
Put another way, the goal is to "transpile" R into Python.

How to compile/transpile R into Python?

Strategy 1: Build up internal graph IR in R, traverse graph to produce Python code
Strategy 2: Parse R script into AST, traverse AST, generate Python code from AST
Strategy 3: Build up internal graph IR in R, output graph IR in some format (maybe DOT or something like that), read in graph output, write Python code from graph

In all of these: Key thing is to control Python script execution through a bash script, which we can call from R.

Current/next steps

As of January 18, 2022: I opt for Strategy 1 first because (i) I suspect the syntax of Tisane is likely to change more than the graph IR and (ii) outputting the graph to read it back in might not be necessary.
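The traversal step of Strategy 1 can be illustrated in Python, even though the real graph IR would live in R. This sketch assumes the IR reduces to a list of (name, data type) nodes and (source, verb, target) edges, and it emits code in the `ts.Numeric(...)` style used in the API examples in these issues; the function name and IR layout are hypothetical:

```python
from typing import List, Tuple

Node = Tuple[str, str]        # (variable name, Tisane data type)
Edge = Tuple[str, str, str]   # (source, relationship verb, target)


def emit_python(nodes: List[Node], edges: List[Edge]) -> str:
    """Walk the graph IR and emit a Python script as a string."""
    lines = ["import tisane as ts", ""]
    for name, dtype in nodes:
        lines.append(f"{name} = ts.{dtype}({name!r})")
    for source, verb, target in edges:
        lines.append(f"{source}.{verb}({target})")
    return "\n".join(lines)


script = emit_python(
    nodes=[("hw", "Numeric"), ("math", "Numeric")],
    edges=[("hw", "causes", "math")],
)
```

The resulting string could be written to a file and executed by the bash script mentioned above.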

TODOS related to Tisane R:

@emjun: write R API --> Python script
@audreyseo: revise code generation (Tisane GUI --> code) (see related issue)

Dataset, DataVector classes; Data schema/property inference

Automatically infer data properties and type from Dataset
Use Dataset, DataVector classes -- think through and implement them

Try SAT and ASP formulation of StatisticalModel Selection problem

12.03: Start trying SAT formulation using Z3

TODO:

Try ASP -- see which one is easy, etc.

Other:

[12.03] talked w Alan Borning about possible solvers and logical formulations --> SAT seems to make a lot of sense, finite domain constraints

Explanation of effect parameter to causes is confusing

In API_OVERVIEW.md, the description of the effect parameter to causes is effect: tisane.variable.AbstractVariable -- the cause data variable. Is effect not supposed to be the result of the cause-ing variable?

Effect Set: Change to include Concepts, rather than Concept names?

As of Jan. 5, 2021, the Effect Set uses strings of Concept names (due to how we generate effects from the conceptual graph, and ultimately how we create nodes in the concept graph).

To make it easier/facilitate more seamless modeling from Effect Set (see model function in Tisane class), it may make more sense to use concepts, rather than strings?