emjun / tisane

Specification language for generating Generalized Linear Models (with or without mixed effects) from conceptual models

License: Apache License 2.0

Python 80.55% Jupyter Notebook 19.45% CSS 0.01%
statistics statistical-analysis linear-regression linear-models generalized-linear-mixed-models generalized-linear-models domain-specific-language statistical-validity

tisane's Issues

Knowledge Base Query Results

  • Need to make sure that atoms in the Knowledge Base come back as unspecified when they are necessary, and that we can verify them somehow --> useful logical formulation
  • Currently, the results depend on the SAT result. It may be possible, however, to get SAT even with missing atoms (see above note). We need to make sure this doesn't happen. If it does, we need to parse/interpret the result without relying on SAT.

Update Query API

There is one primary directive in Tisane. The directive instructs Tisane to synthesize a statistical model based on a set of variables and their conceptual and data measurement relationships, as expressed in Tisane (#20).

End-users must construct a study design to pass to the directive:

design = ts.Design(dv=math, ivs=[hw, race, mean_ses], data='nes88.csv') 

ts.synthesize_statistical_model(design)

Idea: An alternative may be to use the design data structure internally but only require the end-user to provide a set of variables:

ts.synthesize_statistical_model(dv=math, ivs=[hw, race, mean_ses], data='nes88.csv')
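
A minimal sketch of how the keyword-argument form could wrap the design internally. The Design dataclass below is a hypothetical stand-in, not Tisane's real class, and the function only returns the design instead of running synthesis.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Design:
    """Hypothetical stand-in for ts.Design (not Tisane's real class)."""
    dv: str
    ivs: List[str]
    data: Optional[str] = None


def synthesize_statistical_model(design: Optional[Design] = None, *,
                                 dv: Optional[str] = None,
                                 ivs: Optional[List[str]] = None,
                                 data: Optional[str] = None) -> Design:
    """Accept a prebuilt Design or the raw pieces; always work on a Design."""
    if design is None:
        if dv is None or not ivs:
            raise ValueError("Provide either a design or both dv and ivs")
        design = Design(dv=dv, ivs=list(ivs), data=data)
    # Real model synthesis would start here; the sketch just returns the design.
    return design


built = synthesize_statistical_model(dv="math", ivs=["hw", "race"], data="nes88.csv")
```

With this shape, both call styles stay valid, so the design data structure can remain an internal detail without breaking existing code.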

This directive emphasizes one conceptual difference between Tisane and other statistical modeling tools:

  1. Unlike other tools that require end-users to fully specify their statistical models mathematically, Tisane figures out the statistical model effects structure based on conceptual and data measurement relationships, asking the user for input in the face of modeling ambiguity. Tisane requires end-users to specify the dependent and independent variables and any relevant conceptual and measurement relationships.

In an evaluation, it would be great to learn how intuitive/easy it is to use Tisane's API: Is expressing a set of relationships easier than expressing a statistical model mathematically?

[RFC] Causes vs. Associates_with

Tisane currently provides two types of conceptual relationships: causes and associates_with. This doc covers when and how to use these verbs.

If a user provides associates_with, we walk them through possible association patterns to identify the underlying causal relationships. In other words, associates_with indicates a need for disambiguation to compile to a series of causes statements.

To do this well, we need to resolve two competing interests: causal accuracy and usability. Prioritizing causal accuracy, the system should help an analyst distinguish and choose among an exhaustive list of possible causal situations. However, doing so may be unusable because the task of differentiating among numerous possible causal situations may be unrealistic for analysts unfamiliar with causality. These concerns do not seem insurmountable.

With an infinite number of hidden variables, there are an infinite number of possible causal relationships. We could restrict the number of hidden variables an analyst considers. This decision compromises causal accuracy for usability. If we had a justifiable cap on hidden variables, it may be worthwhile to take this approach.

Another perspective: If the goal is to translate each associates_with into a set of causes, why provide associates_with at all?

The primary reason I wanted to provide both was because of the following:

  • Analysts are sometimes unsure about the causal edges in their conceptual models. This uncertainty can be due to their own lack of knowledge or because the relationships are hypothesized but not known and now the analysts want to see if data supports the hypothesized relationships.
  • There may be a lack of definitive evidence in a domain about some causal edges and paths (that may involve multiple variables).

In all these cases, it seems important to acknowledge what is known, what is hypothesized/the focus of inquiry, and what is asserted for the scope of the analysis. (accurate documentation, transparency)

In the current version of Tisane, analysts can express any relationships they might know or are probing into using causes. If analysts do not want to assert any causal relationships due to a perceived lack of evidence in their field, they should use associates_with. Whenever possible, analysts should use causes instead of associates_with.

Tisane's model inference process makes arguably less useful covariate selection recommendations based on associates_with relationships. Tisane looks for variables that have associates_with relationships with both one of the IVs and the DV. Tisane suggests these variables as covariates with caution, including a warning in the Tisane GUI and a tooltip explaining to analysts that associates_with edges may have additional causal confounders that are not specified or detectable with the current specification.
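
The lookup described above can be sketched as follows. This is an illustration only: the function name and the plain tuple representation of associates_with edges are assumptions, not Tisane's internals.

```python
from typing import Dict, Iterable, List, Set, Tuple


def associated_covariates(associations: Iterable[Tuple[str, str]],
                          ivs: List[str], dv: str) -> List[str]:
    """Return variables associated with both the DV and at least one IV."""
    # associates_with is symmetric, so record each edge in both directions.
    neighbors: Dict[str, Set[str]] = {}
    for a, b in associations:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    candidates = []
    for var, linked in neighbors.items():
        if var == dv or var in ivs:
            continue  # the DV and IVs themselves are never covariates
        if dv in linked and any(iv in linked for iv in ivs):
            candidates.append(var)
    return sorted(candidates)


edges = [("ses", "math"), ("ses", "hw"), ("age", "math"), ("race", "math")]
suggested = associated_covariates(edges, ivs=["hw"], dv="math")
```

Here only ses is suggested, because age and race are associated with the DV but with none of the IVs.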

For the causes relationships, Tisane uses the disjunctive criteria, developed for settings where researchers may be uncertain about their causal models, to recommend possible confounders as covariates.
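
A minimal sketch of this selection, assuming the basic form of the criterion (adjust for every pre-exposure variable that is a cause of the exposure, the outcome, or both) and assuming "cause" means any ancestor along causes edges. The function names are hypothetical, and Tisane's actual implementation may differ.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def reachable(edges: Iterable[Edge], start: str) -> Set[str]:
    """Nodes reachable from start by following edges forward."""
    children: Dict[str, Set[str]] = {}
    for cause, effect in edges:
        children.setdefault(cause, set()).add(effect)
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def disjunctive_covariates(causes: List[Edge], exposure: str, outcome: str) -> List[str]:
    """Causes of the exposure or the outcome, excluding post-exposure variables."""
    reversed_edges = [(effect, cause) for cause, effect in causes]
    causes_of_either = reachable(reversed_edges, exposure) | reachable(reversed_edges, outcome)
    post_exposure = reachable(causes, exposure)  # mediators and other descendants
    return sorted(causes_of_either - post_exposure - {exposure, outcome})


causes = [("ses", "hw"), ("ses", "math"), ("hw", "effort"),
          ("effort", "math"), ("age", "math")]
covariates = disjunctive_covariates(causes, exposure="hw", outcome="math")
```

In this toy graph, ses and age are selected, while effort is left out because it lies downstream of the exposure.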

We assume that the set of IVs an end-user provides in their query are the ones they are most interested in and want to treat as exposures.

What happens if the initial choice of variables could lead to confusion in interpretation of results?
We currently treat each IV as a separate exposure and combine all confounders into one model. In some cases, this may lead to confusion in interpretation. For example, if the model includes two variables on the same causal path, the upstream variable may appear to have no effect on the outcome even if it does, because conditioning on the downstream variable blocks that path (d-separation). We currently expect analysts to be aware of this and to interpret their results in light of their variable selection choices. In their input queries, analysts should include only the variables they care about most.
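
One way to surface this situation before fitting a model is to check whether any IV is a causal descendant of another IV. The sketch below assumes causes edges are available as (cause, effect) pairs. The helper names are hypothetical, and this check is not something Tisane is stated to perform.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def descendants(edges: Iterable[Edge], start: str) -> Set[str]:
    """All variables reachable from start along causes edges."""
    children: Dict[str, Set[str]] = {}
    for cause, effect in edges:
        children.setdefault(cause, set()).add(effect)
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def ivs_on_same_path(edges: List[Edge], ivs: List[str]) -> List[Tuple[str, str]]:
    """Return (upstream, downstream) IV pairs that share a causal path."""
    pairs = []
    for upstream in ivs:
        reach = descendants(edges, upstream)
        for downstream in ivs:
            if downstream != upstream and downstream in reach:
                pairs.append((upstream, downstream))
    return pairs


edges = [("tutoring", "hw"), ("hw", "math")]
flagged = ivs_on_same_path(edges, ivs=["tutoring", "hw"])
```

A non-empty result could trigger a warning that the upstream variable's estimated effect excludes whatever flows through the downstream one.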

Moving forward

I would like to see the following (working list, no priority given yet):

  • Tisane: Separate out use cases and provide language constructs for each: lack of knowledge vs. hypothesized causal edges vs. lack of definitive evidence in the domain.
    • Language design: Remove associates_with, require only causes
    • Provide a "gallery" or "library" of canonical graph shapes/statements they could adapt.
    • Allow for inclusion of hidden variables?
    • Generate multiple linear models to verify the input DAG/mechanism validation.
    • Enforce variable selection that guarantees accurate inference, not just DAG/mechanism validation.
  • A question I keep coming back to: How usable is causal modeling to non-experts and how can we make it more usable to them?
    • Study/find out what makes stating "causes" statements difficult for researchers and how to constructively support their skepticism rather than allowing them to avoid formalizing their knowledge.

Implementation changes:

  • [BIG] Thomas R. had some hesitation about the theoretical soundness of the disjunctive criteria. He did not expand much, but I hope to meet with him early in the winter quarter to discuss.
  • I could re-implement Tisane in R so that it uses DAGitty under the hood. Would have to see how to use DAGitty under the hood in Python.
  • In R, I've never created a widget/plug in, but I can look into how to do that.
  • Both R and Python versions could use a code-only interface, not having to rely on the GUI.

Follow-up work/Paper ideas:

  • Eval and improve conceptual modeling language
  • Eval Tisane vs. R

Weird error for Effects

In the EffectSet class, it's possible to have self.interaction = InteractionEffect(effect=None). This should be self.interaction = None.

I don't think having effect = None makes sense.

Changing this requires updating the has_<main/interaction/mixed>_effects() functions.
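
A sketch of what the proposed change could look like. The classes below are simplified, hypothetical stand-ins for EffectSet and InteractionEffect, not the real definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InteractionEffect:
    """Hypothetical stand-in: an interaction always wraps a real effect."""
    effect: List[str]

    def __post_init__(self) -> None:
        if self.effect is None:
            raise ValueError("Use interaction=None instead of effect=None")


@dataclass
class EffectSet:
    """Hypothetical stand-in: absent effects are represented by None."""
    main: Optional[List[str]] = None
    interaction: Optional[InteractionEffect] = None

    def has_main_effects(self) -> bool:
        return self.main is not None

    def has_interaction_effects(self) -> bool:
        # With the change, a None check is enough; no need to inspect .effect.
        return self.interaction is not None


without = EffectSet(main=["hw"])
with_interaction = EffectSet(main=["hw"], interaction=InteractionEffect(["hw", "race"]))
```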

Graph created from a Design is empty

For example, if we take the following code:

    def test_more_complex(self):
        student = ts.Unit(
            "Student", attributes=[]
        )  # object type, specify data types through object type
        race = student.nominal("Race", cardinality=5, exactly=1)  # proper OOP
        ses = student.numeric("SES")
        test_score = student.nominal("Test score")
        tutoring = student.nominal("treatment")
        race.associates_with(test_score)
        student.associates_with(test_score)
        race.moderate(ses, on=test_score)
        design = ts.Design(dv=test_score, ivs=[race, ses])
        gr = design.graph
        print(gr.get_nodes())
        self.assertTrue(gr.has_variable(test_score))

the print call outputs an empty list, and the assertion fails. This seems to be because graph.py imports its relationship types from tisane.og_variable instead of from tisane.variable, while tisane.design calls tisane.graph.Graph.add_relationship to add edges:

# from tisane/graph.py
    def add_relationship(
        self, relationship: Union[Has, Treatment, Nest, Associate, Cause, Moderate]
    ):
        if isinstance(relationship, Has):
            identifier = relationship.variable
            measure = relationship.measure
            repetitions = relationship.repetitions
            self.has(identifier, measure, relationship, repetitions)
        elif isinstance(relationship, Treatment):
            identifier = relationship.unit
            treatment = relationship.treatment
            repetitions = relationship.num_assignments
            self.treat(unit=identifier, treatment=treatment, treatment_obj=relationship)
        # ...

The types for relationship are imported from tisane.og_variable, which means that none of the relationships are added as edges.
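
The failure mode can be reproduced in isolation. In this sketch, two classes share the name Has but are distinct objects, which stands in for tisane.og_variable.Has versus tisane.variable.Has; everything else is a simplified assumption.

```python
def make_has_class():
    """Create a fresh class named Has, mimicking one defined in a separate module."""
    class Has:
        def __init__(self, variable, measure):
            self.variable = variable
            self.measure = measure
    return Has


OgHas = make_has_class()   # stands in for tisane.og_variable.Has
NewHas = make_has_class()  # stands in for tisane.variable.Has

edges = []


def add_relationship(relationship):
    """Simplified add_relationship: only recognizes the og_variable type."""
    if isinstance(relationship, OgHas):
        edges.append((relationship.variable, relationship.measure))
    # No else branch: unrecognized types are dropped without any error.


add_relationship(NewHas("student", "race"))
dropped_silently = len(edges) == 0
```

Because the isinstance chain has no fallback branch, the mismatch produces an empty graph rather than an exception. A final else that raises TypeError would make this kind of import mix-up visible immediately.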

Update API to focus on variable relationships

The most recent revision attempts to make variable relationships clearer and obvious from the syntax. A nice consequence of this revision is that the conceptual differences between Tisane and existing software tools are more apparent.

Variables

An end-user expresses variables according to their data type. If the end-user later provides the data, the variable names should be the column names. For nominal or ordinal data, end-users also must specify the cardinality of variables if they do not intend to provide data. If end-users provide data, cardinality information is not required. In this case, Tisane will calculate and populate these fields internally.

Variables are observed values of a measure. Variables can be measures of interest, as in dependent and independent variables. Variables can also be id numbers that act as keys to a dataframe (e.g., participant id).

import tisane as ts

# Example 1: 
hw = ts.Numeric('Homework') # 'Homework' is the column name
race = ts.Nominal('Race', cardinality=5) # there are 5 groups/options for the variable race
math = ts.Numeric('MathAchievement') 
mean_ses = ts.Numeric('Mean_SES')
student = ts.Nominal('student id', cardinality=100) # IDs for the 100 students included in this study 
school = ts.Nominal('school', cardinality=10) # IDs for schools, 10 students/school

# Example 2: 
leaf_length = ts.Numeric('length')
fertilizer = ts.Nominal('fertilizer condition', cardinality=2)
season = ts.Nominal('season', cardinality=4)
plant = ts.Nominal('plant id') 
bed = ts.Nominal('plant bed') 

An end-user expresses relationships between variables that are related to domain theory (conceptual models) and data measurements.

Conceptual Relationships

There are two types of conceptual relationships: cause and associates_with.

# Example 1
hw.cause(math) # Hours spent on homework causes math achievement. 
race.associates_with(math) # Math scores and race are associated with each other. 

# Example 2
fertilizer.cause(leaf_length) # Fertilizer causes leaf growth

Definitions:

  • cause: The LHS variable causes the RHS variable. The RHS variable cannot also cause the LHS variable.
  • associates_with: The LHS and RHS variables are associated/related in some way that is not causal.

Tisane provides aliases for both: cause/causes and associate_with/associates_with.

Data measurement relationships

There are three types of data measurement relationships: (1) measurement attribution, (2) treatment for experiments, and (3) data hierarchies.

Measurement attribution

# Example 1: 
student.has(hw)
student.has(race)
student.has(math)
school.has(mean_ses)

# Example 2: 
plant.has(leaf_length)

Definition:

  • has distinguishes "levels" of observations by attributing variables to each level. In Example 1, there are two levels: student and school. Each student has a value for homework, race, and math. Each school has a value for mean_ses.

Idea: Create a separate Data type for "ID" and enforce that only variables of type "ID" can have other variables.

Treatment

End-users can express experimental treatments/manipulations.

# Example 2: 
fertilizer.treats(bed)

Only Example 2 is an experiment. Each bed is treated with a fertilizer. In other words, fertilizer is a bed-level manipulation.

Definition:

  • treats expresses the explicit/intentional manipulation of variables in an experiment. X.treats(Y) is internally equivalent to Y.has(X), which means that each Y has an observation for X.

Idea: Check that the LHS variable of treats has a causal relationship (in the graph) with the DV? And keep treats and has different from one another.

Data hierarchies

Data can be clustered or nested. Tisane provides support for expressing two possible sources of clustering: (1) repeated measures and (2) nested relationships.

# Example 1 
student.nest_under(school) # Students belong to a school. Students within a school may be more similar to each other than to students in other schools. 

# Example 2 
plant.nest_under(bed) # Plants belong in plant beds. 
plant.repeats(measure=leaf_length, repetitions=season) # Repeatedly measure the same plant once per season

Definitions:

  • nest_under nests one variable under another.
  • repeats means the LHS variable provides multiple values of the measure. Each value is enumerated/indexed by the repetitions variable (e.g., season). If a plant provides multiple measures per season, another column for indexing each measure is required.

Moderation on Nominal Not Working

import tisane as ts
import pandas as pd
import os


FILE_NAME = "schools.csv"

dir = os.path.dirname(__file__)
df = pd.read_csv(os.path.join(dir, FILE_NAME))


# Initialize Units
school = ts.Unit("schid", cardinality=10)
student = ts.Unit("stuid", cardinality=96)


homework = student.ordinal("homework", order=[0, 1, 2, 3, 4, 5, 6, 7])

# school variables
school_size = school.ordinal("scsize", order=[2,3,4,6])
school_region = school.nominal("region")
school_type = school.ordinal("sctype", order=[1, 4])
public = school.nominal("public")

# Define relationships
public.causes(homework)
public.moderates(school_region, on=homework)
school_size.associates_with(homework)
school_type.associates_with(homework)

design = ts.Design(dv=school_size, ivs=[school_region]).assign_data(df)

ts.infer_statistical_model_from_design(design=design)

When update concepts, not reflected in ConceptGraph

This was a bug.

The issue seems to have been that we were deep-copying graphs and nodes when updating the graphs, so the test cases held references to the original objects, which we updated but which were not the copies stored in the ConceptGraphs. I removed the deepcopies in the latest commit.
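
The effect of the deepcopies can be shown with a small, self-contained example. Node and ConceptGraph here are simplified, hypothetical stand-ins for the real classes.

```python
import copy


class Node:
    """Simplified stand-in for a concept node."""
    def __init__(self, name):
        self.name = name
        self.dtype = None


class ConceptGraph:
    """Simplified stand-in that can store a node by copy or by reference."""
    def __init__(self):
        self.nodes = {}

    def add_copy(self, node):
        # Old behavior: the graph holds its own private copy.
        self.nodes[node.name] = copy.deepcopy(node)

    def add_ref(self, node):
        # Fixed behavior: the graph and the caller share one object.
        self.nodes[node.name] = node


node = Node("SES")
copied = ConceptGraph()
copied.add_copy(node)
shared = ConceptGraph()
shared.add_ref(node)

# The caller updates its own reference after the node was added.
node.dtype = "numeric"
```

After the update, only the graph that stores the node by reference sees the new data type.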

Automatic data schema detection

Rather than requiring end-users to specify the categories in a nominal variable, for example, can we detect the categories and then ask an end-user to verify or at least assume that data is clean and then not require interaction?

  • This seems particularly helpful for an interactive system...
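
A rough sketch of the detection step using only the standard library. The column names and rows are made up for illustration, and the function is hypothetical. A real version would likely read the DataFrame the user already supplied.

```python
import csv
import io
from typing import List


def detect_categories(csv_text: str, column: str) -> List[str]:
    """Collect the distinct non-empty values of a column as candidate categories."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = {row[column].strip() for row in reader}
    values.discard("")  # treat empty cells as missing, not as a category
    return sorted(values)


# Made-up sample data; one row has a missing region.
SAMPLE = "stuid,region\n1,north\n2,south\n3,north\n4,\n"

categories = detect_categories(SAMPLE, "region")
cardinality = len(categories)
```

The detected list could then be shown to the end-user for confirmation, or accepted as-is if the data is assumed to be clean.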

Knowledge Base Class

Move some functions (e.g., get_concept_constraints and get_effect_set_constraints) into a helper file. These are not essential to the KnowledgeBase class, although they are helpful for using the KnowledgeBase.

Interactions to support/incorporate

  • If the user declares concepts but not variables (esp. data type), ask the user for input rather than erroring out. --> Some kind of hook for checking and then asking for additional input.
    Ex. When querying the knowledge base, check that the Variable is declared before verifying/solving. If the Variable is not declared, prompt the user interactively.

  • Elicit weights for edges that represent causes and correlations that the user wants to include in the final statistical model, and then maximize these?

  • Presenting differences between models (the set of effects for each model may differ): visual pattern matching (make similarities easy to look over and differences more prominent; something from the NLP/error analysis lit?)

Milestones for interaction:
Low fidelity pass for interactivity: command line interaction (pdb)

Implementation ideas:

  • Could we use the listener pattern somehow such that missing info is logged somewhere, updates to info are logged somewhere, etc.?
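
One possible shape for the listener idea, as a sketch only. The class name, event kinds, and message format are all assumptions.

```python
from typing import Callable, List, Tuple

Event = Tuple[str, str]  # (kind, detail), e.g. ("missing", "data type of SES")


class InfoLog:
    """Records missing-info and update events and notifies subscribers."""

    def __init__(self) -> None:
        self.events: List[Event] = []
        self._listeners: List[Callable[[Event], None]] = []

    def subscribe(self, listener: Callable[[Event], None]) -> None:
        self._listeners.append(listener)

    def report(self, kind: str, detail: str) -> None:
        event = (kind, detail)
        self.events.append(event)  # every event is logged
        for listener in self._listeners:
            listener(event)        # and pushed to each subscriber


prompts: List[str] = []


def queue_prompt(event: Event) -> None:
    """Listener that queues a question for the user when info is missing."""
    kind, detail = event
    if kind == "missing":
        prompts.append(f"Please specify: {detail}")


log = InfoLog()
log.subscribe(queue_prompt)
log.report("missing", "data type of SES")
log.report("update", "SES declared numeric")
```

A command-line front end and a GUI could both subscribe to the same log, which fits the low-fidelity-first milestone above.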

Collecting Assertions

Right now, collecting and formatting assertions from properties of variables and effects sets is blended together into one step (see Tisane.collect_assertions and helper methods).

Seems to me that the right/better design would be:
collect assertions -> dict [in Tisane class where going to be controlling modeling?]
format assertions -> can pick ASP vs. SMT (hypothetically) although we are focusing on ASP [helper method]
query with assertions -> knowledge base [KnowledgeBase is just a wrapper around the ASP solving process]
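
The three-step split could look roughly like this. The function names, the dict layout, and the fact syntax are illustrative assumptions, and the query step is left as a stub because it depends on the solver setup.

```python
from typing import Dict, List


def collect_assertions(variables: Dict[str, str]) -> Dict[str, List[str]]:
    """Step 1: gather properties per variable, independent of solver syntax."""
    return {name: [dtype] for name, dtype in variables.items()}


def format_assertions(assertions: Dict[str, List[str]], target: str = "asp") -> List[str]:
    """Step 2: render the collected dict for one backend (only ASP here)."""
    if target != "asp":
        raise NotImplementedError(f"No formatter for target {target!r}")
    facts = []
    for name in sorted(assertions):
        for prop in assertions[name]:
            facts.append(f"{prop}({name}).")  # ASP-style fact, e.g. numeric(hw).
    return facts


def query(facts: List[str]) -> None:
    """Step 3 (stub): hand the formatted facts to the knowledge base/solver."""
    raise NotImplementedError("Solver integration is outside this sketch")


collected = collect_assertions({"hw": "numeric", "race": "nominal"})
facts = format_assertions(collected)
```

Keeping step 1 free of solver syntax is what would make an SMT formatter a drop-in addition later.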

Pro:

  • makes code more modular and therefore more extensible

Con:

  • more work to do 😛
  • update, add new tests

Related Issues: Issue #9, #13

Tests to add

  • Concepts: getting variable names

In general, create tests for each class, make sure include tests for each function

Fixes to graph visualization in tikz + dot!

  • causes and associates edges should be the same style (solid) and different from has (dotted) and nests (dashed)
  • dependent variables should be filled in with "light-ish" grey (i.e., grey!30) --> this requires passing the DV to the function, changing the signature from
    def get_causes_associates_tikz_graph(self) to def get_causes_associates_tikz_graph(self, dv: AbstractVariable)
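
A sketch of how the style rules could be encoded when emitting TikZ. The helper names are hypothetical, and node placement is omitted. Note that the sketch uses the spelling gray!30, on the assumption that xcolor's predefined color name is gray.

```python
# Edge kind -> TikZ line style, following the rules listed above.
EDGE_STYLES = {
    "causes": "solid",
    "associates": "solid",
    "has": "dotted",
    "nests": "dashed",
}


def tikz_node(name: str, is_dv: bool = False) -> str:
    """Emit a node; the dependent variable gets a light grey fill."""
    options = "draw, fill=gray!30" if is_dv else "draw"
    return f"\\node[{options}] ({name}) {{{name}}};"


def tikz_edge(src: str, dst: str, kind: str) -> str:
    """Emit a directed edge styled by relationship kind."""
    return f"\\draw[->, {EDGE_STYLES[kind]}] ({src}) -- ({dst});"


lines = [
    tikz_node("hw"),
    tikz_node("math", is_dv=True),
    tikz_edge("hw", "math", "causes"),
]
```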

Algorithm for (Interactive) solving for valid statistical models

Similar to "sign posting" programming method

Start modeling --> returns intermediate (file, assertions collected so far)

  • collect all assertions from variables, effect set

Finish modeling --> takes start modeling output as input and incrementally calls solver; if no error, return final stdout/call return StatisticalModels

  • query and collect assertions that need to know in order to pick valid statistical model

Inferring candidate main effects

As of Dec. 03, relying primarily on conceptual graph to suggest main effects. Could consider additions or alternatives, such as:

  • checking for type of edge (causal, correlational) in conceptual graph
  • looking at the dataset
  • may depend on "task" for explanation or prediction

Checking for data properties/assertions

DSL:

  • provide some sort of documentation to show which properties can apply.

Interface:

  • fade out the invalid assertions -- E.g., if variable is nominal, cannot be normally distributed.
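
The fade-out rule could be driven by a small lookup table like the one sketched below. Only the nominal/normal-distribution pairing comes from the note above. The property names and the function are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Data type -> assertions that cannot apply to it (illustrative, incomplete).
INCOMPATIBLE: Dict[str, Set[str]] = {
    "nominal": {"normally_distributed"},
}


def partition_assertions(dtype: str, assertions: List[str]) -> Tuple[List[str], List[str]]:
    """Split assertions into (selectable, faded) for a variable of this type."""
    blocked = INCOMPATIBLE.get(dtype, set())
    selectable = [a for a in assertions if a not in blocked]
    faded = [a for a in assertions if a in blocked]
    return selectable, faded


options = ["normally_distributed", "independent_observations"]
selectable, faded = partition_assertions("nominal", options)
```

The same table could also feed the DSL documentation, so that the interface and the docs cannot drift apart.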

RFC on strategy/design doc for Tisane R

Goal: Create an R version of Tisane

Considerations:

  • Keep user-facing API as R-idiomatic as possible
  • Reduce duplicate maintenance effort, so that changes to Tisane in Python also improve/update the Tisane R package.

Pipeline at 10,000 ft

R API (user input code) --> Python script --> JSON --> Tisane GUI --> R Code (output statistical modeling code)

Note: The new part is R API --> Python script. The rest is already how Tisane (Python implementation) works.
Put another way, the goal is to "transpile" R into Python.

How to compile/transpile R into Python?

  • Strategy 1: Build up internal graph IR in R, traverse graph to produce Python code
  • Strategy 2: Parse R script into AST, traverse AST, generate Python code from AST
  • Strategy 3: Build up internal graph IR in R, output graph IR in some format (maybe DOT or something like that), read in graph output, write Python code from graph

In all of these: Key thing is to control Python script execution through a bash script, which we can call from R.

Current/next steps

As of January 18, 2022: I opt for Strategy 1 first because (i) I suspect the syntax of Tisane is likely to change more than the graph IR and (ii) outputting the graph to read it back in might not be necessary.
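
Strategy 1 could be sketched as follows. The sketch is written in Python to match the other examples in these notes, although the real traversal would live in R. The IR layout (a dict of nodes plus a list of edges) and the function name are assumptions, and the emitted calls follow the variable API shown earlier in these notes.

```python
from typing import Dict, List, Tuple

# Hypothetical graph IR: variable name -> (data type, column name),
# plus (source, relationship verb, destination) edges.
Nodes = Dict[str, Tuple[str, str]]
Edges = List[Tuple[str, str, str]]


def emit_python(nodes: Nodes, edges: Edges) -> str:
    """Traverse the IR and produce a Tisane Python script as text."""
    lines = ["import tisane as ts", ""]
    for var, (dtype, column) in nodes.items():
        lines.append(f"{var} = ts.{dtype}({column!r})")  # declarations first
    for src, verb, dst in edges:
        lines.append(f"{src}.{verb}({dst})")             # then relationships
    return "\n".join(lines)


nodes = {"hw": ("Numeric", "Homework"), "math": ("Numeric", "MathAchievement")}
edges = [("hw", "cause", "math")]
script = emit_python(nodes, edges)
```

The resulting text could be written to a file and run by the bash script mentioned above.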

TODOS related to Tisane R:

Effect Set: Change to include Concepts, rather than Concept names?

As of Jan. 5, 2021, the Effect Set uses strings of Concept names (due to how we generate effects from the conceptual graph, and ultimately how we create nodes in the concept graph).

To make it easier/facilitate more seamless modeling from Effect Set (see model function in Tisane class), it may make more sense to use concepts, rather than strings?
