emjun / tisane
Specification language for generating Generalized Linear Models (with or without mixed effects) from conceptual models
License: Apache License 2.0
Use Python 3 pattern:
from abc import ABCMeta, abstractmethod
class ClassName(object, metaclass=ABCMeta):
    ...
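A minimal sketch of what the pattern buys in practice. The class names and the `describe` method are hypothetical and chosen only for illustration; they are not taken from the Tisane code base.

```python
from abc import ABCMeta, abstractmethod


class AbstractVariable(metaclass=ABCMeta):
    """Hypothetical base class; the name is illustrative only."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def describe(self) -> str:
        """Subclasses must say how they describe themselves."""


class Numeric(AbstractVariable):
    def describe(self) -> str:
        return f"numeric variable {self.name}"


def can_instantiate(cls) -> bool:
    """Return False when Python refuses to build an abstract class."""
    try:
        cls("x")
    except TypeError:
        return False
    return True
```

With the metaclass in place, `AbstractVariable("x")` raises `TypeError`, while the concrete `Numeric` subclass can be created as usual.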
There is one primary directive in Tisane. The directive instructs Tisane to synthesize a statistical model based on a set of variables and their conceptual and data measurement relationships, as expressed in Tisane (#20).
End-users must construct a study design to pass to the directive:
design = ts.Design(dv=math, ivs=[hw, race, mean_ses], data='nes88.csv')
ts.synthesize_statistical_model(design)
Idea: An alternative may be to use the design data structure internally but only require the end-user to provide a set of variables:
ts.synthesize_statistical_model(dv=math, ivs=[hw, race, mean_ses], data='nes88.csv')
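One way the alternative could look, sketched with stand-ins. The `Design` dataclass below is a hypothetical placeholder for `ts.Design`, variables are plain strings, and the function only returns the design it would hand on to model inference rather than synthesizing anything.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Design:
    """Hypothetical stand-in for ts.Design; holds only what the query needs."""
    dv: str
    ivs: List[str]
    data: Optional[str] = None


def synthesize_statistical_model(
    design: Optional[Design] = None, *, dv=None, ivs=None, data=None
) -> Design:
    """Accept either a ready-made Design or the raw variables.

    Returns the Design that would be passed to model inference; the real
    directive would go on to synthesize a statistical model from it.
    """
    if design is not None:
        if dv is not None or ivs is not None or data is not None:
            raise ValueError("pass either a design or dv/ivs/data, not both")
        return design
    if dv is None or not ivs:
        raise ValueError("dv and at least one iv are required")
    # The design still exists, but the end-user never has to build it.
    return Design(dv=dv, ivs=list(ivs), data=data)
```

Both call styles end up with the same internal data structure, so the rest of the pipeline would not need to know which one the end-user chose.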
This directive emphasizes one conceptual difference between Tisane and other statistical modeling tools:
In an evaluation, it would be great to learn how intuitive/easy it is to use Tisane's API: Is expressing a set of relationships easier than expressing a statistical model mathematically?
Tisane currently provides two types of conceptual relationships: causes and associates_with. This doc covers when and how to use these verbs.
If a user provides associates_with, we walk them through possible association patterns to identify the underlying causal relationships. In other words, associates_with indicates a need for disambiguation to compile to a series of causes statements.
To do this well, we need to resolve two competing interests: causal accuracy and usability. Prioritizing causal accuracy, the system should help an analyst distinguish and choose among an exhaustive list of possible causal situations. However, doing so may be unusable because the task of differentiating among numerous possible causal situations may be unrealistic for analysts unfamiliar with causality. These concerns do not seem insurmountable.
With an infinite number of hidden variables, there are an infinite number of possible causal relationships. We could restrict the number of hidden variables an analyst considers. This decision compromises causal accuracy for usability. If we had a justifiable cap on hidden variables, it may be worthwhile to take this approach.
Another perspective: If the goal is to translate each associates_with into a set of causes, why provide associates_with at all?
The primary reason I wanted to provide both was because of the following:
In all these cases, it seems important to acknowledge what is known, what is hypothesized/the focus of inquiry, and what is asserted for the scope of the analysis. (accurate documentation, transparency)
In the current version of Tisane, analysts can express any relationships they might know or are probing into using causes. If analysts do not want to assert any causal relationships due to a perceived lack of evidence in their field, they should use associates_with. Whenever possible, analysts should use causes instead of associates_with.
Tisane's model inference process makes arguably less useful covariate selection recommendations based on associates_with relationships. Tisane looks for variables that have associates_with relationships with both one of the IVs and the DV. Tisane suggests these variables as covariates with caution, including a warning in the Tisane GUI and a tooltip explaining to analysts that associates_with edges may have additional causal confounders that are not specified or detectable with the current specification.
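The selection rule described above can be sketched as follows. This is an illustration under assumptions, not Tisane's implementation: variables are plain strings, associations are given as pairs, and the function name is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def suggest_associated_covariates(
    associations: Iterable[Tuple[str, str]], ivs: List[str], dv: str
) -> Set[str]:
    """Variables associated with the DV and with at least one IV."""
    # associates_with is symmetric, so store each edge as an unordered pair.
    edges = {frozenset(pair) for pair in associations}

    def associated(a: str, b: str) -> bool:
        return frozenset((a, b)) in edges

    variables = set().union(*edges) if edges else set()
    candidates = variables - set(ivs) - {dv}
    return {
        v
        for v in candidates
        if associated(v, dv) and any(associated(v, iv) for iv in ivs)
    }


associations = [
    ("ses", "math"),
    ("ses", "hw"),
    ("race", "math"),    # associated with the DV only
    ("tutoring", "hw"),  # associated with an IV only
]
suggested = suggest_associated_covariates(associations, ivs=["hw"], dv="math")
```

Only `ses` qualifies here, because it is the only variable linked to both the IV and the DV. The caution in the GUI still applies: nothing in this rule says why the variables are associated.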
For the causes relationships, Tisane uses the disjunctive criteria, developed for settings where researchers may be uncertain about their causal models, to recommend possible confounders as covariates.
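A rough sketch of this kind of selection, reading "the disjunctive criteria" as the disjunctive cause criterion: adjust for covariates that cause the exposure, the outcome, or both. Several choices below are assumptions and not confirmed by the text: "cause" is read as any ancestor in the causal graph, variables downstream of an IV are excluded so that only pre-exposure covariates remain, and all function names are hypothetical.

```python
from typing import Dict, List, Set


def reachable(edges: Dict[str, Set[str]], start: str) -> Set[str]:
    """All nodes reachable from `start` by following `edges`."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reverse(edges: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Flip every cause -> effect edge into effect -> cause."""
    flipped: Dict[str, Set[str]] = {}
    for src, dsts in edges.items():
        for dst in dsts:
            flipped.setdefault(dst, set()).add(src)
    return flipped


def disjunctive_cause_covariates(
    causes: Dict[str, Set[str]], ivs: List[str], dv: str
) -> Set[str]:
    """Causes of any IV or of the DV, minus anything downstream of an IV."""
    caused_by = reverse(causes)
    selected = reachable(caused_by, dv)
    downstream: Set[str] = set()
    for iv in ivs:
        selected |= reachable(caused_by, iv)
        downstream |= reachable(causes, iv)
    return selected - downstream - set(ivs) - {dv}


causes = {
    "ses": {"hw", "math"},       # causes both exposure and outcome
    "motivation": {"hw"},        # causes the exposure only
    "tutoring": {"math"},        # causes the outcome only
    "hw": {"study_time"},        # study_time sits on the path hw -> math
    "study_time": {"math"},
}
covariates = disjunctive_cause_covariates(causes, ivs=["hw"], dv="math")
```

In this toy graph the result is `ses`, `motivation`, and `tutoring`; `study_time` is left out because it lies downstream of the exposure.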
We assume that the set of IVs an end-user provides in their query are the ones they are most interested in and want to treat as exposures.
What happens if the initial choice of variables could lead to confusion in interpretation of results?
We currently treat each IV as a separate exposure and combine all confounders into one model. In some cases, this may lead to interpretation confusion. For example, if the model includes two variables on the same causal path, one of the variables may appear to have no effect on the outcome even if it does (due to d-separation). We currently expect analysts to be aware of this and to interpret their results in light of their variable selection choices. In their input queries, analysts should include only the variables they care most about.
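A small worked example of the masking effect, under assumptions that are mine rather than the source's: a linear chain X -> M -> Y with independent unit-variance noise, and population (not sample) regression coefficients computed from second moments. Exact fractions are used so no simulation noise is involved.

```python
from fractions import Fraction


def ols_two_predictors(var_x, var_m, cov_xm, cov_xy, cov_my):
    """Population OLS coefficients of Y on (X, M), via the 2x2 normal equations."""
    det = var_x * var_m - cov_xm ** 2
    beta_x = (var_m * cov_xy - cov_xm * cov_my) / det
    beta_m = (var_x * cov_my - cov_xm * cov_xy) / det
    return beta_x, beta_m


# Assumed chain: M = a*X + e1, Y = b*M + e2, Var(X) = Var(e1) = Var(e2) = 1
a, b = Fraction(2), Fraction(3)
var_x = Fraction(1)
var_m = a ** 2 + 1      # Var(a*X + e1)
cov_xm = a              # Cov(X, a*X + e1)
cov_my = b * var_m      # Cov(M, b*M + e2)
cov_xy = a * b          # Cov(X, b*(a*X + e1) + e2)

# Y regressed on X alone recovers the total effect a*b.
total_effect_of_x = cov_xy / var_x

# Y regressed on X and M together: X's coefficient vanishes.
beta_x, beta_m = ols_two_predictors(var_x, var_m, cov_xm, cov_xy, cov_my)
```

Here X has a total effect of 6 on Y, yet its coefficient is exactly 0 once M is in the model, because all of X's influence flows through M.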
I would like to see the following (working list, no priority given yet):
Implementation changes:
Follow-up work/Paper ideas:
In EffectSet class, it's possible to have self.interaction = InteractionEffect(effect=None) . This should be self.interaction = None.
I don't think having effect = None makes sense.
Changing this requires updating the has_<main/interaction/mixed>_effects() functions.
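A sketch of the proposed invariant, using hypothetical stand-ins for `EffectSet` and `InteractionEffect`. The field names and the tuple representation of effects are assumptions made for the example, not the actual class definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class InteractionEffect:
    """Hypothetical stand-in: an interaction always names its variables."""
    effect: Tuple[str, ...]

    def __post_init__(self):
        if not self.effect:
            raise ValueError("an InteractionEffect needs a non-empty effect")


@dataclass
class EffectSet:
    """Hypothetical stand-in: a missing effect is None, never an empty wrapper."""
    dv: str
    main: Optional[Tuple[str, ...]] = None
    interaction: Optional[InteractionEffect] = None

    def has_main_effects(self) -> bool:
        return self.main is not None

    def has_interaction_effects(self) -> bool:
        # No need to look inside the wrapper: None is the only "absent" state.
        return self.interaction is not None
```

With a single way to say "no interaction", the `has_*` checks reduce to a `None` test.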
When an end-user does not include a variable as a main effect or interaction effect (not at all in the final statistical model), do not ask about data transformations!
Test:
For example, if we take the following code:
def test_more_complex(self):
    student = ts.Unit(
        "Student", attributes=[]
    )  # object type, specify data types through object type
    race = student.nominal("Race", cardinality=5, exactly=1)  # proper OOP
    ses = student.numeric("SES")
    test_score = student.nominal("Test score")
    tutoring = student.nominal("treatment")
    race.associates_with(test_score)
    student.associates_with(test_score)
    race.moderate(ses, on=test_score)
    design = ts.Design(dv=test_score, ivs=[race, ses])
    gr = design.graph
    print(gr.get_nodes())
    self.assertTrue(gr.has_variable(test_score))
the print will print an empty list, and the assertion will fail. This seems to be because graph.py requires relationships from tisane.og_variable instead of from tisane.variable, and tisane.design calls tisane.graph.Graph.add_relationship to add edges:
# from tisane/graph.py
def add_relationship(
    self, relationship: Union[Has, Treatment, Nest, Associate, Cause, Moderate]
):
    if isinstance(relationship, Has):
        identifier = relationship.variable
        measure = relationship.measure
        repetitions = relationship.repetitions
        self.has(identifier, measure, relationship, repetitions)
    elif isinstance(relationship, Treatment):
        identifier = relationship.unit
        treatment = relationship.treatment
        repetitions = relationship.num_assignments
        self.treat(unit=identifier, treatment=treatment, treatment_obj=relationship)
    # ...
The types for relationship are imported from tisane.og_variable, which means that none of the relationships are added as edges.
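The failure mode can be reproduced without Tisane. In this sketch, two throwaway modules stand in for `tisane.og_variable` and `tisane.variable`; each defines its own class named `Has`, and `make_module` is a helper invented for the example.

```python
import types


def make_module(name: str) -> types.ModuleType:
    """Build a throwaway module that defines its own `Has` class."""
    module = types.ModuleType(name)
    module.Has = type("Has", (), {"__module__": name})
    return module


og_variable = make_module("og_variable")  # stand-in for tisane.og_variable
variable = make_module("variable")        # stand-in for tisane.variable

# The relationship is created from the newer module ...
relationship = variable.Has()

# ... but the isinstance check uses the class imported from the older one.
matches_old_import = isinstance(relationship, og_variable.Has)
matches_new_import = isinstance(relationship, variable.Has)
```

`matches_old_import` is `False` even though both classes are called `Has`, so every branch of an `if isinstance(...)` chain written against the old import is skipped silently.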
The most recent revision attempts to make variable relationships clearer and obvious from the syntax. A nice consequence of this revision is that the conceptual differences between Tisane and existing software tools are more apparent.
An end-user expresses variables according to their data type. If the end-user later provides the data, the variable names should be the column names. For nominal or ordinal data, end-users also must specify the cardinality of variables if they do not intend to provide data. If end-users provide data, cardinality information is not required. In this case, Tisane will calculate and populate these fields internally.
Variables are observed values of a measure. Variables can be measures of interest, as in dependent and independent variables. Variables can also be id numbers that act as keys to a dataframe (e.g., participant id).
import tisane as ts
# Example 1:
hw = ts.Numeric('Homework') # 'Homework' is the column name
race = ts.Nominal('Race', cardinality=5) # there are 5 groups/options for the variable race
math = ts.Numeric('MathAchievement')
mean_ses = ts.Numeric('Mean_SES')
student = ts.Nominal('student id', cardinality=100) # IDs for the 100 students included in this study
school = ts.Nominal('school', cardinality=10) # IDs for schools, 10 students/school
# Example 2:
leaf_length = ts.Numeric('length')
fertilizer = ts.Nominal('fertilizer condition', cardinality=2)
season = ts.Nominal('season', cardinality=4)
plant = ts.Nominal('plant id')
bed = ts.Nominal('plant bed')
An end-user expresses relationships between variables that are related to domain theory (conceptual models) and data measurements.
There are two types of conceptual relationships: cause and associates_with
# Example 1
hw.cause(math) # Hours spent on homework causes math achievement.
race.associates_with(math) # Math scores and race are associated with each other.
# Example 2
fertilizer.cause(leaf_length) # Fertilizer causes leaf growth
Definitions:
cause: The LHS variable causes the RHS variable. The RHS variable cannot also cause the LHS variable.
associates_with: The LHS and RHS variables are associated/related in some way that is not causal.
Tisane provides aliases for both: causes and cause, and associate_with and associates_with.
There are three types of data measurement relationships: (1) measurement attribution, (2) treatment for experiments, and (3) data hierarchies.
# Example 1:
student.has(hw)
student.has(race)
student.has(math)
school.has(mean_ses)
# Example 2:
plant.has(leaf_length)
Definition:
has distinguishes "levels" of observations by attributing variables to each level. In Example 1, there are two levels: student and school. Each student has a value for homework, race, and math. Each school has a value for mean_ses.
Idea: Create a separate data type for "ID" and enforce that only variables of type "ID" can have other variables.
End-users can express experimental treatments/manipulations.
# Example 2:
fertilizer.treats(bed)
Only Example 2 is an experiment. Each bed is treated with a fertilizer. In other words, fertilizer is a bed-level manipulation.
Definition:
treats expresses the explicit/intentional manipulation of variables in an experiment. X.treats(Y) is internally equivalent to Y.has(X), which means that each Y has an observation for X.
Idea: Check that the LHS variable of treats has a causal relationship (in the graph) with the DV? And keep treats and has different from one another.
Data can be clustered or nested. Tisane provides support for expressing two possible sources of clustering: (1) repeated measures and (2) nested relationships.
# Example 1
student.nest_under(school) # Students belong to a school. Students within a school might also cluster more than between schools.
# Example 2
plant.nest_under(bed) # Plants belong in plant beds.
plant.repeats(measure=leaf_length, repetitions=season) # Repeatedly measure the same plant once per season
Definitions:
nest_under nests one variable under another.
repeats means the LHS variable provides multiple values of the measure. Each value is enumerated/indexed by the repetitions variable (e.g., season). If a plant provides multiple measures per season, another column for indexing each measure is required.
import tisane as ts
import pandas as pd
import os
FILE_NAME = "schools.csv"
dir = os.path.dirname(__file__)
df = pd.read_csv(os.path.join(dir, FILE_NAME))
# Initialize Units
school = ts.Unit("schid", cardinality=10)
student = ts.Unit("stuid", cardinality=96)
homework = student.ordinal("homework", order=[0, 1, 2, 3, 4, 5, 6, 7])
# school variables
school_size = school.ordinal("scsize", order=[2,3,4,6])
school_region = school.nominal("region")
school_type = school.ordinal("sctype", order=[1, 4])
public = school.nominal("public")
# Define relationships
public.causes(homework)
public.moderates(school_region, on=homework)
school_size.associates_with(homework)
school_type.associates_with(homework)
design = ts.Design(dv=school_size, ivs=[school_region]).assign_data(df)
ts.infer_statistical_model_from_design(design=design)
This was a bug.
It seems the issue was that we were deep-copying graphs and nodes when updating the graphs, so the objects referenced in the test cases were "older" objects that we had updated but that were not in the ConceptGraphs. I removed the deepcopies in the latest commit.
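A minimal reproduction of this class of bug, using hypothetical `Node` and `Graph` stand-ins. It assumes nodes are compared by identity, which is Python's default when a class defines no `__eq__`.

```python
import copy


class Node:
    """Hypothetical node; no __eq__, so comparison falls back to identity."""

    def __init__(self, name: str):
        self.name = name


class Graph:
    """Hypothetical graph that simply stores the node objects it is given."""

    def __init__(self):
        self.nodes = []

    def add(self, node: Node) -> None:
        self.nodes.append(node)

    def has_variable(self, node: Node) -> bool:
        return node in self.nodes


test_score = Node("Test score")

copying_graph = Graph()
copying_graph.add(copy.deepcopy(test_score))  # stores a new, distinct object

sharing_graph = Graph()
sharing_graph.add(test_score)                 # stores the caller's object

found_after_deepcopy = copying_graph.has_variable(test_score)
found_without_copy = sharing_graph.has_variable(test_score)
```

The copied node has the same name, but it is a different object, so a test holding the original reference cannot find it in the graph.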
Rather than requiring end-users to specify the categories in a nominal variable, for example, can we detect the categories and then ask an end-user to verify or at least assume that data is clean and then not require interaction?
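One possible shape for the detection step, sketched with the standard library only. The function name, the sample data, and the choice to skip empty cells are assumptions; a real implementation would likely work on the dataframe the end-user already supplied.

```python
import csv
import io
from typing import List


def detect_categories(csv_text: str, column: str) -> List[str]:
    """Distinct non-empty values of `column`, in order of first appearance."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = (row[column].strip() for row in reader)
    # dict.fromkeys keeps insertion order while dropping duplicates.
    return list(dict.fromkeys(v for v in values if v))


sample = "stuid,region\n1,north\n2,south\n3,north\n4,\n5,west\n"
categories = detect_categories(sample, "region")
cardinality = len(categories)
```

The detected list could then be shown to the end-user for confirmation, or accepted as-is when the data is assumed to be clean.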
The links for Jump to see a tutorial here and see some examples here appear broken on https://pypi.org/project/tisane/.
:)
API_OVERVIEW.md contains a "Has" relationship which is no longer supported.
Move some functions (e.g., get_concept_constraints and get_effect_set_constraints) into a helper file. These are not essential to KnowledgeBase class although they are helpful for using the KnowledgeBase
If a user declares concepts but not variables (esp. data type), ask the user for input rather than erroring out. --> Some kind of hook for checking and then asking for additional input.
Ex. When querying knowledge base, check that Variable is declared before verifying/solving. If Variable is not declared, interactivity
Elicit weights for edges that represent causes, correlations that want to include in the final statistical model and then maximize these?
Presenting differences between models (set of effects for each model may differ): Visual pattern matching (make similarities easy to look over, differences more prominent, something from NLP/error analysis lit?)
Milestones for interaction:
Low fidelity pass for interactivity: command line interaction (pdb)
Implementation ideas:
Right now, collecting and formatting assertions from properties of variables and effects sets is blended together into one step (see Tisane.collect_assertions and helper methods)
Seems to me that the right/better design would be:
collect assertions -> dict [in Tisane class where going to be controlling modeling?]
format assertions -> can pick ASP vs. SMT (hypothetically) although we are focusing on ASP [helper method]
query with assertions -> knowledge base [KnowledgeBase is just a wrapper around the ASP solving process]
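The three-step split could be sketched like this. Everything here is hypothetical: the function names, the dict layout, the ASP predicate names, and a `KnowledgeBase.query` that only joins the facts into a program where a real one would call an ASP solver.

```python
from typing import Dict, List


def collect_assertions(variables: Dict[str, str], effects: List[str]) -> Dict[str, list]:
    """Step 1: gather raw assertions as plain data, with no solver syntax."""
    return {
        "variables": sorted(variables.items()),
        "main_effects": list(effects),
    }


def format_as_asp(assertions: Dict[str, list]) -> List[str]:
    """Step 2: render the collected assertions as ASP facts."""
    facts = [f"variable({name},{dtype})." for name, dtype in assertions["variables"]]
    facts += [f"main_effect({name})." for name in assertions["main_effects"]]
    return facts


class KnowledgeBase:
    """Step 3: placeholder wrapper around the solving process."""

    def query(self, facts: List[str]) -> str:
        # Stand-in for solving: just assemble the program text.
        return "\n".join(facts)


assertions = collect_assertions({"hw": "numeric", "race": "nominal"}, ["hw"])
facts = format_as_asp(assertions)
program = KnowledgeBase().query(facts)
```

Because step 1 produces plain data, a different formatter (for example, one targeting SMT) could be swapped in without touching collection or querying.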
Pro:
In general, create tests for each class, make sure include tests for each function
causes and associates edges should be the same style (solid) and different from has (dotted), nests (dashed)
def get_causes_associates_tikz_graph(self) --> def get_causes_associates_tikz_graph(self, dv: AbstractVariable)
Similar to "sign posting" programming method
Start modeling --> returns intermediate (file, assertions collected so far)
Finish modeling --> takes start modeling output as input and incrementally calls solver; if no error, return final stdout/call return StatisticalModels
As of Dec. 03, relying primarily on conceptual graph to suggest main effects. Could consider additions or alternatives, such as :
DSL:
Interface:
Ideas:
Goal: Create an R version of Tisane
Considerations:
R API (user input code) --> Python script --> JSON --> Tisane GUI --> R Code (output statistical modeling code)
Note: The new part is R API --> Python script. The rest is already how Tisane (Python implementation) works.
Put another way, the goal is to "transpile" R into Python.
In all of these: Key thing is to control Python script execution through a bash script, which we can call from R.
As of January 18, 2022: I opt for Strategy 1 first because (i) I suspect the syntax of Tisane is likely to change more than the graph IR and (ii) outputting the graph to read it back in might not be necessary.
TODOS related to Tisane R:
12.03: Start trying SAT formulation using Z3
TODO:
Other:
In API_OVERVIEW.md, the description of the effect parameter to causes is effect: tisane.variable.AbstractVariable -- the cause data variable. Is effect not supposed to be the result of the cause-ing variable?
As of Jan. 5, 2021, the Effect Set uses strings of Concept names (due to how we generate effects from the conceptual graph, and ultimately how we create nodes in the concept graph).
To make it easier/facilitate more seamless modeling from Effect Set (see model function in Tisane class), it may make more sense to use concepts, rather than strings?