zalando / expan

Open-source Python library for statistical analysis of randomised control trials (A/B tests)

License: MIT License

Makefile 1.09% Python 97.97% Shell 0.39% Stan 0.56%
python statistics abtesting abtest statistical-analysis experimentation ab-testing causal-inference

expan's People

Contributors

aaron-mcdaid-zalando, daryadedik, domheger, gbordyugov, mkolarek, pangeran-bottor, perploug, robertmuil, s4826, sdia-zz, shansfolder

expan's Issues

No module named experiment and test_data

from expan.experiment import Experiment raises an ImportError: No module named experiment.
I have to change it to from expan.core.experiment import Experiment and it works.

Also, from tests.test_data import generate_random_data doesn't work; I have to use
from tests.tests_data import generate_random_data, but this raises another error: cannot import name generate_random_data.

sanity check whether the input data contains duplicated entities

The ExperimentData class can handle 2 kinds of input data as the metrics argument:

  • aggregated metrics: this should always be aggregated on the entity level
  • time-resolved metrics: this requires an additional column time_since_treatment in the input, and should always be aggregated per unique entity and time point
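
A sanity check for duplicates could then key on entity alone for aggregated metrics, and on (entity, time_since_treatment) for time-resolved metrics. A minimal sketch, assuming the metrics arrive as a pandas DataFrame; the column names are taken from the description above, not from the actual ExperimentData code:

def has_duplicate_entities(metrics):
    # `metrics` is expected to be a pandas DataFrame.
    # Aggregated metrics: at most one row per entity.
    # Time-resolved metrics: at most one row per (entity, time_since_treatment).
    keys = ['entity']
    if 'time_since_treatment' in metrics.columns:
        keys.append('time_since_treatment')
    return bool(metrics.duplicated(subset=keys).any())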

coverage % is misleading

We're getting very good coverage percentages out of runtests htmlcov, but the numbers are quite misleading: at least some of the unittests are not actually verifying the returned data properly. For example, test_experiment.test__feature_check__computation() doesn't verify the feature check of the feature 'feature'...

We should:

  • push all unittests as low as possible in the function call structure (e.g. check the feature check directly, not through the class-level interface) so that it is clearer what needs to be checked
  • when writing unittests, indicate clearly (if only in the function docstrings) whether some checks have not been implemented yet, or better:
  • do not call functions in a unittest unless you are sure you have comprehensively checked the return

In any case, when we are performing a complex operation on a data set (like a feature check of a dataframe), we must be careful to really check the return: otherwise all the code that gets hit by the called operation will be marked as 'covered' but will not actually have been tested!
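
A self-contained illustration of the trap (nothing here is actual ExpAn code): both tests below give compute_mean() 100% line coverage, but only the second one can fail and expose the bug.

def compute_mean(values):
    return sum(values) / (len(values) + 1)   # deliberately wrong

def test_compute_mean_runs():
    compute_mean([1, 2, 3])                  # lines are "covered", nothing is verified

def test_compute_mean_value():
    assert compute_mean([1, 2, 3]) == 2      # covered and actually tested (fails here)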

this is a very general problem... I'm sure others have had the same issue... I wonder how they've dealt with it?

allow constructing an Experiment from an ExperimentData

Currently, although it's a subclass, there is no way to construct an Experiment from an ExperimentData, which is a bit counter-intuitive.

Not really necessary (you can always just pass whatever you passed to the ExperimentData constructor directly to the Experiment), but we should allow this.

Low priority.
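
One possible shape for this, as a sketch only; the constructor arguments and attribute names below (metrics, metadata, baseline_variant) are assumptions rather than the real ExpAn signatures.

# Sketch only -- attribute and argument names are assumed, not taken from ExpAn.
class ExperimentData(object):
    def __init__(self, metrics, metadata):
        self.metrics = metrics
        self.metadata = metadata

class Experiment(ExperimentData):
    def __init__(self, baseline_variant, metrics, metadata):
        super(Experiment, self).__init__(metrics, metadata)
        self.baseline_variant = baseline_variant

    @classmethod
    def from_experiment_data(cls, data, baseline_variant):
        # Reuse what the ExperimentData instance already holds.
        return cls(baseline_variant, data.metrics, data.metadata)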

Formula of derived KPIs only supports ratios

The following code in Experiment.delta

# Rewrite every identifier in the formula into a column access on self.kpis,
# then evaluate the resulting expression string with eval()
pattern = '([a-zA-Z][0-9a-zA-Z_]*)'
self.kpis.loc[:,kpiName] = eval(re.sub(pattern, r'self.kpis.\1.astype(float)', dk['formula']))

will only work for formulas with simple ratios (and is also too hacky).

However, I found that some of the KPIs here require sum/count/>0, etc.
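
For comparison, a sketch of a less regex-dependent evaluation using pandas' own expression engine; the KPI column names are made up:

import pandas as pd

# DataFrame.eval() handles arithmetic and comparisons between columns without
# rewriting the formula via regex and calling Python's eval().
kpis = pd.DataFrame({'revenue': [10.0, 0.0, 5.0], 'orders': [2, 0, 1]})
kpis['revenue_per_order'] = kpis.eval('revenue / orders')   # simple ratio
kpis['converted'] = kpis.eval('orders > 0')                 # comparison-based derived KPI

Aggregations such as sum or count would still need an explicit groupby step on top of this, since eval() only covers row-wise expressions.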

Make baseline_variant mandatory in the metadata?

By doing so, we can deprecate the baseline_variant argument in the Experiment constructor, so that to create an Experiment object one only needs the kpi_df, feature_df and metadata, which is more intuitive.

Python3

Will there be a version for python3?

Optimizing the control flow from `Experiment` to `Results`

In the current implementation, there are three intermediate layers of steps that are nearly always executed upon invocation of a particular analysis method of the Experiment class:

  • first, a method of the Experiment class (such as Experiment.delta(), for example) is called,
  • then an analysis-specific function (local to experiment.py, such as _delta_all_variants()) is called from that method, and lastly,
  • this function calls a *_to_dataframe_* function, which is again analysis-specific.

Thus, the details of the analysis flow are spread over three independent levels of abstraction, which makes each of them rather inflexible and partly causes duplicated functionality and code.
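
A rough, heavily simplified skeleton of that flow; Experiment.delta() and _delta_all_variants() are the names mentioned above, while _delta_to_dataframe() only stands in for the analysis-specific *_to_dataframe_* step:

# Simplified skeleton of the three layers; bodies are placeholders.
class Experiment(object):
    def delta(self, kpis):                      # layer 1: public analysis method
        return _delta_all_variants(kpis)

def _delta_all_variants(kpis):                  # layer 2: module-level helper
    raw = [(kpi, 0.0) for kpi in kpis]          # placeholder computation
    return _delta_to_dataframe(raw)

def _delta_to_dataframe(raw):                   # layer 3: illustrative formatter name
    return dict(raw)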

Question about percentiles in Bayesian early stopping

As I understand from the code, the percentile values for the regular delta and the group sequential delta are based on t statistics:
e.g. in the result object, "pctile" could be (2.5, 50, 97.5) and "value" the corresponding t statistics (0.1, 0.8, 0.1).

On the other hand, the percentile values for the two Bayesian deltas use a fixed 0.95 credible interval.
In the result object, "pctile" is always ("lower", "upper") and "value" is the corresponding index of the posterior distribution.

I think these are two completely different concepts, and if I understood correctly, should we put them into different columns in the result object?

t-test for non-equal distribution variances

Current: for the t-test we assume that the two populations have equal variances. The case of non-equal variances is not implemented.
Possible solution: add a Welch's t-test implementation for the case of unequal population variances.

Note: this question should first be investigated: when is it necessary or helpful, and when is it not?
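
For reference, SciPy already exposes Welch's variant via the equal_var flag of ttest_ind; a small sketch on random data:

import numpy as np
from scipy import stats

# Welch's t-test vs. the pooled-variance t-test on the same two samples.
rng = np.random.RandomState(0)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.1, 3.0, size=200)   # different variance and sample size
t_pooled, p_pooled = stats.ttest_ind(x, y, equal_var=True)
t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)   # Welch's t-test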

KPIs arguments of Experiment.delta()

Hello,

I'm still puzzled by the way Experiment.delta() passes the list of KPIs to analyse to its daughter functions (i.e. fixed horizon, group sequential, and the two Bayes funcs).

More specifically:

  • Experiment.fixed_horizon_delta() expects the kpi_subset argument and at the same time fetches the list of KPIs from the passed res dataframe https://github.com/zalando/expan/blob/dev/expan/core/experiment.py#L456, after which it kills the metadata of res
  • all new early stopping functions explicitly expect a list of KPIs as kpis_to_analyse and don't really care about kpis_to_analyse in res.

Is there any hidden magic in that which I might be missing?

@jbao @mkolarek any ideas on that?

Expose crucial classes in main expan namespace

Currently, to import the main classes of expan's functionality, we need to either use prefixes:
import expan; expan.Experiment(...)
or we have to import the class through the intermediate folder name, which is not intuitive:
from expan.core.experiment import Experiment

Ideally we would allow from expan.experiment import Experiment, although another possibility (and easier to implement) would be from expan import Experiment.
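
The easier option amounts to re-exporting the classes from expan/__init__.py; a sketch (the ExperimentData import path is an assumption based on the expan.core imports used elsewhere on this page):

# expan/__init__.py -- sketch of the "from expan import Experiment" option
from expan.core.experiment import Experiment
# ExperimentData's module path is assumed here for illustration:
from expan.core.experimentdata import ExperimentData

Supporting from expan.experiment import Experiment would additionally need a small expan/experiment.py shim module that re-exports the class.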

Use pd.Timedelta for time_since_treatment

To represent duration in the time_since_treatment column, we have 2 options:

  • put numerical values into time_since_treatment and persist the time unit in the metadata [non-intuitive and probably error-prone]
  • represent duration directly with the pandas dtype Timedelta [preferred]
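
A minimal sketch of the preferred option (the source column name hours_since_treatment is made up):

import pandas as pd

# Represent the duration directly as a Timedelta instead of a bare number
# plus a unit stored in the metadata.
df = pd.DataFrame({'entity': [1, 2], 'hours_since_treatment': [4, 30]})
df['time_since_treatment'] = pd.to_timedelta(df['hours_since_treatment'], unit='h')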

rename fetcher to loader

Under data, we still use the 'fetcher' terminology, despite all the advanced data-loading now being elsewhere. We should rename them to data loaders, to make it clearer that all they're doing is reading from a file and loading into the ExperimentData structure.

Assumption of nan when computing weighted KPIs

In weighted KPIs, our implementation is based on the assumption that reference KPIs are never NaN.

We might have some miscomputation here: I saw from previous BQ tests that this column can have missing values. We may want to implement the weights in a way that does not rely on this assumption.
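
A minimal sketch of a NaN-tolerant weighting (the column names are illustrative, not the actual weighted-KPI implementation): entities with a missing reference KPI are excluded before the weights are computed, instead of being assumed away.

import numpy as np
import pandas as pd

df = pd.DataFrame({'kpi': [0.1, 0.4, 0.2],
                   'reference_kpi': [10.0, np.nan, 5.0]})
valid = df['reference_kpi'].notnull()
weights = df.loc[valid, 'reference_kpi'] / df.loc[valid, 'reference_kpi'].sum()
weighted_mean = (df.loc[valid, 'kpi'] * weights).sum()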

Rethink Results structure

The goal of the current implementation of the Results structure is to enable the user to easily concatenate the results of multiple analyses (say delta() + sga()).
This functionality hasn't been used much, so it would be a good idea to rethink the Results structure and maybe aim to simplify it.

Add P(uplift>0) as a statistic

This probability will be calculated from the existing result data, based on the percentiles and the normal assumption.
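
A sketch of that calculation under the normal assumption, assuming the 2.5% and 97.5% uplift percentiles are available from the existing results (the numbers below are made up):

from scipy.stats import norm

# Recover mean and standard deviation of the uplift from its 2.5% / 97.5%
# percentiles, then take the probability mass above zero.
p_low, p_high = -0.4, 1.2                       # illustrative percentile values
mean = (p_low + p_high) / 2.0
std = (p_high - p_low) / (2.0 * norm.ppf(0.975))
p_uplift_positive = norm.sf(0.0, loc=mean, scale=std)   # P(uplift > 0)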

frequency table in the chi square test doesn't respect the order of categories

e.g. when the counts of observed categories in both groups are as follows:

treat_counts
female     39364
male        6561
unknown     4474

control_counts
f               2
female     152099
m               2
male        24299
unknown     15084

the observed frequency table is:

observed_ct
female      male  unknown        f        m
0     0.0   39364.0      0.0   6561.0   4474.0
1     2.0  152099.0      2.0  24299.0  15084.0
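
A sketch of an order-preserving construction using the counts above: put both groups on the same category index before stacking them into the observed table.

import pandas as pd

treat_counts = pd.Series({'female': 39364, 'male': 6561, 'unknown': 4474})
control_counts = pd.Series({'f': 2, 'female': 152099, 'm': 2,
                            'male': 24299, 'unknown': 15084})

# Reindex both groups onto the union of categories so that each column of the
# observed table refers to the same category in both rows.
categories = treat_counts.index.union(control_counts.index)
observed_ct = pd.DataFrame([treat_counts.reindex(categories, fill_value=0),
                            control_counts.reindex(categories, fill_value=0)])

Row 0 is then the treatment group and row 1 the control group, with each column referring to the same category in both rows.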

What's the point of *_delta() methods accepting an (empty) Result instance?

In the current implementation, all *_delta() methods accept a res argument, being (oftentimes) an empty instance of the Result class. This instance is then updated with the results of the analysis.

On the other hand, other analysis methods, such as feature_check(), sga(), and trend() create a Result instance on the fly, populate it with results and return it.

What is the motivation behind these two different approaches?

reassess actual dependencies

Currently, requirements.txt is a bit over-zealous with versions: expan doesn't actually need the latest version (0.17) of scipy, for example.
To increase backward compatibility, we should go over what's actually required. Potentially a long-winded operation, I guess, but perhaps there's a tool to do it semi-automatically?

normal_difference duplicates some functionality of normal_percentiles

normal_difference calculates percentiles for the difference between normal distributions.

Since the difference between normal distributions is itself normally distributed, this functionality could be covered by transforming the input into the difference distribution and passing it on to normal_percentiles.
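
The underlying identity as a sketch (this shows the mathematical reduction only, not the actual normal_difference or normal_percentiles signatures): for independent X ~ N(m1, s1^2) and Y ~ N(m2, s2^2), X - Y ~ N(m1 - m2, s1^2 + s2^2), so percentiles of the difference are percentiles of a single normal.

from scipy.stats import norm

def difference_percentiles(m1, s1, m2, s2, percentiles=(2.5, 97.5)):
    # Percentiles of X - Y for independent X ~ N(m1, s1^2), Y ~ N(m2, s2^2).
    diff_mean = m1 - m2
    diff_std = (s1 ** 2 + s2 ** 2) ** 0.5
    return {p: norm.ppf(p / 100.0, loc=diff_mean, scale=diff_std)
            for p in percentiles}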

Percentiles value is lost during computing group_sequential_delta

The parameter percentiles is lost during the following stack of calls:
-> experiment.group_sequential_delta (has the input parameter percentiles)
-> early_stopping.group_sequential (loses the input parameter percentiles)
-> statistics.normal_difference (has the parameter percentiles again, but now we can only use the default value, since the percentiles value the user passed was lost in the previous step)

warnings no longer propagated to command line

It seems that the warnings, after having been put into the Results object, are not propagated to the calling context (like command line) so the user doesn't necessarily see them.
This should be reverted.

Support 'overall ratio' metrics (e.g. conversion rate/return rate) as opposed to per-entity ratios

In the case of conversion rate or return rate, the KPI can be defined either on the entity level or aggregated over all entities, and we probably want to support both.

After some discussion, we came up with the idea of reweighting the data of the individual entities to calculate the overall ratio statistics. This enables us to use the existing statistics.delta() function to calculate the overall ratio statistics (using the normal assumption or bootstrapping).

Calculating return rates

As an example, let's look at return rates, which are typically calculated (on the individual entity level) as:
$$\mathit{individual\_rr} = \frac{1}{n} \sum_{i=1}^n \frac{\text{ARTICLES-RETURNED}_i}{\text{ARTICLES-ORDERED}_i}$$

The overall ratio is a reweighting of individual_rr to reflect not the entities' contributions (e.g. the contribution per customer) but equal contributions of all articles to the return rate (i.e. the return rate on an overall article basis), which can be formulated as:
$$\mathit{overall\_rr} = \frac{\frac{1}{n} \sum_{i=1}^n \text{ARTICLES-RETURNED}_i}{\frac{1}{n} \sum_{i=1}^n \text{ARTICLES-ORDERED}_i}$$

Overall as reweighted Individual

One can calculate the overall_rr from the individual_rr using the following reweighting (easily proved with paper and pencil):
$$\mathit{overall\_rr} = \frac{1}{n} \sum_{i=1}^n \alpha_i \frac{\text{ARTICLES-RETURNED}_i}{\text{ARTICLES-ORDERED}_i}$$
with
$$\alpha_i = n \, \frac{\text{ARTICLES-ORDERED}_i}{\sum_{j=1}^n \text{ARTICLES-ORDERED}_j}$$

Weighted delta function

To have such functionality as a more generic approach in ExpAn, we can introduce a "weighted delta" function in statistics. Its inputs are:

  • the entity-based variables (such as $\frac{\text{ARTICLES-RETURNED}_i}{\text{ARTICLES-ORDERED}_i}$) - for treatment and control
  • a variable that specifies the quantities per entity (such as ARTICLES-ORDERED) - for treatment and control

With this input it calculates the α weights as described above and outputs the result of statistics.delta().
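
A sketch of the reweighting itself, in NumPy only; the resulting weighted per-entity values for treatment and control would then be fed into the existing statistics.delta() as described above.

import numpy as np

def overall_ratio_weights(ordered):
    # alpha_i = n * ordered_i / sum(ordered), as in the formula above.
    ordered = np.asarray(ordered, dtype=float)
    return len(ordered) * ordered / ordered.sum()

def weighted_per_entity_ratios(returned, ordered):
    # alpha_i * returned_i / ordered_i; the plain mean of these values equals
    # the overall ratio sum(returned) / sum(ordered).
    returned = np.asarray(returned, dtype=float)
    ordered = np.asarray(ordered, dtype=float)
    return overall_ratio_weights(ordered) * returned / ordered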

Results.to_json() implementation not flexible

The current implementation will break if certain changes are made to the Results dataframe structure (such as adding additional indexes). The main cause of the issue is the nested for loops used to traverse the dataframe; they can probably be replaced with a recursive function.
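
The nested loops could be replaced by something along these lines (a sketch only: the 'children' key and the way leaf rows are serialised are illustrative, not the actual to_json() schema): peel off one index level per call instead of hard-coding one loop per level.

def to_nested(df, levels):
    # `df` is expected to be a pandas DataFrame with a MultiIndex; `levels`
    # lists the index level names to nest by, outermost first.
    if not levels:
        # Deepest level reached: serialise the remaining rows as plain records.
        return df.reset_index(drop=True).to_dict(orient='records')
    level, rest = levels[0], levels[1:]
    return [{level: key, 'children': to_nested(group, rest)}
            for key, group in df.groupby(level=level)]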

Failing early stopping unit tests

Hi, on my Mac with Python 3.6.0, numpy 1.10.4, scipy 0.17.0, and pystan 2.14.0.0, I've got three failing unit tests:

one:


    def test_bayes_factor(self):
            """
        	Check the Bayes factor function.
        	"""
            stop,delta,CI,n_x,n_y,mu_x,mu_y = es.bayes_factor(self.rand_s1, self.rand_s2)
            self.assertEqual(stop, 1)
            self.assertAlmostEqual(delta, -0.15887364780635896)
>           self.assertAlmostEqual(CI['lower'], -0.24414725578976518)
E           AssertionError: -0.24359237356716665 != -0.24414725578976518 within 7 places

tests/tests_core/test_early_stopping.py:73: AssertionError

two:


    def test_bayes_precision(self):
            """
        	Check the bayes_precision function.
        	"""
            stop,delta,CI,n_x,n_y,mu_x,mu_y = es.bayes_precision(self.rand_s1, self.rand_s2)
            self.assertEqual(stop, 0)
            self.assertAlmostEqual(delta, -0.15887364780635896)
>           self.assertAlmostEqual(CI['lower'], -0.25165623415486293)
E           AssertionError: -0.25058790472284048 != -0.25165623415486293 within 7 places

tests/tests_core/test_early_stopping.py:93: AssertionError

and three:

    def test_bayes_precision_delta(self):
            """
    	    Check if Experiment.bayes_precision_delta() functions properly
    	    """
            # this should work
            self.assertTrue(isinstance(self.data, Experiment))  # check that the subclassing works
    
            self.assertTrue(self.data.baseline_variant == 'B')
    
            res = Results(None, metadata=self.data.metadata)
            result = self.data.bayes_precision_delta(result=res, kpis_to_analyse=['normal_same'])
    
            # check uplift
            df = result.statistic('delta', 'uplift', 'normal_same')
            np.testing.assert_almost_equal(df.loc[:, ('value', 'A')],
                                                                       np.array([0.033053]), decimal=5)
            # check stop
            df = result.statistic('delta', 'stop', 'normal_same')
            np.testing.assert_equal(df.loc[:, 'value'],
>                                                           np.array([[0, 0]]))

tests/tests_core/test_experiment.py:492: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

x = variant                                                  A    B
metric      subgroup_metric subgroup statistic pctile          
normal_same -               NaN      stop      NaN     1.0  0.0
y = array([[0, 0]]), err_msg = '', verbose = True

    def assert_array_equal(x, y, err_msg='', verbose=True):
        """
        Raises an AssertionError if two array_like objects are not equal.
    
        Given two array_like objects, check that the shape is equal and all
        elements of these objects are equal. An exception is raised at
        shape mismatch or conflicting values. In contrast to the standard usage
        in numpy, NaNs are compared like numbers, no assertion is raised if
        both objects have NaNs in the same positions.
    
        The usual caution for verifying equality with floating point numbers is
        advised.
    
        Parameters
        ----------
        x : array_like
            The actual object to check.
        y : array_like
            The desired, expected object.
        err_msg : str, optional
            The error message to be printed in case of failure.
        verbose : bool, optional
            If True, the conflicting values are appended to the error message.
    
        Raises
        ------
        AssertionError
            If actual and desired objects are not equal.
    
        See Also
        --------
        assert_allclose: Compare two array_like objects for equality with desired
                         relative and/or absolute precision.
        assert_array_almost_equal_nulp, assert_array_max_ulp, assert_equal
    
        Examples
        --------
        The first assert does not raise an exception:
    
        >>> np.testing.assert_array_equal([1.0,2.33333,np.nan],
        ...                               [np.exp(0),2.33333, np.nan])
    
        Assert fails with numerical inprecision with floats:
    
        >>> np.testing.assert_array_equal([1.0,np.pi,np.nan],
        ...                               [1, np.sqrt(np.pi)**2, np.nan])
        ...
        <type 'exceptions.ValueError'>:
        AssertionError:
        Arrays are not equal
        <BLANKLINE>
        (mismatch 50.0%)
         x: array([ 1.        ,  3.14159265,         NaN])
         y: array([ 1.        ,  3.14159265,         NaN])
    
        Use `assert_allclose` or one of the nulp (number of floating point values)
        functions for these cases instead:
    
        >>> np.testing.assert_allclose([1.0,np.pi,np.nan],
        ...                            [1, np.sqrt(np.pi)**2, np.nan],
        ...                            rtol=1e-10, atol=0)
    
        """
        assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
>                            verbose=verbose, header='Arrays are not equal')
E       AssertionError: 
E       Arrays are not equal
E       
E       (mismatch 50.0%)
E        x: array([[ 1.,  0.]])
E        y: array([[0, 0]])

z-test instead of t-test in the group sequential implementation

The group sequential implementation uses a z-test instead of a t-test, and the sample standard deviation is used in place of the population standard deviation. If the population standard deviation is unknown, it would be more reasonable to use a t-test instead of a z-test.
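
For illustration (the sample size is made up): the z-test always uses the normal quantile, whereas the t quantile is wider for small samples and converges to the normal one as n grows.

from scipy import stats

# Two-sided 95% critical values: z vs. t with n - 1 degrees of freedom.
n = 20
z_crit = stats.norm.ppf(0.975)          # ~1.96, independent of n
t_crit = stats.t.ppf(0.975, df=n - 1)   # ~2.09 for n = 20; approaches z_crit for large n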
