zalando / expan

Open-source Python library for statistical analysis of randomised control trials (A/B tests)

License: MIT License

Makefile 1.09% Python 97.97% Shell 0.39% Stan 0.56%
python statistics abtesting abtest statistical-analysis experimentation ab-testing causal-inference

expan's People

Contributors

aaron-mcdaid-zalando, daryadedik, domheger, gbordyugov, mkolarek, pangeran-bottor, perploug, robertmuil, s4826, sdia-zz, shansfolder

expan's Issues

No module named experiment and test_data

from expan.experiment import Experiment raises an ImportError: No module named experiment.
I have to change it to from expan.core.experiment import Experiment and it works.

Also, from tests.test_data import generate_random_data doesn't work; I have to use
from tests.tests_data import generate_random_data, but this raises another error: cannot import name generate_random_data.

sanity check whether the input data contains duplicated entities

The ExperimentData class can handle 2 kinds of input data as the metrics argument:

  • aggregated metrics: this should always be aggregated on the entity level
  • time-resolved metrics: this requires an additional column time_since_treatment in the input, and should always be aggregated per unique entity and time point
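
A sanity check for duplicates could then key on entity alone for aggregated metrics, and on (entity, time_since_treatment) for time-resolved metrics. A minimal sketch, assuming the metrics arrive as a pandas DataFrame; the column names are taken from the description above, not from the actual ExperimentData code:

def has_duplicate_entities(metrics):
    # `metrics` is expected to be a pandas DataFrame.
    # Aggregated metrics: at most one row per entity.
    # Time-resolved metrics: at most one row per (entity, time_since_treatment).
    keys = ['entity']
    if 'time_since_treatment' in metrics.columns:
        keys.append('time_since_treatment')
    return bool(metrics.duplicated(subset=keys).any())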

coverage % is misleading

We're getting very good coverage percentages out of runtests htmlcov, but the numbers are quite misleading: at least some of the unittests are not actually verifying the returned data properly. For example, test_experiment.test__feature_check__computation() doesn't verify the feature check of the feature 'feature'...

We should:

  • push all unittests as low as possible in the function call structure (e.g. check the feature check directly, not through the class-level interface) so that it is clearer what needs to be checked
  • when writing unittests, indicate clearly (if only in the function docstrings) whether some checks have not been implemented yet, or better:
  • do not call functions in a unittest unless you are sure you have comprehensively checked the return

In any case, when we are performing a complex operation on a data set (like a feature check of a dataframe), we must be careful to really check the return: otherwise all the code that gets hit by the called operation will be marked as 'covered' but will not actually have been tested!
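
A self-contained illustration of the trap (nothing here is actual ExpAn code): both tests below give compute_mean() 100% line coverage, but only the second one can fail and expose the bug.

def compute_mean(values):
    return sum(values) / (len(values) + 1)   # deliberately wrong

def test_compute_mean_runs():
    compute_mean([1, 2, 3])                  # lines are "covered", nothing is verified

def test_compute_mean_value():
    assert compute_mean([1, 2, 3]) == 2      # covered and actually tested (fails here)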

this is a very general problem... I'm sure others have had the same issue... I wonder how they've dealt with it?

allow constructing an Experiment from an ExperimentData

Currently, although it's a subclass, there is no way to construct an Experiment from an ExperimentData, which is a bit counter-intuitive.

Not really necessary (you can always just pass whatever you passed to the ExperimentData constructor directly to the Experiment), but we should allow this.

Low priority.
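
One possible shape for this, as a sketch only; the constructor arguments and attribute names below (metrics, metadata, baseline_variant) are assumptions rather than the real ExpAn signatures.

# Sketch only -- attribute and argument names are assumed, not taken from ExpAn.
class ExperimentData(object):
    def __init__(self, metrics, metadata):
        self.metrics = metrics
        self.metadata = metadata

class Experiment(ExperimentData):
    def __init__(self, baseline_variant, metrics, metadata):
        super(Experiment, self).__init__(metrics, metadata)
        self.baseline_variant = baseline_variant

    @classmethod
    def from_experiment_data(cls, data, baseline_variant):
        # Reuse what the ExperimentData instance already holds.
        return cls(baseline_variant, data.metrics, data.metadata)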

Formula of derived KPIs only supports ratios

The following code in Experiment.delta

# Rewrite every identifier in the formula into a column access on self.kpis,
# then evaluate the resulting expression string with eval()
pattern = '([a-zA-Z][0-9a-zA-Z_]*)'
self.kpis.loc[:,kpiName] = eval(re.sub(pattern, r'self.kpis.\1.astype(float)', dk['formula']))

will only work for formulas with simple ratios (and is also too hacky).

However, I found that some of the KPIs here require sum/count/>0, etc.
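
For comparison, a sketch of a less regex-dependent evaluation using pandas' own expression engine; the KPI column names are made up:

import pandas as pd

# DataFrame.eval() handles arithmetic and comparisons between columns without
# rewriting the formula via regex and calling Python's eval().
kpis = pd.DataFrame({'revenue': [10.0, 0.0, 5.0], 'orders': [2, 0, 1]})
kpis['revenue_per_order'] = kpis.eval('revenue / orders')   # simple ratio
kpis['converted'] = kpis.eval('orders > 0')                 # comparison-based derived KPI

Aggregations such as sum or count would still need an explicit groupby step on top of this, since eval() only covers row-wise expressions.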

Make baseline_variant mandatory in the metadata?

By doing so, we can deprecate the baseline_variant argument in the Experiment constructor, so that to create an Experiment object one only needs the kpi_df, feature_df and metadata, which is more intuitive.

Python3

Will there be a version for python3?

Optimizing the control flow from `Experiment` to `Results`

In the current implementation, there are three intermediate layers of steps that are nearly always executed upon invocation of a particular analysis method of the Experiment class:

  • first, a method of the Experiment class (such as Experiment.delta(), for example) is called,
  • then an analysis-specific function (local to experiment.py, such as _delta_all_variants()) is called from that method, and lastly,
  • this function calls a *_to_dataframe_* function, which is again analysis-specific.

Thus, the details of the analysis flow are spread over three independent levels of abstraction, which makes each of them rather inflexible and partly causes duplicated functionality and code.
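
A rough, heavily simplified skeleton of that flow; Experiment.delta() and _delta_all_variants() are the names mentioned above, while _delta_to_dataframe() only stands in for the analysis-specific *_to_dataframe_* step:

# Simplified skeleton of the three layers; bodies are placeholders.
class Experiment(object):
    def delta(self, kpis):                      # layer 1: public analysis method
        return _delta_all_variants(kpis)

def _delta_all_variants(kpis):                  # layer 2: module-level helper
    raw = [(kpi, 0.0) for kpi in kpis]          # placeholder computation
    return _delta_to_dataframe(raw)

def _delta_to_dataframe(raw):                   # layer 3: illustrative formatter name
    return dict(raw)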

Question about percentiles in Bayesian early stopping

As I understand from the code, the percentile values for the regular delta and the group sequential delta are based on t statistics:
e.g. in the result object, "pctile" could be (2.5, 50, 97.5) and "value" the corresponding t statistics (0.1, 0.8, 0.1).

On the other hand, the percentile values for the two Bayesian deltas use a fixed 0.95 credible interval.
In the result object, "pctile" is always ("lower", "upper") and "value" is the corresponding index of the posterior distribution.

I think these are two completely different concepts, and if I understood correctly, should we put them into different columns in the result object?

t-test for non-equal distribution variances

Current: for the t-test we assume that the two populations have equal variances. The case of non-equal variances is not implemented.
Possible solution: add a Welch's t-test implementation for the case of unequal population variances.

Note: this question should first be investigated: when is it necessary or helpful, and when is it not?
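
For reference, SciPy already exposes Welch's variant via the equal_var flag of ttest_ind; a small sketch on random data:

import numpy as np
from scipy import stats

# Welch's t-test vs. the pooled-variance t-test on the same two samples.
rng = np.random.RandomState(0)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.1, 3.0, size=200)   # different variance and sample size
t_pooled, p_pooled = stats.ttest_ind(x, y, equal_var=True)
t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)   # Welch's t-test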

KPIs arguments of Experiment.delta()

Hello,

I'm still puzzled by the way Experiment.delta() passes the list of KPIs to analyse to its daughter functions (i.e. fixed horizon, group sequential, and the two Bayes funcs).

More specifically:

  • Experiment.fixed_horizon_delta() expects the kpi_subset argument and at the same time fetches the list of KPIs from the passed res dataframe https://github.com/zalando/expan/blob/dev/expan/core/experiment.py#L456, after which it kills the metadata of res
  • all new early stopping functions explicitly expect a list of KPIs as kpis_to_analyse and don't really care about kpis_to_analyse in res.

Is there any hidden magic in that which I might be missing?

@jbao @mkolarek any ideas on that?

Expose crucial classes in main expan namespace

Currently, to import the main classes of expan's functionality, we need to either use prefixes:
import expan; expan.Experiment(...)
or we have to import the class through the intermediate folder name, which is not intuitive:
from expan.core.experiment import Experiment

Ideally we would allow from expan.experiment import Experiment, although another possibility (and easier to implement) would be from expan import Experiment.
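
The easier option amounts to re-exporting the classes from expan/__init__.py; a sketch (the ExperimentData import path is an assumption based on the expan.core imports used elsewhere on this page):

# expan/__init__.py -- sketch of the "from expan import Experiment" option
from expan.core.experiment import Experiment
# ExperimentData's module path is assumed here for illustration:
from expan.core.experimentdata import ExperimentData

Supporting from expan.experiment import Experiment would additionally need a small expan/experiment.py shim module that re-exports the class.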

Use pd.Timedelta for time_since_treatment

To represent duration in the time_since_treatment column, we have 2 options:

  • put numerical values into time_since_treatment and persist the time unit in the metadata [non-intuitive and probably error-prone]
  • represent duration directly with the pandas dtype Timedelta [preferred]
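
A minimal sketch of the preferred option (the source column name hours_since_treatment is made up):

import pandas as pd

# Represent the duration directly as a Timedelta instead of a bare number
# plus a unit stored in the metadata.
df = pd.DataFrame({'entity': [1, 2], 'hours_since_treatment': [4, 30]})
df['time_since_treatment'] = pd.to_timedelta(df['hours_since_treatment'], unit='h')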

rename fetcher to loader

Under data, we still use the 'fetcher' terminology, despite all the advanced data-loading now being elsewhere. We should rename them to data loaders, to make it clearer that all they're doing is reading from a file and loading into the ExperimentData structure.

Assumption of nan when computing weighted KPIs

In weighted KPIs, our implementation is based on the assumption that reference KPIs are never NaN.

We might have some miscomputation here: I saw from previous BQ tests that this column can have missing values. We may want to implement the weights in a way that does not rely on this assumption.
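
A minimal sketch of a NaN-tolerant weighting (the column names are illustrative, not the actual weighted-KPI implementation): entities with a missing reference KPI are excluded before the weights are computed, instead of being assumed away.

import numpy as np
import pandas as pd

df = pd.DataFrame({'kpi': [0.1, 0.4, 0.2],
                   'reference_kpi': [10.0, np.nan, 5.0]})
valid = df['reference_kpi'].notnull()
weights = df.loc[valid, 'reference_kpi'] / df.loc[valid, 'reference_kpi'].sum()
weighted_mean = (df.loc[valid, 'kpi'] * weights).sum()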

Rethink Results structure

The goal of the current implementation of the Results structure is to enable the user to easily concatenate the results of multiple analyses (say delta() + sga()).
This functionality hasn't been used much, so it would be a good idea to rethink the Results structure and maybe aim to simplify it.

Add P(uplift>0) as a statistic

This probability will be calculated from the existing result data, based on the percentiles and the normal assumption.
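
A sketch of that calculation under the normal assumption, assuming the 2.5% and 97.5% uplift percentiles are available from the existing results (the numbers below are made up):

from scipy.stats import norm

# Recover mean and standard deviation of the uplift from its 2.5% / 97.5%
# percentiles, then take the probability mass above zero.
p_low, p_high = -0.4, 1.2                       # illustrative percentile values
mean = (p_low + p_high) / 2.0
std = (p_high - p_low) / (2.0 * norm.ppf(0.975))
p_uplift_positive = norm.sf(0.0, loc=mean, scale=std)   # P(uplift > 0)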

frequency table in the chi square test doesn't respect the order of categories

e.g. when the counts of observed categories in both groups are as follows:

treat_counts
female     39364
male        6561
unknown     4474

control_counts
f               2
female     152099
m               2
male        24299
unknown     15084

the observed frequency table is:

observed_ct
female      male  unknown        f        m
0     0.0   39364.0      0.0   6561.0   4474.0
1     2.0  152099.0      2.0  24299.0  15084.0
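
A sketch of an order-preserving construction using the counts above: put both groups on the same category index before stacking them into the observed table.

import pandas as pd

treat_counts = pd.Series({'female': 39364, 'male': 6561, 'unknown': 4474})
control_counts = pd.Series({'f': 2, 'female': 152099, 'm': 2,
                            'male': 24299, 'unknown': 15084})

# Reindex both groups onto the union of categories so that each column of the
# observed table refers to the same category in both rows.
categories = treat_counts.index.union(control_counts.index)
observed_ct = pd.DataFrame([treat_counts.reindex(categories, fill_value=0),
                            control_counts.reindex(categories, fill_value=0)])

Row 0 is then the treatment group and row 1 the control group, with each column referring to the same category in both rows.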

What's the point of *_delta() methods accepting an (empty) Result instance?

In the current implementation, all *_delta() methods accept a res argument, being (oftentimes) an empty instance of the Result class. This instance is then updated with the results of the analysis.

On the other hand, other analysis methods, such as feature_check(), sga(), and trend() create a Result instance on the fly, populate it with results and return it.

What is the motivation behind these two different approaches?

reassess actual dependencies

Currently, requirements.txt is a bit over-zealous with versions: expan doesn't actually need the latest version (0.17) of scipy, for example.
To increase backward compatibility, we should go over what's actually required. Potentially a long-winded operation, I guess, but perhaps there's a tool to do it semi-automatically?

normal_difference duplicates some functionality of normal_percentiles

normal_difference calculates percentiles for the difference between normal distributions.

Since the difference between normal distributions is itself normally distributed, this functionality could be covered by transforming the input into the difference distribution and passing it on to normal_percentiles.
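
The underlying identity as a sketch (this shows the mathematical reduction only, not the actual normal_difference or normal_percentiles signatures): for independent X ~ N(m1, s1^2) and Y ~ N(m2, s2^2), X - Y ~ N(m1 - m2, s1^2 + s2^2), so percentiles of the difference are percentiles of a single normal.

from scipy.stats import norm

def difference_percentiles(m1, s1, m2, s2, percentiles=(2.5, 97.5)):
    # Percentiles of X - Y for independent X ~ N(m1, s1^2), Y ~ N(m2, s2^2).
    diff_mean = m1 - m2
    diff_std = (s1 ** 2 + s2 ** 2) ** 0.5
    return {p: norm.ppf(p / 100.0, loc=diff_mean, scale=diff_std)
            for p in percentiles}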

Percentiles value is lost during computing group_sequential_delta

The parameter percentiles is lost during the following stack of calls:
-> experiment.group_sequential_delta (has the input parameter percentiles)
-> early_stopping.group_sequential (loses the input parameter percentiles)
-> statistics.normal_difference (has the parameter percentiles again, but now we can only use the default value, since the percentiles value the user passed was lost in the previous step)

warnings no longer propagated to command line

It seems that the warnings, after having been put into the Results object, are not propagated to the calling context (like command line) so the user doesn't necessarily see them.
This should be reverted.

Support 'overall ratio' metrics (e.g. conversion rate/return rate) as opposed to per-entity ratios

In the case of conversion rate or return rate, the KPI can be defined either on the entity level or aggregated over all entities, and we probably want to support both.

After some discussion, we came up with the idea of reweighting the data of the individual entities to calculate the overall ratio statistics. This enables us to use the existing statistics.delta() function to calculate the overall ratio statistics (using the normal assumption or bootstrapping).

Calculating return rates

As an example, let's look at return rates, which are typically calculated (on the individual entity level) as:
$$\mathit{individual\_rr} = \frac{1}{n} \sum_{i=1}^n \frac{\text{ARTICLES-RETURNED}_i}{\text{ARTICLES-ORDERED}_i}$$

The overall ratio is a reweighting of individual_rr to reflect not the entities' contributions (e.g. the contribution per customer) but equal contributions of all articles to the return rate (i.e. the return rate on an overall article basis), which can be formulated as:
$$\mathit{overall\_rr} = \frac{\frac{1}{n} \sum_{i=1}^n \text{ARTICLES-RETURNED}_i}{\frac{1}{n} \sum_{i=1}^n \text{ARTICLES-ORDERED}_i}$$

Overall as reweighted Individual

One can calculate the overall_rr from the individual_rr using the following reweighting (easily proved with paper and pencil):
$$\mathit{overall\_rr} = \frac{1}{n} \sum_{i=1}^n \alpha_i \frac{\text{ARTICLES-RETURNED}_i}{\text{ARTICLES-ORDERED}_i}$$
with
$$\alpha_i = n \, \frac{\text{ARTICLES-ORDERED}_i}{\sum_{j=1}^n \text{ARTICLES-ORDERED}_j}$$

Weighted delta function

To have such functionality as a more generic approach in ExpAn, we can introduce a "weighted delta" function in statistics. Its inputs are:

  • the entity-based variables (such as $\frac{\text{ARTICLES-RETURNED}_i}{\text{ARTICLES-ORDERED}_i}$) - for treatment and control
  • a variable that specifies the quantities per entity (such as ARTICLES-ORDERED) - for treatment and control

With this input it calculates the α weights as described above and outputs the result of statistics.delta().
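
A sketch of the reweighting itself, in NumPy only; the resulting weighted per-entity values for treatment and control would then be fed into the existing statistics.delta() as described above.

import numpy as np

def overall_ratio_weights(ordered):
    # alpha_i = n * ordered_i / sum(ordered), as in the formula above.
    ordered = np.asarray(ordered, dtype=float)
    return len(ordered) * ordered / ordered.sum()

def weighted_per_entity_ratios(returned, ordered):
    # alpha_i * returned_i / ordered_i; the plain mean of these values equals
    # the overall ratio sum(returned) / sum(ordered).
    returned = np.asarray(returned, dtype=float)
    ordered = np.asarray(ordered, dtype=float)
    return overall_ratio_weights(ordered) * returned / ordered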

Results.to_json() implementation not flexible

The current implementation will break if certain changes are made to the Results dataframe structure (such as adding additional indexes). The main cause of the issue is the nested for loops used to traverse the dataframe; they can probably be replaced with a recursive function.
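
The nested loops could be replaced by something along these lines (a sketch only: the 'children' key and the way leaf rows are serialised are illustrative, not the actual to_json() schema): peel off one index level per call instead of hard-coding one loop per level.

def to_nested(df, levels):
    # `df` is expected to be a pandas DataFrame with a MultiIndex; `levels`
    # lists the index level names to nest by, outermost first.
    if not levels:
        # Deepest level reached: serialise the remaining rows as plain records.
        return df.reset_index(drop=True).to_dict(orient='records')
    level, rest = levels[0], levels[1:]
    return [{level: key, 'children': to_nested(group, rest)}
            for key, group in df.groupby(level=level)]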

Failing early stopping unit tests

Hi, on my Mac with Python 3.6.0, numpy 1.10.4, scipy 0.17.0, and pystan 2.14.0.0, I've got three failing unit tests:

one:


    def test_bayes_factor(self):
            """
        	Check the Bayes factor function.
        	"""
            stop,delta,CI,n_x,n_y,mu_x,mu_y = es.bayes_factor(self.rand_s1, self.rand_s2)
            self.assertEqual(stop, 1)
            self.assertAlmostEqual(delta, -0.15887364780635896)
>           self.assertAlmostEqual(CI['lower'], -0.24414725578976518)
E           AssertionError: -0.24359237356716665 != -0.24414725578976518 within 7 places

tests/tests_core/test_early_stopping.py:73: AssertionError

two:


    def test_bayes_precision(self):
            """
        	Check the bayes_precision function.
        	"""
            stop,delta,CI,n_x,n_y,mu_x,mu_y = es.bayes_precision(self.rand_s1, self.rand_s2)
            self.assertEqual(stop, 0)
            self.assertAlmostEqual(delta, -0.15887364780635896)
>           self.assertAlmostEqual(CI['lower'], -0.25165623415486293)
E           AssertionError: -0.25058790472284048 != -0.25165623415486293 within 7 places

tests/tests_core/test_early_stopping.py:93: AssertionError

and three:

    def test_bayes_precision_delta(self):
            """
    	    Check if Experiment.bayes_precision_delta() functions properly
    	    """
            # this should work
            self.assertTrue(isinstance(self.data, Experiment))  # check that the subclassing works
    
            self.assertTrue(self.data.baseline_variant == 'B')
    
            res = Results(None, metadata=self.data.metadata)
            result = self.data.bayes_precision_delta(result=res, kpis_to_analyse=['normal_same'])
    
            # check uplift
            df = result.statistic('delta', 'uplift', 'normal_same')
            np.testing.assert_almost_equal(df.loc[:, ('value', 'A')],
                                                                       np.array([0.033053]), decimal=5)
            # check stop
            df = result.statistic('delta', 'stop', 'normal_same')
            np.testing.assert_equal(df.loc[:, 'value'],
>                                                           np.array([[0, 0]]))

tests/tests_core/test_experiment.py:492: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

x = variant                                                  A    B
metric      subgroup_metric subgroup statistic pctile          
normal_same -               NaN      stop      NaN     1.0  0.0
y = array([[0, 0]]), err_msg = '', verbose = True

    def assert_array_equal(x, y, err_msg='', verbose=True):
        """
        Raises an AssertionError if two array_like objects are not equal.
    
        Given two array_like objects, check that the shape is equal and all
        elements of these objects are equal. An exception is raised at
        shape mismatch or conflicting values. In contrast to the standard usage
        in numpy, NaNs are compared like numbers, no assertion is raised if
        both objects have NaNs in the same positions.
    
        The usual caution for verifying equality with floating point numbers is
        advised.
    
        Parameters
        ----------
        x : array_like
            The actual object to check.
        y : array_like
            The desired, expected object.
        err_msg : str, optional
            The error message to be printed in case of failure.
        verbose : bool, optional
            If True, the conflicting values are appended to the error message.
    
        Raises
        ------
        AssertionError
            If actual and desired objects are not equal.
    
        See Also
        --------
        assert_allclose: Compare two array_like objects for equality with desired
                         relative and/or absolute precision.
        assert_array_almost_equal_nulp, assert_array_max_ulp, assert_equal
    
        Examples
        --------
        The first assert does not raise an exception:
    
        >>> np.testing.assert_array_equal([1.0,2.33333,np.nan],
        ...                               [np.exp(0),2.33333, np.nan])
    
        Assert fails with numerical inprecision with floats:
    
        >>> np.testing.assert_array_equal([1.0,np.pi,np.nan],
        ...                               [1, np.sqrt(np.pi)**2, np.nan])
        ...
        <type 'exceptions.ValueError'>:
        AssertionError:
        Arrays are not equal
        <BLANKLINE>
        (mismatch 50.0%)
         x: array([ 1.        ,  3.14159265,         NaN])
         y: array([ 1.        ,  3.14159265,         NaN])
    
        Use `assert_allclose` or one of the nulp (number of floating point values)
        functions for these cases instead:
    
        >>> np.testing.assert_allclose([1.0,np.pi,np.nan],
        ...                            [1, np.sqrt(np.pi)**2, np.nan],
        ...                            rtol=1e-10, atol=0)
    
        """
        assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
>                            verbose=verbose, header='Arrays are not equal')
E       AssertionError: 
E       Arrays are not equal
E       
E       (mismatch 50.0%)
E        x: array([[ 1.,  0.]])
E        y: array([[0, 0]])

z-test instead of t-test in the group sequential implementation

The group sequential implementation uses a z-test instead of a t-test, and the sample standard deviation is used in place of the population standard deviation. If the population standard deviation is unknown, it would be more reasonable to use a t-test instead of a z-test.
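
For illustration (the sample size is made up): the z-test always uses the normal quantile, whereas the t quantile is wider for small samples and converges to the normal one as n grows.

from scipy import stats

# Two-sided 95% critical values: z vs. t with n - 1 degrees of freedom.
n = 20
z_crit = stats.norm.ppf(0.975)          # ~1.96, independent of n
t_crit = stats.t.ppf(0.975, df=n - 1)   # ~2.09 for n = 20; approaches z_crit for large n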
