mobiletelesystems / ambrosia

Ambrosia is a Python library for A/B tests design, split and result measurement

License: Apache License 2.0

Makefile 0.39% Python 99.61%
ab-testing experiment-design split-testing statistical-inference statistics

ambrosia's Introduction

Ambrosia


Ambrosia is a Python library for A/B test design, splitting, and effect measurement. It provides a rich set of methods for conducting the full A/B testing pipeline.

The project is intended for use in research and production environments based on data in pandas and Spark format.

Key functionality

  • Pilots design 🛫
  • Multi-group split 🎳
  • Matching of new control group to the existing pilot 🎏
  • Experiments result evaluation as p-value, point estimate of effect and confidence interval 🎞
  • Data preprocessing ✂️
  • Experiments acceleration 🎢

Documentation

For more details, see the Documentation and Tutorials.

Installation

You can always get the newest Ambrosia release using pip. A stable version is released on every tag to the main branch.

pip install ambrosia 

Starting from version 0.4.0, the ability to process PySpark data is optional and can be enabled using pip extras during the installation.

pip install ambrosia[spark]

Usage

The main functionality of Ambrosia is contained in several core classes and methods, which are autonomous for each stage of an experiment and have an intuitive interface.

Below is a brief overview of using a set of three classes to conduct a simple experiment.

Designer

from ambrosia.designer import Designer
designer = Designer(dataframe=df, effects=1.2, metrics='portfel_clc') # 20% effect, and loaded data frame df
designer.run('size') 

Splitter

from ambrosia.splitter import Splitter
splitter = Splitter(dataframe=df, id_column='id') # loaded data frame df with column with id - 'id'
splitter.run(groups_size=500, method='simple') 

Tester

from ambrosia.tester import Tester
tester = Tester(dataframe=df, column_groups='group') # loaded data frame df with groups info 'group'
tester.run(metrics='retention', method='theory', criterion='ttest')

Development

To install all requirements run

make install

You must have python3 and poetry installed.

For autoformatting run

make autoformat

For linter checks run

make lint

For tests run

make test

For coverage run

make coverage

To remove virtual environment run

make clean

Authors

Developers and evangelists:

  • aslanbm
  • victorfromchoback
  • xandaau


ambrosia's Issues

`Tester` PySpark data support

This issue was created to track the development of PySpark support for methods of the Tester class.

The current functionality of the Tester does not support any operations on Spark data. However, this is very important for big data scenarios, and given that we already have PySpark support for the Designer and the Splitter, such a Tester enhancement seems vital for us.

In my opinion we should focus on these two points:

  1. Developing a PySpark implementation of the stat criteria (essential for the Designer and others) #19
  2. Optimizing the execution speed of Tester methods

`Splitter` usage in `Designer`

The Designer class uses its own methods for generating subsamples in the empirical approach.
Let's think about switching to the Splitter class for subsample generation inside the Designer.

Currently, I see the following advantages of this choice:

  1. The code will be more consistent and clear, because the Splitter is designed specifically to generate subsamples (actually, sets of subgroups).
  2. In the future, there will be no need to duplicate the same features for split and design methods.
  3. Simultaneous refactoring of old and duplicated pieces of code in tools.py and other modules.
  4. We could pass a custom configuration for the Splitter instance inside the methods, which would help keep the empirical design more flexible and correct for custom splitting.

The cons should be considered as well; for example, there may be problems with generating a large number of group pairs with the current structure of the Splitter.

`Splitter` unequal group sizes split

The current Splitter functionality allows one to generate n groups of equal size m from a dataframe using different split methods.

But several types of experiments require splits with unequal group sizes. It would be nice to implement this feature in the Splitter tools.
As one of the options, group sizes could be controlled using a modified groups_size parameter (from the run method) of length n, with different values for the group sizes inside, for example: [1000, 100, 100]. If this parameter is a single number, the split will be made into groups of the same size.

In the future, this feature can also be used by the Designer for empirical design of unequally sized groups, if we decide to integrate the Splitter into Designer methods #16.
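The proposed parameter shape can be sketched outside Ambrosia with plain numpy and pandas (the list-valued groups_size is a proposal, not the current API; the function name is illustrative):

```python
import numpy as np
import pandas as pd

def split_unequal(dataframe: pd.DataFrame, groups_size: list, seed: int = 42) -> pd.DataFrame:
    """Assign shuffled rows to groups 'A', 'B', ... of the requested (possibly unequal) sizes."""
    if sum(groups_size) > len(dataframe):
        raise ValueError("Requested group sizes exceed the dataframe length")
    rng = np.random.default_rng(seed)
    positions = rng.permutation(len(dataframe))
    parts, start = [], 0
    for label, size in zip("ABCDEFGH", groups_size):
        part = dataframe.iloc[positions[start:start + size]].copy()
        part["group"] = label
        parts.append(part)
        start += size
    return pd.concat(parts)

df = pd.DataFrame({"id": range(2_000)})
result = split_unequal(df, groups_size=[1000, 100, 100])
```

A single number would then simply expand to a list of n equal values before this step.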

`PySpark` support for `Cuped` class

Part of a bigger issue on preprocessing enhancement.

To speed up A/B tests on big data, it is necessary to implement support for PySpark dataframes in the Cuped class.
We also need to think about how to structurally decompose the code shared between the PySpark and pandas implementations.
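For reference, the pandas-side CUPED adjustment is only a couple of lines; a PySpark version would need the same covariance and variance aggregates computed distributively (column names here are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
pre = rng.normal(10, 2, size=10_000)        # pre-experimental covariate
post = pre + rng.normal(0, 1, size=10_000)  # experiment-period metric
df = pd.DataFrame({"metric_prepilot": pre, "metric": post})

# theta = cov(metric, covariate) / var(covariate) minimises the adjusted variance
theta = df["metric"].cov(df["metric_prepilot"]) / df["metric_prepilot"].var()
df["metric_cuped"] = df["metric"] - theta * (df["metric_prepilot"] - df["metric_prepilot"].mean())
```

The adjusted metric keeps the mean of the original one while its variance shrinks, which is what accelerates the experiment.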

Multiple comparisons problem corrections

In the Tester class, when one uses multiple experimental groups or several metrics, only the Bonferroni correction is supported.

It would be useful to implement more complex and popular classic corrections for the MCP (Holm, Benjamini–Hochberg, etc.).
It should be noted that the current structure of `Tester` may not be convenient for adding these corrections, so the main class code will need to change.
These corrections should be discussed before implementation, as well as the problem of correct confidence interval calculation.
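Both corrections fit in a few lines over a vector of p-values; a plain-numpy sketch of the adjusted p-values (statsmodels' multipletests provides the same):

```python
import numpy as np

def holm(pvalues):
    """Holm step-down adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # multiply sorted p-values by m, m-1, ..., 1 and enforce monotonicity
    adj = np.maximum.accumulate((m - np.arange(m)) * p[order])
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out

def benjamini_hochberg(pvalues):
    """Benjamini–Hochberg step-up adjusted p-values."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = m / np.arange(1, m + 1) * p[order]
    # cumulative minimum from the largest p-value downwards
    adj = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adj, 1.0)
    return out
```

The open question for `Tester` is less the arithmetic than how to thread the adjusted p-values through result reporting and keep confidence intervals consistent with the chosen correction.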

Usage examples expanding and reorganization

The current set of usage examples is not self-sufficient or detailed enough for some users.
Therefore, it should be expanded and reorganized.

I am thinking about the following schema:

  1. Detailed preprocessing classes examples
  2. Detailed variance reduction techniques examples
  3. Detailed experiment designing example
  4. Detailed groups splitting example
  5. Detailed experiment effect estimation example
  6. Full A/B pipeline example
  7. Spark API examples (the current functionality is not broad enough for separate examples; we can do this in the future)

Approximation-based Designs for Binary Data

Currently, Ambrosia supports only simulation-based power calculations for experiments with binary outcomes (see design_binary_size, ultimately referencing __helper_calc_empirical_power).

One could rely on approximations to arrive at an analytical expression for power. First, consider a variance-stabilising transformation of the proportions in the control ( $p_1$ ) and the treated group ( $p_2$ ) and express the power of a two-sided two-sample test for proportions as:

$$(1-\beta) = \Phi \left( \Phi^{-1}\left( \frac{\alpha}{2} \right) - 2 \left( \arcsin \sqrt{p_1} - \arcsin \sqrt{p_2} \right) \sqrt{\frac{n}{2}}\right) + \left(1 - \Phi \left( \Phi^{-1}\left( 1 - \frac{\alpha}{2} \right) - 2 \left( \arcsin \sqrt{p_1} - \arcsin \sqrt{p_2} \right) \sqrt{\frac{n}{2}}\right)\right)$$

and search for either of $<\beta,p_1,p_2,\alpha,n>$, holding the other four fixed, such that the function reaches zero.

Second, when $n$ is large enough one could rely on Normal approximations of the binomial distribution and express power of the two-sided test as

$$(1-\beta) = \Phi \left( \frac{ \sqrt{n} \left| p_1 - p_2 \right| + \Phi^{-1} \left( \frac{\alpha}{2} \right) \sqrt{ \left( p_1 + p_2 \right) \left( 1 - \frac{p_1 + p_2}{2} \right)} } { \sqrt{p_1 \left( 1 - p_1 \right) + p_2 \left( 1 - p_2 \right)} }\right)$$

and perform the same search.

Let us analytically solve a problem from your 4_usage_example_binary_design.ipynb: find $n$ such that we are able to detect a 5% increase in the experimental group proportion vis-à-vis the control group proportion of 5%, with a type-I error of 5% and a type-II error of 20%. In R parlance, the solution is:

effect <- 1.05

p1 <- 0.05
p2 <- 0.05*effect
sig.level <- 0.05
power <- 0.8
tol <- .Machine$double.eps^0.25

# Variance-stabilising transformation
h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

p.asin <- quote({pnorm(qnorm(sig.level/2, lower = F) - h * sqrt(n/2), lower = F) + pnorm(qnorm(sig.level/2, lower = T) - h * sqrt(n/2), lower = T)})

# Normal approximation of the binomial distribution
p.normal <- quote(pnorm((sqrt(n) * abs(p1 - p2) - (qnorm(sig.level/2, lower.tail = F) * sqrt((p1 + p2) * (1 - (p1 + p2)/2))))/sqrt(p1 * (1 - p1) + p2 * (1 - p2))))

# Solve for n
n.asin <- stats::uniroot(function(n) eval(p.asin) - power, c(2 + 1e-10, 1e+09))$root

n.normal <- stats::uniroot(function(n) eval(p.normal) - power, c(2 + 1e-10, 1e+09))$root

# What is n to achieve the MDE of interest under two approximations?
n.asin # 122106.8
n.normal # 122123.5

This is a self-contained solution that could be easily translated into Python. It is taken from the existing routines:

# Variance stabilising transformation-based
pwr::pwr.2p.test(h = ES.h(0.05, 0.05*effect), power = 0.8, sig.level = 0.05)
#     Difference of proportion power calculation for binomial distribution (arcsine transformation) 
#
#              h = 0.01133831
#              n = 122106.8
#      sig.level = 0.05
#          power = 0.8
#    alternative = two.sided
#
#NOTE: same sample sizes

# Normal approximation-based
stats::power.prop.test(n = NULL, p1 = 0.05, p2 = 0.05*effect, power = 0.8, sig.level = 0.05) 
#     Two-sample comparison of proportions power calculation 
#
#              n = 122123.5
#             p1 = 0.05
#             p2 = 0.0525
#      sig.level = 0.05
#          power = 0.8
#    alternative = two.sided
#
#NOTE: n is number in *each* group

I think offering analytical methods in binary designs using the above approximations could be a valuable alternative to your simulation-based power calculations since the former are commonplace in statistics.
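To show how easily the R sketch above translates, here is the same search in Python with scipy root finding (same inputs and search interval):

```python
from math import asin, sqrt

from scipy.optimize import brentq
from scipy.stats import norm

effect = 1.05
p1, p2 = 0.05, 0.05 * effect
sig_level, power = 0.05, 0.8

# Variance-stabilising (arcsine) transformation
h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def power_asin(n):
    """Power of the two-sided test under the arcsine transformation."""
    shift = h * sqrt(n / 2)
    return norm.sf(norm.isf(sig_level / 2) - shift) + norm.cdf(norm.ppf(sig_level / 2) - shift)

def power_normal(n):
    """Power of the two-sided test under the Normal approximation of the binomial."""
    num = sqrt(n) * abs(p1 - p2) - norm.isf(sig_level / 2) * sqrt((p1 + p2) * (1 - (p1 + p2) / 2))
    return norm.cdf(num / sqrt(p1 * (1 - p1) + p2 * (1 - p2)))

# Solve for n (per group); should reproduce n.asin and n.normal above
n_asin = brentq(lambda n: power_asin(n) - power, 2 + 1e-10, 1e9)
n_normal = brentq(lambda n: power_normal(n) - power, 2 + 1e-10, 1e9)
```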

`Splitter` multigroup fractional split feature

The current fractional split feature of the Splitter class only supports splitting into two groups.
In some tasks, it is necessary to make a multigroup partition of a given table. It would be nice to extend our functionality with such a feature.
I think it would be convenient for users to control the division of fractions between groups using an analogue of the part_of_table parameter, but in the form of a list/iterable: [0.5, 0.1, 0.1, 0.1, 0.1, 0.1].
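A sketch of what the list-valued parameter could do, in plain numpy/pandas (the parameter shape and function name are a proposal, not the current API):

```python
import numpy as np
import pandas as pd

def fractional_split(dataframe: pd.DataFrame, parts: list, seed: int = 0) -> pd.DataFrame:
    """Split rows into len(parts) groups holding the given fractions of the table."""
    assert sum(parts) <= 1.0
    rng = np.random.default_rng(seed)
    positions = rng.permutation(len(dataframe))
    sizes = (np.asarray(parts) * len(dataframe)).astype(int)
    bounds = np.concatenate([[0], np.cumsum(sizes)])
    out = dataframe.copy()
    out["group"] = None  # rows beyond sum(parts) stay unassigned
    for i, label in enumerate("ABCDEFGH"[: len(parts)]):
        out.iloc[positions[bounds[i]:bounds[i + 1]], out.columns.get_loc("group")] = label
    return out

df = pd.DataFrame({"id": range(10_000)})
result = fractional_split(df, parts=[0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
```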

PySpark statistical criteria implementation

In order to extend the Ambrosia functionality for working with Spark data to an acceptable level, it is necessary to implement a set of PySpark statistical criteria classes at ambrosia.spark_tools.stat_criteria:

  • Independent T-test
  • Relative T-test
  • Mann–Whitney test
  • Wilcoxon test
  • Bootstrap criterion (+relative sampling functionality)
  • Shapiro-Wilk test

Fittable `RobustPreprocessor`

The current RobustPreprocessor class dynamically calculates quantile values for a given set of columns and removes outliers from them during the execution of the run method.

In some problems, we need to remove outliers based on pre-selected quantile values. For example, if we have a treatment applied to group B and control group A, it is necessary to clean up outliers using pre-experimental data in order to perform the experiment correctly.

To do this, it is necessary to reconsider the structure of the RobustPreprocessor class.
One way to solve this problem is to implement fit and transform methods. Once the class has storable parameters, such as per-column quantile values, store- and load-like methods become essential as well.

It is also worth keeping some ability of the class to remove outliers without any fitting.
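A minimal fit/transform sketch of the idea (the class, method names, and alpha parameter are illustrative, not the current Ambrosia API):

```python
import json

import pandas as pd

class FittableRobust:
    """Remove outliers using quantile bounds learned on pre-experimental data."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.bounds_ = {}

    def fit(self, dataframe: pd.DataFrame, columns):
        """Memorise lower/upper quantile bounds per column."""
        for col in columns:
            self.bounds_[col] = (dataframe[col].quantile(self.alpha),
                                 dataframe[col].quantile(1 - self.alpha))
        return self

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Drop rows falling outside the learned bounds."""
        mask = pd.Series(True, index=dataframe.index)
        for col, (lo, hi) in self.bounds_.items():
            mask &= dataframe[col].between(lo, hi)
        return dataframe[mask]

    def store(self, path: str) -> None:
        """Save learned bounds, as the issue suggests for store-like methods."""
        with open(path, "w") as f:
            json.dump({c: [float(lo), float(hi)] for c, (lo, hi) in self.bounds_.items()}, f)

pre_period = pd.DataFrame({"metric": [float(v) for v in range(100)]})
processor = FittableRobust(alpha=0.05).fit(pre_period, ["metric"])
cleaned = processor.transform(pre_period)
```

Fitting on pre-experimental data and transforming both groups with the same bounds avoids leaking treatment effects into the outlier thresholds.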

Absolute effects support in `Designer`

The utility of implementing a delta_type parameter in the Designer methods needs to be discussed.
This parameter is dedicated to handling relative and absolute effect types.

Implementation of basic PySpark data preprocessing methods

For preprocessing pandas data and speeding up experiments, we have the Preprocessor class and a number of base classes with single responsibilities in the preprocessing module.
These methods should be implemented for Spark dataframes, in the same paradigm as we have for the Designer and the Splitter.

At the moment, the implementation of the following methods is essential:

  1. Aggregation
  2. Outliers removal (robust)
  3. CUPED

Metric split for spark tables

Metric split is not supported for Spark tables.
A simple version with one covariate column (fit_columns) can be implemented via sorting.
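The sorting idea in a nutshell, illustrated on pandas (a Spark version would sort by the covariate and assign groups within a window; the function name is illustrative):

```python
import numpy as np
import pandas as pd

def metric_split(dataframe: pd.DataFrame, fit_column: str, n_groups: int = 2, seed: int = 0) -> pd.DataFrame:
    """Sort by the covariate, then randomly permute group labels within each block of neighbours."""
    rng = np.random.default_rng(seed)
    out = dataframe.sort_values(fit_column).reset_index(drop=True)
    n_blocks = len(out) // n_groups
    out = out.iloc[: n_blocks * n_groups].copy()  # drop the incomplete tail block
    perm = np.concatenate([rng.permutation(n_groups) for _ in range(n_blocks)])
    out["group"] = np.array(list("ABCDEFGH"))[perm]
    return out

df = pd.DataFrame({"metric": np.random.default_rng(1).normal(size=1_000)})
result = metric_split(df, fit_column="metric", n_groups=2)
```

Because each block contains covariate neighbours, the groups end up balanced on the covariate by construction.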

`Designer` relative effect empiric design

When we use the Designer class to design a parameter of interest, we operate on effect input values in the following relative form: [1.01, 1.02, ...].
This is a pretty handy notation, but downstream our calculations and stat criteria actually use the absolute form of the effect, not these relative values.

This may be fine for the theoretical approach; for empirical approaches we can make adjustments and start distinguishing between relative and absolute effects.


For empirical methods, we can implement the same functionality in the Designer class as in the Tester: handling "absolute" and "relative" effects.

One way to do this is to instantiate a Tester inside the empirical methods (mainly the stat_criterion_power method) and pass all necessary arguments to it. The Tester class already implements all the statistical tests in the package.

The notation of relative effects mentioned earlier could remain the same, but it would now be accompanied by an additional effect_type argument, passed to the Designer and further to the Tester, with two possible values: "absolute" (default) and "relative".

Linearization of ratio metrics

In order to deal with ratio metrics correctly, it is useful to implement classes in the preprocessing module that perform linearization of these metrics.

These should look like standard Ambrosia classes, supporting the Taylor linearization technique and the approach from the Yandex article.
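For intuition, the Taylor (first-order) linearization replaces the ratio metric with a per-user linear metric whose coefficient is fixed on the control group (column names here are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 5_000),
    "clicks": rng.poisson(2.0, size=10_000),
    "views": rng.poisson(10.0, size=10_000) + 1,  # +1 avoids zero denominators
})

# The linearization coefficient is the control-group ratio
control = df[df["group"] == "A"]
k = control["clicks"].sum() / control["views"].sum()

# Per-user linear metric; standard per-user criteria (t-test etc.) now apply
df["lin"] = df["clicks"] - k * df["views"]
```

The mean of the linearized metric is zero on the control group by construction, and its between-group difference matches the change in the ratio to first order.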

`pandas` version fix

In the requirements we have pandas version >=0.24.0; however, some code in the Designer class (specifically the pivot_table call) crashes when the pandas version is less than 1.3.0.

This needs to be fixed, and it can be done in two ways:

  1. Upgrade the pandas version in the requirements and check that everything is okay
  2. Rewrite the code around pivot_table() and keep the older version of the dependency

Short error snippet

get_empirical_table_sample_size
    report = report.pivot_table(
TypeError: pivot_table() got an unexpected keyword argument 'sort'

Paired bootstrap criterion

The BootstrapStats class is currently not suitable for scenarios where objects in groups are paired (dependent).

For these tasks, we must use consistent sampling: at each step we select a dependent pair of objects from the experimental groups, rather than sampling independent objects individually.

It is necessary to implement consistent sampling of objects for BootstrapStats.
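Consistent (paired) sampling means resampling pair indices, not the two groups separately; a plain-numpy sketch of the idea (function name is illustrative):

```python
import numpy as np

def paired_bootstrap_ci(group_a, group_b, n_resamples=5_000, alpha=0.05, seed=0):
    """CI for the mean difference of paired groups: resample pair indices jointly."""
    a, b = np.asarray(group_a), np.asarray(group_b)
    assert a.shape == b.shape, "paired groups must have equal length"
    rng = np.random.default_rng(seed)
    # one shared index array per draw keeps each (a_i, b_i) pair together
    idx = rng.integers(0, len(a), size=(n_resamples, len(a)))
    diffs = (b[idx] - a[idx]).mean(axis=1)
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(1)
pre = rng.normal(0.0, 1.0, size=1_000)
post = pre + 0.3 + rng.normal(0.0, 0.1, size=1_000)  # dependent pair, true shift 0.3
lo, hi = paired_bootstrap_ci(pre, post)
```

Resampling the groups independently would ignore the correlation and grossly overstate the variance of the difference.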

Fractional split bug on duplicated dataframes indices

The fractional split feature of the Splitter returns an undesired result when one tries to split a pandas dataframe with duplicated indices without passing any argument for id_column.

The following examples illustrate the bug.

Let's create a dataframe with duplicated indices:

import numpy as np
import pandas as pd

# Create separate dfs
df_1 = pd.DataFrame(np.random.normal(size=(5_000, )),
                   columns=["metric_val"])
df_1['frame'] = 1

df_2 = pd.DataFrame(np.random.normal(size=(5_000, )),
                   columns=["metric_val"])
df_2['frame'] = 2

# Concat and shuffle
dataframe = pd.concat([df_1, df_2]).sample(frac=1)

Now perform a fractional split on it:

from ambrosia.splitter import Splitter

# Create `Splitter` instance and make split based on dataframe index (no `id_column` provided)
splitter = Splitter()
factor = 0.5

result_1 = splitter.run(dataframe=dataframe, 
                        method='hash', 
                        part_of_table=factor,
                        salt='bug')
result_1.group.value_counts()

# Output:
# A    15000
# B    10000
# Name: group, dtype: int64

So, some of the objects are duplicated after the split and appear in the groups several times.
We can see that the groups together are larger than the original dataframe.


This behaviour does not occur if we split the dataframe on a column with duplicated ids.

# Create column from dataframe indices and split on it

dataframe = dataframe.reset_index().rename(columns={'index': 'id_column'})

result_2 = splitter.run(dataframe=dataframe, 
                        id_column='id_column',
                        method='hash', 
                        part_of_table=factor,
                        salt='bug')

result_2.group.value_counts()

# Output:
# A    5000
# B    5000
# Name: group, dtype: int64

But if we look deeper, there is another unusual behaviour:

# Let's count objects origin dataframe frequencies in group A

result_2[result_2.group == 'A'].frame.value_counts()

# Output:
# 1    2500
# 2    2500
# Name: frame, dtype: int64

Objects from two original dataframes appear in the group equally, which in general is not desired.
This should be inspected further.


The bug was not checked on the Spark implementation of the same methods, but care should be taken for them as well.

Finally, I want to add that duplicated indices in the id column are undesirable in the vast majority of splitting tasks.
It would be nice to add a duplicated id check to the Splitter and warn the user via the logger.
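The proposed guard could be as simple as the following sketch (the function and logger names are illustrative):

```python
import logging

import pandas as pd

logger = logging.getLogger("ambrosia")

def check_duplicated_ids(dataframe: pd.DataFrame, id_column: str = None) -> bool:
    """Warn if the split key (index or id column) contains duplicates."""
    key = dataframe.index if id_column is None else dataframe[id_column]
    n_dups = int(key.duplicated().sum())
    if n_dups:
        logger.warning(
            "Found %d duplicated ids; hash-based splits may produce oversized groups", n_dups
        )
    return n_dups > 0
```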

`Designer` for multi-group A/B tests

  1. Check the concept and practice of multi-group A/B tests.
  2. Implement a raw structure for the new Designer.
  3. Propose a new Designer architecture covering classical / multi-group tests.

Loadable and storable `Preprocessor`

In the current implementation of the Preprocessor, it is possible to load the parameters of the cuped and multicuped methods using a path to a json file.

It would be good to develop user-convenient Preprocessor methods that allow storing and loading the entire instance using, for example, a json file.

This problem depends on the ability of RobustPreprocessor to deal with parameter storage #14.
We can also consider the ability of AggregatePreprocessor to save and load parameters.

Ratio Metrics

Are there any plans to add support for ratio metrics
(linearization or the delta method)?
Or could you give examples of how to work with them within Ambrosia?
