
Bayesian A/B testing

bayesian_testing is a small package for quick evaluation of A/B (or A/B/C/...) tests using a Bayesian approach.

Implemented tests:

  • BinaryDataTest
    • Input data - binary data ([0, 1, 0, ...])
    • Designed for conversion-like data A/B testing.
  • NormalDataTest
    • Input data - normal data with unknown variance
    • Designed for normal data A/B testing.
  • DeltaLognormalDataTest
    • Input data - lognormal data with zeros
    • Designed for revenue-like data A/B testing.
  • DeltaNormalDataTest
    • Input data - normal data with zeros
    • Designed for profit-like data A/B testing.
  • DiscreteDataTest
    • Input data - categorical data with numerical categories
    • Designed for discrete data A/B testing (e.g. dice rolls, star ratings, 1-10 ratings, etc.).
  • PoissonDataTest
    • Input data - non-negative integers ([1, 0, 3, ...])
    • Designed for Poisson data A/B testing.
  • ExponentialDataTest
    • Input data - exponential data (non-negative real numbers)
    • Designed for exponential data A/B testing (e.g. session/waiting time, time between events, etc.).

Implemented evaluation metrics:

  • Probability of Being Best
    • Probability that a given variant is best among all variants.
    • By default, "best" means the greatest (from a data/metric point of view); this can be changed by passing min_is_best=True to the evaluation method (useful when the tested measure should be minimized).
  • Expected Loss
    • "Risk" of choosing particular variant over other variants in the test.
    • Measured in the same units as a tested measure (e.g. positive rate or average value).

Both evaluation metrics are calculated using simulations from the posterior distributions (given the observed data).
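
As a rough sketch of the general idea (illustrative only, not the package's exact implementation), both metrics can be estimated from posterior samples, here assuming Beta posteriors for two binary variants:

import numpy as np

rng = np.random.default_rng(52)
sim_count = 20000

# posterior samples for two hypothetical binary variants
# (Beta with positives + prior and negatives + prior):
samples = np.stack([
    rng.beta(80 + 0.5, 1500 - 80 + 0.5, sim_count),  # variant A
    rng.beta(80 + 0.5, 1200 - 80 + 0.5, sim_count),  # variant B
])

best = samples.max(axis=0)
for i, name in enumerate(["A", "B"]):
    prob_being_best = (samples[i] == best).mean()  # how often the variant wins
    expected_loss = (best - samples[i]).mean()     # average shortfall vs. the best
    print(name, prob_being_best, expected_loss)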

Installation

bayesian_testing can be installed using pip:

pip install bayesian_testing

Alternatively, you can clone the repository and use poetry manually:

cd bayesian_testing
pip install poetry
poetry install
poetry shell

Basic Usage

The primary features are the test classes:

  • BinaryDataTest
  • NormalDataTest
  • DeltaLognormalDataTest
  • DeltaNormalDataTest
  • DiscreteDataTest
  • PoissonDataTest
  • ExponentialDataTest

All test classes support two methods to insert the data:

  • add_variant_data - Adds raw data for a variant as a list of observations (or a NumPy 1-D array).
  • add_variant_data_agg - Adds aggregated variant data (this can be practical for large data, as the aggregation can already be done at the database level; see the example below).

Both methods for adding data allow specification of prior distributions (see details in the respective docstrings). The default prior setup should be sufficient for most cases (e.g. cases with unknown priors or large amounts of data).
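
For illustration, for NormalDataTest the aggregates are the number of observations, the sum of values, and the sum of squared values, so the following two variants carry identical information (a small sketch; in practice the sums could come from a database query):

import numpy as np
from bayesian_testing.experiments import NormalDataTest

data = np.array([7.1, 6.8, 7.5, 7.0])

test = NormalDataTest()
# raw observations:
test.add_variant_data("raw", data)
# the same information passed as pre-computed aggregates:
test.add_variant_data_agg("agg", len(data), float(data.sum()), float(np.square(data).sum()))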

To get the results of the test, simply call the evaluate method.

Probabilities of being best and expected loss are approximated using simulations, hence the evaluate method can return slightly different values across runs. To stabilize the results, you can set the sim_count parameter of evaluate to a higher value (the default is 20,000), or use the seed parameter to fix the results completely.
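
For example:

# a higher sim_count and a fixed seed make repeated evaluations reproducible:
results = test.evaluate(sim_count=200000, seed=52)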

BinaryDataTest

Class for a Bayesian A/B test for binary-like data (e.g. conversions, successes, etc.).

Example:

import numpy as np
from bayesian_testing.experiments import BinaryDataTest

# generating some random data
rng = np.random.default_rng(52)
# random 1x1500 array of 0/1 data with 5.2% probability for 1:
data_a = rng.binomial(n=1, p=0.052, size=1500)
# random 1x1200 array of 0/1 data with 6.7% probability for 1:
data_b = rng.binomial(n=1, p=0.067, size=1200)

# initialize a test:
test = BinaryDataTest()

# add variant using raw data (arrays of zeros and ones):
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
# priors can be specified like this (default for this test is a=b=1/2):
# test.add_variant_data("B", data_b, a_prior=1, b_prior=20)

# add variant using aggregated data (same as raw data with 950 zeros and 50 ones):
test.add_variant_data_agg("C", totals=1000, positives=50)

# evaluate test:
results = test.evaluate()
results # print(pd.DataFrame(results).to_markdown(tablefmt="grid", index=False))
+---------+--------+-----------+---------------+----------------+-----------------+---------------+
| variant | totals | positives | positive_rate | posterior_mean | prob_being_best | expected_loss |
+=========+========+===========+===============+================+=================+===============+
| A       |   1500 |        80 |       0.05333 |        0.05363 |         0.067   |     0.0138102 |
+---------+--------+-----------+---------------+----------------+-----------------+---------------+
| B       |   1200 |        80 |       0.06667 |        0.06703 |         0.88975 |     0.0004622 |
+---------+--------+-----------+---------------+----------------+-----------------+---------------+
| C       |   1000 |        50 |       0.05    |        0.05045 |         0.04325 |     0.0169686 |
+---------+--------+-----------+---------------+----------------+-----------------+---------------+
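
Note: The posterior_mean column follows directly from the Beta posterior: with the default Beta(1/2, 1/2) prior, it equals (positives + 1/2) / (totals + 1), e.g. (80 + 0.5) / (1500 + 1) ≈ 0.05363 for variant A.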

NormalDataTest

Class for a Bayesian A/B test for normal data.

Example:

import numpy as np
from bayesian_testing.experiments import NormalDataTest

# generating some random data
rng = np.random.default_rng(21)
data_a = rng.normal(7.2, 2, 1000)
data_b = rng.normal(7.1, 2, 800)
data_c = rng.normal(7.0, 4, 500)

# initialize a test:
test = NormalDataTest()

# add variant using raw data:
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
# test.add_variant_data("C", data_c)

# add variant using aggregated data:
test.add_variant_data_agg("C", len(data_c), sum(data_c), sum(np.square(data_c)))

# evaluate test:
results = test.evaluate(sim_count=20000, seed=52, min_is_best=False)
results # print(pd.DataFrame(results).to_markdown(tablefmt="grid", index=False))
+---------+--------+------------+------------+----------------+-----------------+---------------+
| variant | totals | sum_values | avg_values | posterior_mean | prob_being_best | expected_loss |
+=========+========+============+============+================+=================+===============+
| A       |   1000 |    7294.68 |    7.29468 |        7.29462 |         0.1707  |     0.196874  |
+---------+--------+------------+------------+----------------+-----------------+---------------+
| B       |    800 |    5685.86 |    7.10733 |        7.10725 |         0.00125 |     0.385112  |
+---------+--------+------------+------------+----------------+-----------------+---------------+
| C       |    500 |    3736.92 |    7.47383 |        7.4737  |         0.82805 |     0.0169998 |
+---------+--------+------------+------------+----------------+-----------------+---------------+

DeltaLognormalDataTest

Class for a Bayesian A/B test for delta-lognormal data (log-normal data with zeros). Delta-lognormal data is a typical case of revenue-per-session data, where many sessions have zero revenue and the non-zero values are positive, with an approximately log-normal distribution. To handle this, the calculation combines a binary Bayesian model for zero vs. non-zero "conversions" with a log-normal model for the non-zero values.
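
Conceptually, the compared quantity is the average value per session, which factors into a conversion probability and the mean of the positive values. A rough sketch of the idea (illustrative only; the package samples the log-normal parameters from a proper posterior rather than plugging in point estimates):

import numpy as np

rng = np.random.default_rng(21)
sim_count = 20000

# variant B from the example below: 25 sessions, 12 with non-zero revenue
totals, positives = 25, 12
positive_values = [4.0, 3.3, 19.3, 18.5, 12.9, 10.2, 23.1, 3.7, 11.3, 10.0, 18.3, 12.1]
logs = np.log(positive_values)

# Beta posterior samples for the probability of a non-zero value:
p = rng.beta(positives + 0.5, totals - positives + 0.5, sim_count)

# crude plug-in estimate of the log-normal mean, E[value | value > 0] = exp(mu + var/2):
lognormal_mean = np.exp(logs.mean() + logs.var() / 2)

# posterior samples of the average value per (all) sessions:
avg_value_samples = p * lognormal_mean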

Example:

import numpy as np
from bayesian_testing.experiments import DeltaLognormalDataTest

test = DeltaLognormalDataTest()

data_a = [7.1, 0.3, 5.9, 0, 1.3, 0.3, 0, 1.2, 0, 3.6, 0, 1.5, 2.2, 0, 4.9, 0, 0, 1.1, 0, 0, 7.1, 0, 6.9, 0]
data_b = [4.0, 0, 3.3, 19.3, 18.5, 0, 0, 0, 12.9, 0, 0, 0, 10.2, 0, 0, 23.1, 0, 3.7, 0, 0, 11.3, 10.0, 0, 18.3, 12.1]

# adding variant using raw data:
test.add_variant_data("A", data_a)
# test.add_variant_data("B", data_b)

# alternatively, a variant can also be added using aggregated data
# (it looks more complicated, but it can be quite handy for large data):
test.add_variant_data_agg(
    name="B",
    totals=len(data_b),
    positives=sum(x > 0 for x in data_b),
    sum_values=sum(data_b),
    sum_logs=sum([np.log(x) for x in data_b if x > 0]),
    sum_logs_2=sum([np.square(np.log(x)) for x in data_b if x > 0])
)

# evaluate test:
results = test.evaluate(seed=21)
results # print(pd.DataFrame(results).to_markdown(tablefmt="grid", index=False))
+---------+--------+-----------+------------+------------+---------------------+-----------------+---------------+
| variant | totals | positives | sum_values | avg_values | avg_positive_values | prob_being_best | expected_loss |
+=========+========+===========+============+============+=====================+=================+===============+
| A       |     24 |        13 |       43.4 |    1.80833 |             3.33846 |         0.04815 |      4.09411  |
+---------+--------+-----------+------------+------------+---------------------+-----------------+---------------+
| B       |     25 |        12 |      146.7 |    5.868   |            12.225   |         0.95185 |      0.158863 |
+---------+--------+-----------+------------+------------+---------------------+-----------------+---------------+

Note: Alternatively, DeltaNormalDataTest can be used for cases where the non-zero values are not necessarily positive.

DiscreteDataTest

Class for a Bayesian A/B test for discrete data with a finite number of numerical categories (states) representing some value. This test can be used, for instance, for dice roll data (when looking for the "best" of multiple dice) or rating data (e.g. 1-5 stars or a 1-10 scale).
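
Under the hood, this uses a Dirichlet distribution over the category probabilities (as reflected in the concentration column of the results below). A minimal sketch of the idea, assuming a uniform prior of 1 per state:

import numpy as np

rng = np.random.default_rng(52)
states = np.array([1, 2, 3, 4, 5, 6])

# observed counts per state (die C below) plus an assumed uniform prior of 1:
concentration = np.array([1, 0, 1, 1, 1, 1]) + 1

# posterior samples of the category probabilities and of the average value:
prob_samples = rng.dirichlet(concentration, size=20000)  # shape (20000, 6)
avg_value_samples = prob_samples @ states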

Example:

from bayesian_testing.experiments import DiscreteDataTest

# dice rolls data for 3 dice - A, B, C
data_a = [2, 5, 1, 4, 6, 2, 2, 6, 3, 2, 6, 3, 4, 6, 3, 1, 6, 3, 5, 6]
data_b = [1, 2, 2, 2, 2, 3, 2, 3, 4, 2]
data_c = [1, 3, 6, 5, 4]

# initialize a test with all possible states (i.e. numerical categories):
test = DiscreteDataTest(states=[1, 2, 3, 4, 5, 6])

# add variant using raw data:
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
test.add_variant_data("C", data_c)

# add variant using aggregated data:
# test.add_variant_data_agg("C", [1, 0, 1, 1, 1, 1]) # equivalent to rolls in data_c

# evaluate test:
results = test.evaluate(sim_count=20000, seed=52, min_is_best=False)
results # print(pd.DataFrame(results).to_markdown(tablefmt="grid", index=False))
+---------+--------------------------------------------------+---------------+-----------------+---------------+
| variant | concentration                                    | average_value | prob_being_best | expected_loss |
+=========+==================================================+===============+=================+===============+
| A       | {1: 2.0, 2: 4.0, 3: 4.0, 4: 2.0, 5: 2.0, 6: 6.0} |           3.8 |         0.54685 |      0.199953 |
+---------+--------------------------------------------------+---------------+-----------------+---------------+
| B       | {1: 1.0, 2: 6.0, 3: 2.0, 4: 1.0, 5: 0.0, 6: 0.0} |           2.3 |         0.008   |      1.18268  |
+---------+--------------------------------------------------+---------------+-----------------+---------------+
| C       | {1: 1.0, 2: 0.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0} |           3.8 |         0.44515 |      0.287025 |
+---------+--------------------------------------------------+---------------+-----------------+---------------+

PoissonDataTest

Class for a Bayesian A/B test for Poisson data.

Example:

from bayesian_testing.experiments import PoissonDataTest

# goals received - so fewer is better (duh...)
psg_goals_against = [0, 2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 3, 1, 0]
city_goals_against = [0, 0, 3, 2, 0, 1, 0, 3, 0, 1, 1, 0, 1, 2]
bayern_goals_against = [1, 0, 0, 1, 1, 2, 1, 0, 2, 0, 0, 2, 2, 1, 0]

# initialize a test:
test = PoissonDataTest()

# add variant using raw data:
test.add_variant_data('psg', psg_goals_against)

# example with specific priors
# ("b_prior" as an effective sample size, and "a_prior/b_prior" as a prior mean):
test.add_variant_data('city', city_goals_against, a_prior=3, b_prior=1)
# test.add_variant_data('bayern', bayern_goals_against)

# add variant using aggregated data:
test.add_variant_data_agg("bayern", len(bayern_goals_against), sum(bayern_goals_against))

# evaluate test (since fewer goals is better, we explicitly set min_is_best to True)
results = test.evaluate(sim_count=20000, seed=52, min_is_best=True)
results # print(pd.DataFrame(results).to_markdown(tablefmt="grid", index=False))
+---------+--------+------------+------------------+----------------+-----------------+---------------+
| variant | totals | sum_values | observed_average | posterior_mean | prob_being_best | expected_loss |
+=========+========+============+==================+================+=================+===============+
| psg     |     15 |          9 |          0.6     |        0.60265 |         0.78175 |     0.0369998 |
+---------+--------+------------+------------------+----------------+-----------------+---------------+
| city    |     14 |         14 |          1       |        1.13333 |         0.0344  |     0.562055  |
+---------+--------+------------+------------------+----------------+-----------------+---------------+
| bayern  |     15 |         13 |          0.86667 |        0.86755 |         0.18385 |     0.300335  |
+---------+--------+------------+------------------+----------------+-----------------+---------------+

Note: Since we set min_is_best=True (because goals received are "bad"), the probability and expected loss are in favor of variants with lower posterior means.
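
The effect of the priors is visible in the posterior_mean column: for 'city' with the Gamma(a=3, b=1) prior, the Gamma-Poisson posterior mean equals (a_prior + sum_values) / (b_prior + totals) = (3 + 14) / (1 + 14) ≈ 1.13333, slightly above the observed average of 1.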

ExponentialDataTest

Class for a Bayesian A/B test for exponential data.

Example:

import numpy as np
from bayesian_testing.experiments import ExponentialDataTest

# waiting times for 3 different variants, each with many observations,
# generated using exponential distributions with defined scales (expected values)
waiting_times_a = np.random.exponential(scale=10, size=200)
waiting_times_b = np.random.exponential(scale=11, size=210)
waiting_times_c = np.random.exponential(scale=11, size=220)

# initialize a test:
test = ExponentialDataTest()
# adding variants using the observation data:
test.add_variant_data('A', waiting_times_a)
test.add_variant_data('B', waiting_times_b)
test.add_variant_data('C', waiting_times_c)

# alternatively, add variants using aggregated data:
# test.add_variant_data_agg('A', len(waiting_times_a), sum(waiting_times_a))

# evaluate test (since a lower waiting time is better, we explicitly set min_is_best to True)
results = test.evaluate(sim_count=20000, min_is_best=True)
results # print(pd.DataFrame(results).to_markdown(tablefmt="grid", index=False))
+---------+--------+------------+------------------+----------------+-----------------+---------------+
| variant | totals | sum_values | observed_average | posterior_mean | prob_being_best | expected_loss |
+=========+========+============+==================+================+=================+===============+
| A       |    200 |    1884.18 |          9.42092 |        9.41671 |         0.89785 |     0.0395505 |
+---------+--------+------------+------------------+----------------+-----------------+---------------+
| B       |    210 |    2350.03 |         11.1906  |       11.1858  |         0.03405 |     1.80781   |
+---------+--------+------------+------------------+----------------+-----------------+---------------+
| C       |    220 |    2380.65 |         10.8211  |       10.8167  |         0.0681  |     1.4408    |
+---------+--------+------------+------------------+----------------+-----------------+---------------+

Development

To set up a development environment, use Poetry and pre-commit:

pip install poetry
poetry install
poetry run pre-commit install

To be implemented

Additional metrics:

  • Potential Value Remaining
