zalando / expan
Open-source Python library for statistical analysis of randomised control trials (A/B tests)
License: MIT License
To bin a list x into N bins, one could simply go for the bin index given by binIndex = hash(x[i]) % N; see the sketch below. Also, use different default bin counts for categorical (e.g. 4) and numerical (e.g. 8) variables.
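A minimal sketch of that idea in plain Python (note that hash() of strings varies between interpreter runs unless PYTHONHASHSEED is fixed; the per-type defaults are the ones suggested above):

```python
def bin_index(value, n_bins):
    """Map a value to one of n_bins buckets via its hash."""
    return hash(value) % n_bins

# Hypothetical per-type defaults, as suggested above:
DEFAULT_N_BINS = {'categorical': 4, 'numerical': 8}

x = ['de', 'at', 'fr', 'de']
bins = [bin_index(v, DEFAULT_N_BINS['categorical']) for v in x]
```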
from expan.experiment import Experiment raises an ImportError: No module named experiment.
I have to change it to from expan.core.experiment import Experiment and then it works.
Also, from tests.test_data import generate_random_data doesn't work; I have to use from tests.tests_data import generate_random_data, but this raises another error: cannot import name generate_random_data.
The ExperimentData class can handle 2 kinds of input data as the metrics argument: one of them includes time_since_treatment in the input and should always be aggregated per unique entity and time point.

We're getting very good coverage percentages out of runtests htmlcov, but the numbers are quite misleading: at least some of the unit tests are not actually verifying the returned data properly. E.g. test_experiment.test__feature_check__computation() doesn't verify the feature check of the feature 'feature'...
We should: in any case, when we are performing a complex operation on a data set (like a feature check of a dataframe), be careful to really check the return value: otherwise all the code that gets hit by the called operation will be marked as 'covered' but will not actually have been tested! See the sketch below.
This is a very general problem... I'm sure others have had the same issue... I wonder how they've dealt with it?
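A sketch of what such a check could look like for the example above, assuming the Results API seen elsewhere in the test suite (the 'fc' and 'pre_treatment_diff' keys are hypothetical; use whatever feature_check() actually emits):

```python
import numpy as np

def test__feature_check__computation(self):
    result = self.data.feature_check(feature_subset=['feature'])
    # inspect the returned Results object instead of merely calling it;
    # 'fc' and 'pre_treatment_diff' are hypothetical statistic names
    df = result.statistic('fc', 'pre_treatment_diff', 'feature')
    self.assertFalse(df.empty)
    self.assertTrue(np.isfinite(df['value'].astype(float)).all())
```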
The functionality of weighted KPIs is missing in group_sequential, bayes_factor, and bayes_precision.
Two possible approaches:
Currently, although it's a subclass, there is no way to construct an Experiment from an ExperimentData, which is a bit counter-intuitive.
It's not really necessary (one can always just pass whatever was passed to the ExperimentData constructor directly to the Experiment), but we should allow this, e.g. via a classmethod as sketched below. Low priority.
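A minimal sketch of such a constructor, assuming ExperimentData exposes its inputs as attributes (the attribute names and constructor signature below are guesses, not the actual ExpAn API):

```python
from expan.core.experimentdata import ExperimentData  # module path is a guess

class Experiment(ExperimentData):
    @classmethod
    def from_experiment_data(cls, expdata, baseline_variant):
        # hand the data already held by the ExperimentData instance
        # back to the Experiment constructor (names are illustrative)
        return cls(expdata.kpis, expdata.features, expdata.metadata,
                   baseline_variant=baseline_variant)
```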
The following code in Experiment.delta:

    pattern = '([a-zA-Z][0-9a-zA-Z_]*)'
    self.kpis.loc[:, kpiName] = eval(re.sub(pattern, r'self.kpis.\1.astype(float)', dk['formula']))

will only work for formulas with simple ratios (and is also too hacky).
However, I found that some of the KPIs here require sum/count/>0, etc...
By doing so, we can deprecate the baseline_variant argument in the Experiment constructor, so that to create an Experiment object one only needs the kpi_df, feature_df and metadata, which is more intuitive.
Currently the regex does not match numbers when constructing derived metrics from the formula.
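A minimal sketch of a less hacky alternative, assuming every identifier in the formula is a KPI column: pandas' own DataFrame.eval resolves column names directly and leaves numeric literals untouched, which sidesteps the regex entirely (the formula below is hypothetical):

```python
import pandas as pd

kpis = pd.DataFrame({'revenue': [10.0, 20.0], 'orders': [2.0, 4.0]})

# Hypothetical derived-KPI formula: column names are resolved by eval
# itself, and numeric literals like 100 need no regex handling at all.
formula = 'revenue / orders * 100'
kpis['conversion_pct'] = kpis.eval(formula)
```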
Hi, do we need this one?
Unfortunately, that's just the way it is.
Will there be a version for Python 3?
In the current implementation, there are three intermediate layers of steps that are nearly always executed upon invocation of a particular analysis method of the Experiment class:

1. a method of the Experiment class (such as Experiment.delta(), for example) is called,
2. a _function_ local to experiment.py (such as _delta_all_variants()) is called from that method, and lastly,
3. a *_to_dataframe_* function is called, which is again analysis-specific.

Thus, the details of the analysis flow are spread over three independent levels of abstraction, making each of them not quite flexible and partially causing functionality and code duplication.
As I understood from the code, the percentile values for regular delta and group sequential delta are based on t statistics. E.g. in the result object, "pctile" could be (2.5, 50, 97.5) and "value" the corresponding t statistics (0.1, 0.8, 0.1).
On the other hand, the percentile values for the two Bayesian deltas use a fixed 0.95 credible interval: in the result object, "pctile" is always ("lower", "upper") and "value" is the corresponding index of the posterior distribution.
I think these are two completely different concepts, and if I understood correctly, shouldn't we put them into different columns in the result object?
Current: for the t-test we assume that the two populations have equal variances. The case of non-equal variances is not implemented.
Possible solution: add a Welch's t-test implementation for the case of unequal population variances; see the sketch below.
Note: this question should be investigated first: when is it necessary or helpful, and when is it not?
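A minimal sketch using SciPy, which already provides Welch's t-test via the equal_var flag of ttest_ind (the sample data here is synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.normal(0.0, 1.0, size=500)   # control
y = rng.normal(0.1, 2.0, size=400)   # treatment, different variance

# Standard (pooled-variance) t-test vs. Welch's t-test:
t_pooled, p_pooled = stats.ttest_ind(x, y, equal_var=True)
t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)
```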
The current implementation of the to_json() method in the Results class doesn't support trend() results.
This will break if one constructs the ExperimentData object with two separate feature and KPI data frames, in which case we need to merge them manually.
Currently, the ipython notebook (Expan-Intro.ipynb) is available locally as slides with the serve_intro_slides script.
It would be great to host the converted notebook on a public server somewhere so that prospective users can view it without cloning or running a local jupyter server.
Hello,
I'm still puzzled by the way Experiment.delta() passes the list of KPIs to analyse to its daughter functions (i.e. fixed horizon, group sequential, and the two Bayes funcs).
More specifically: Experiment.fixed_horizon_delta() expects the kpi_subset argument and at the same time fetches the list of KPIs from the passed res dataframe (https://github.com/zalando/expan/blob/dev/expan/core/experiment.py#L456), after which it kills the metadata of res. Why not rely solely on kpis_to_analyse and not care about the KPIs already present in res? Is there any hidden magic in that which I might be missing?
Currently, to import the main classes of expan's functionality, we need to either use prefixes:
import expan; expan.Experiment(...)
or we have to import the class through the intermediate folder name, which is not intuitive:
from expan.core.experiment import Experiment
Ideal would be to allow from expan.experiment import Experiment, although another possibility (and easier to implement) would be from expan import Experiment.
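A minimal sketch of the easier option, assuming the classes can simply be re-exported from the package root (the experimentdata module path is a guess):

```python
# expan/__init__.py -- re-export the main classes at package level
from expan.core.experiment import Experiment
from expan.core.experimentdata import ExperimentData  # path is a guess
```

After this, from expan import Experiment works as desired.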
To represent duration in the time_since_treatment column, we have 2 options; one of them is to keep time_since_treatment unitless and persist the time unit in the metadata [non-intuitive and probably error-prone].

There are also issues still in GHE from prior to open-sourcing... we should migrate those.
Experiment.delta() is now a dispatcher; however, these lines (https://github.com/zalando/expan/blob/dev/expan/core/experiment.py#L368) duplicate the code at https://github.com/zalando/expan/blob/dev/expan/core/experiment.py#L455.
Additional complication: the delta() function calls the other one, i.e. fixed_horizon_delta().

Under data, we still use the 'fetcher' terminology, despite all the advanced data-loading now living elsewhere. We should rename them data loaders, to make clearer that all they're doing is reading from a file and loading it into the ExperimentData structure.
In weighted KPIs, our implementation is based on the assumption that reference KPIs are never NaN.
We might have some miscomputation here: I saw from previous BQ tests that this column can have missing values. We may want to implement the weights in a way that does not rely on this assumption.
The goal of the current implementation of the Results structure is to enable the user to easily concatenate the results of multiple analyses (say delta() + sga()).
This functionality hasn't been used much, so it would be a good idea to rethink the Results structure and maybe aim to simplify it.
This probability will be calculated from the existing result data, based on the percentiles and the normal assumption.
Were temporarily removed after a change of structure... we need to re-enable them, especially because the Results.__str__ function relies on them.
My understanding is that the code at https://github.com/zalando/expan/blob/dev/expan/core/experiment.py#L456-L460 is not relevant for the call of delta_all_variants() at https://github.com/zalando/expan/blob/dev/expan/core/experiment.py#L475.
e.g. when the counts of observed categories in both groups are as follows:

treat_counts
female     39364
male        6561
unknown     4474

control_counts
f               2
female     152099
m               2
male        24299
unknown     15084

the observed frequency table is (row 0 = treatment, row 1 = control):

observed_ct
       f    female    m     male  unknown
0    0.0   39364.0  0.0   6561.0   4474.0
1    2.0  152099.0  2.0  24299.0  15084.0
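A minimal sketch of how such an aligned table can be built with pandas, using the counts from the example above: constructing the DataFrame from the two Series aligns them on the union of categories, and categories missing from one group become zeros instead of shifting columns around:

```python
import pandas as pd

treat_counts = pd.Series({'female': 39364, 'male': 6561, 'unknown': 4474})
control_counts = pd.Series({'f': 2, 'female': 152099, 'm': 2,
                            'male': 24299, 'unknown': 15084})

# rows are aligned on the union of the two index sets;
# categories absent from one group are filled with 0.0
observed_ct = pd.DataFrame([treat_counts, control_counts]).fillna(0.0)
print(observed_ct)
```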
Here in this line we have specified some additional parameters to StanModel.sampling: https://github.com/zalando/expan/blob/dev/expan/core/early_stopping.py#L189
Do we know what the effects of specifying stepsize and adapt_delta are in this case?
In the current implementation, all *_delta() methods accept a res argument, which is (oftentimes) an empty instance of the Results class. This instance is then updated with the results of the analysis.
On the other hand, other analysis methods, such as feature_check(), sga(), and trend(), create a Results instance on the fly, populate it with results and return it.
What is the motivation behind those two different approaches?
If metadata contains a UserWarning(...), to_json() fails.
Currently, requirements.txt is a bit over-zealous with versions: expan doesn't actually need the latest version (0.17) of scipy, for example.
To increase backward compatibility, we should go over what's actually required. Potentially a long-winded operation, I guess, but perhaps there's a tool to do it semi-automatically?
normal_difference calculates percentiles for the difference between normal distributions.
Since the difference between normal distributions is itself normally distributed, this functionality could be covered by transforming the input to the parameters of the difference distribution and passing them on to normal_percentiles; see the sketch below.
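A minimal sketch of that reduction, assuming a normal approximation and a normal_percentiles(mean, std, percentiles) helper with this (hypothetical) signature:

```python
import numpy as np
from scipy import stats

def normal_percentiles(mean, std, percentiles=(2.5, 97.5)):
    # hypothetical signature; the actual ExpAn helper may differ
    return [stats.norm.ppf(p / 100.0, loc=mean, scale=std)
            for p in percentiles]

def normal_difference(mean1, std1, n1, mean2, std2, n2,
                      percentiles=(2.5, 97.5)):
    # the difference of two independent normal sample means is itself
    # normal, with mean mean1 - mean2 and variance std1^2/n1 + std2^2/n2
    diff_mean = mean1 - mean2
    diff_std = np.sqrt(std1 ** 2 / n1 + std2 ** 2 / n2)
    return normal_percentiles(diff_mean, diff_std, percentiles)
```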
to enable the standard call expan.version
The parameter percentiles is lost during the following stack of calls:

-> experiment.group_sequential_delta (has the input parameter percentiles)
-> early_stopping.group_sequential (loses the input parameter percentiles)
-> statistics.normal_difference (has the parameter percentiles again, but now only the default value can be used, since the percentiles value the user passed was lost in the previous step)

A sketch of the fix follows below.
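A minimal sketch of the plumbing fix, with heavily simplified signatures (the real functions take more arguments): the point is just that percentiles must be forwarded at every level instead of being dropped in early_stopping.group_sequential.

```python
# simplified signatures, for illustration only
def normal_difference(mean1, std1, mean2, std2, percentiles=(2.5, 97.5)):
    ...  # existing statistics-level implementation

def group_sequential(x, y, percentiles=(2.5, 97.5)):
    # forward the caller's percentiles instead of silently using defaults
    return normal_difference(x.mean(), x.std(), y.mean(), y.std(),
                             percentiles=percentiles)

def group_sequential_delta(self, x, y, percentiles=(2.5, 97.5)):
    return group_sequential(x, y, percentiles=percentiles)
```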
It seems that the warnings, after having been put into the Results object, are not propagated to the calling context (like the command line), so the user doesn't necessarily see them.
This should be reverted.
Derived KPIs are passed to Experiment.fixed_horizon_delta() but never used there, and probably they shouldn't be, since they have already been taken care of in Experiment.delta() before dispatching.
In the case of conversion rate or return rate, the KPI can be defined either on the entity level or aggregated over all entities, and we probably want to support both.
After some discussion, we came up with the idea of reweighting the data of the individual entities to calculate the overall ratio statistics. This makes it possible to use the existing statistics.delta() function to calculate the overall ratio statistics (using the normal assumption or bootstrapping).
As an example, let's look at return rates, which are typically calculated (on the individual entity level) as the number of returned articles divided by the number of articles ordered by that entity. The overall ratio is a reweighting of individual_rr that reflects not the entities' contributions (e.g. contribution per customer) but equal contributions of all articles to the overall return rate (i.e. the return rate on an overall article basis). One can calculate the overall_rr from the individual_rr using the following reweighting (easily proved by paper and pencil):

$$\text{overall\_rr} = \frac{1}{n} \sum_{i=1}^{n} \alpha_i \,\text{individual\_rr}_i \quad \text{with} \quad \alpha_i = n \,\frac{\text{ARTICLES\_ORDERED}_i}{\sum_{j=1}^{n} \text{ARTICLES\_ORDERED}_j}$$

To have such functionality as a more generic approach in ExpAn, we can introduce a "weighted delta" function in statistics. Its inputs are the per-entity ratios and the quantities used for the reweighting (e.g. articles ordered per entity). With this input it calculates alpha as described above and outputs the result of statistics.delta(); a sketch follows below.
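A minimal sketch of the reweighting step; statistics.delta itself is not reimplemented here, only the per-entity values that would be fed into it (function and argument names are illustrative):

```python
import numpy as np

def reweighted_ratios(individual_rr, articles_ordered):
    """Reweight per-entity ratios so that their plain mean equals the
    overall ratio (alpha as defined in the formula above)."""
    individual_rr = np.asarray(individual_rr, dtype=float)
    articles_ordered = np.asarray(articles_ordered, dtype=float)
    n = len(individual_rr)
    alpha = n * articles_ordered / articles_ordered.sum()
    return alpha * individual_rr  # mean of this equals overall_rr

# e.g. two customers: 1 of 2 articles returned vs. 3 of 8 returned
vals = reweighted_ratios([0.5, 0.375], [2, 8])
assert np.isclose(vals.mean(), 4.0 / 10.0)  # (1+3)/(2+8)
```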
The current implementation will break if certain changes are made to the Results dataframe structure (such as adding additional indexes). The main cause of the issue is the nested for loops used to traverse the dataframe; they can probably be replaced with a recursive function.
Hi, on my Mac with Python 3.6.0, numpy 1.10.4, scipy 0.17.0, and pystan 2.14.0.0, I've got three failing unit tests:
one:

    def test_bayes_factor(self):
        """
        Check the Bayes factor function.
        """
        stop, delta, CI, n_x, n_y, mu_x, mu_y = es.bayes_factor(self.rand_s1, self.rand_s2)
        self.assertEqual(stop, 1)
        self.assertAlmostEqual(delta, -0.15887364780635896)
>       self.assertAlmostEqual(CI['lower'], -0.24414725578976518)
E       AssertionError: -0.24359237356716665 != -0.24414725578976518 within 7 places

tests/tests_core/test_early_stopping.py:73: AssertionError
two:

    def test_bayes_precision(self):
        """
        Check the bayes_precision function.
        """
        stop, delta, CI, n_x, n_y, mu_x, mu_y = es.bayes_precision(self.rand_s1, self.rand_s2)
        self.assertEqual(stop, 0)
        self.assertAlmostEqual(delta, -0.15887364780635896)
>       self.assertAlmostEqual(CI['lower'], -0.25165623415486293)
E       AssertionError: -0.25058790472284048 != -0.25165623415486293 within 7 places

tests/tests_core/test_early_stopping.py:93: AssertionError
and three:

    def test_bayes_precision_delta(self):
        """
        Check if Experiment.bayes_precision_delta() functions properly
        """
        # this should work
        self.assertTrue(isinstance(self.data, Experiment))  # check that the subclassing works
        self.assertTrue(self.data.baseline_variant == 'B')
        res = Results(None, metadata=self.data.metadata)
        result = self.data.bayes_precision_delta(result=res, kpis_to_analyse=['normal_same'])
        # check uplift
        df = result.statistic('delta', 'uplift', 'normal_same')
        np.testing.assert_almost_equal(df.loc[:, ('value', 'A')],
                                       np.array([0.033053]), decimal=5)
        # check stop
        df = result.statistic('delta', 'stop', 'normal_same')
        np.testing.assert_equal(df.loc[:, 'value'],
>                               np.array([[0, 0]]))

tests/tests_core/test_experiment.py:492:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

x = variant                                              A    B
    metric      subgroup_metric subgroup statistic pctile
    normal_same -               NaN      stop      NaN    1.0  0.0
y = array([[0, 0]]), err_msg = '', verbose = True
E AssertionError:
E Arrays are not equal
E
E (mismatch 50.0%)
E x: array([[ 1., 0.]])
E y: array([[0, 0]])
In the group sequential implementation, a z-test is used instead of a t-test, and the sample standard deviation is used in place of the population standard deviation. If the population standard deviation is unknown, it would be more reasonable to use a t-test instead of a z-test.