The precon from onsbigdata

Chain function produces incorrect indices if period missing

The chain function does not handle missing periods correctly, but it still produces a result.

import pandas as pd
from pandas import Timestamp
import precon

df_all_periods = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 101.093900),
        (Timestamp('2019-03-01'), 101.726900),
        (Timestamp('2019-04-01'), 100.478600),  # April 2019 value present
        (Timestamp('2019-05-01'), 100.647800),
        (Timestamp('2019-06-01'), 100.439100),
        (Timestamp('2019-07-01'), 102.181900),
        (Timestamp('2019-08-01'), 100.608800),
        (Timestamp('2019-09-01'), 102.067000),
        (Timestamp('2019-10-01'), 102.418300),
        (Timestamp('2019-11-01'), 102.769600),
        (Timestamp('2019-12-01'), 103.120900),
        (Timestamp('2020-01-01'), 103.519414),
        (Timestamp('2020-02-01'), 100.710500),
    ],
    columns=('period', 'index_value'),
).set_index('period')

df_period_missing = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 101.093900),
        (Timestamp('2019-03-01'), 101.726900),
        (Timestamp('2019-04-01'), None),  # April 2019 value missing
        (Timestamp('2019-05-01'), 100.647800),
        (Timestamp('2019-06-01'), 100.439100),
        (Timestamp('2019-07-01'), 102.181900),
        (Timestamp('2019-08-01'), 100.608800),
        (Timestamp('2019-09-01'), 102.067000),
        (Timestamp('2019-10-01'), 102.418300),
        (Timestamp('2019-11-01'), 102.769600),
        (Timestamp('2019-12-01'), 103.120900),
        (Timestamp('2020-01-01'), 103.519414),
        (Timestamp('2020-02-01'), 100.710500),
    ],
    columns=('period', 'index_value'),
).set_index('period')

expected = pd.DataFrame.from_records([
        (Timestamp('2018-01-01'), 100.000000),
        (Timestamp('2018-02-01'), 100.527400),
        (Timestamp('2018-03-01'), 100.894000),
        (Timestamp('2018-04-01'), 100.689100),
        (Timestamp('2018-05-01'), 102.670400),
        (Timestamp('2018-06-01'), 100.811000),
        (Timestamp('2018-07-01'), 102.632500),
        (Timestamp('2018-08-01'), 103.133200),
        (Timestamp('2018-09-01'), 103.111400),
        (Timestamp('2018-10-01'), 103.417700),
        (Timestamp('2018-11-01'), 103.155800),
        (Timestamp('2018-12-01'), 103.616800),
        (Timestamp('2019-01-01'), 104.246480),
        (Timestamp('2019-02-01'), 105.386833),
        (Timestamp('2019-03-01'), 106.046713),
        (Timestamp('2019-04-01'), 104.745404),
        (Timestamp('2019-05-01'), 104.921789),
        (Timestamp('2019-06-01'), 104.704227),
        (Timestamp('2019-07-01'), 106.521034),
        (Timestamp('2019-08-01'), 104.881133),
        (Timestamp('2019-09-01'), 106.401255),
        (Timestamp('2019-10-01'), 106.767473),
        (Timestamp('2019-11-01'), 107.133691),
        (Timestamp('2019-12-01'), 107.499909),
        (Timestamp('2020-01-01'), 107.915346),
        (Timestamp('2020-02-01'), 108.682084),
    ],
    columns=('period', 'index_value'),
).set_index('period')

df_all_periods['chained'] = precon.chain(df_all_periods)

df_period_missing['chained'] = precon.chain(df_period_missing)

pd.concat([df_all_periods, df_period_missing, expected], keys=['all_periods', 'period_missing', 'expected'], axis=1)

In the example above, expected is calculated as if all periods were present, using the formula unlinked index * linked base / 100, so the chained indices after the missing period are not affected. precon.chain doesn't have an issue as it uses a backfill after shifting the indices by one period to fill in the first month.
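
The formula behind the expected column can be illustrated with a short sketch. This is not precon's implementation: chain_sketch is a hypothetical name, and the sketch assumes a monthly series in which each January value becomes the linked base for the periods that follow it.

```python
import numpy as np
import pandas as pd


def chain_sketch(unlinked: pd.Series) -> pd.Series:
    """Chain an unlinked index: unlinked index * linked base / 100.

    Hypothetical helper, assuming a monthly DatetimeIndex where each
    January value becomes the linked base for the periods after it.
    """
    linked_base = 100.0
    chained = []
    for period, value in unlinked.items():
        linked = value * linked_base / 100
        chained.append(linked)
        # Only a non-missing January updates the base, so a gap in any
        # other month leaves the later chained values unaffected.
        if period.month == 1 and pd.notna(linked):
            linked_base = linked
    return pd.Series(chained, index=unlinked.index, name="chained")


# A shortened version of the series above, with April 2019 missing.
unlinked = pd.Series(
    [100.0, 103.6168, 104.24648, 101.0939, np.nan, 100.6478],
    index=pd.to_datetime([
        "2018-01-01", "2018-12-01", "2019-01-01",
        "2019-02-01", "2019-04-01", "2019-05-01",
    ]),
)
chained = chain_sketch(unlinked)
```

With this approach the missing April value stays missing, while May 2019 is still chained onto the January 2019 base.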

Add spaces either side of the colon in all docstrings

This is to abide by the numpy style convention.
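
For reference, a minimal sketch of a numpy-style docstring with a space on either side of the colon between the parameter name and its type. The scale function is made up purely for illustration and is not part of precon.

```python
def scale(values, factor=1.0):
    """Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        The values to scale.
    factor : float, default 1.0
        The multiplier applied to every value.

    Returns
    -------
    list of float
        The scaled values.
    """
    return [value * factor for value in values]
```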

Added extra contributions functionality from faster indicators project

Additional contributions functions were developed for the consumer prices faster indicators project. Pull these into precon.

period_on_period_contributions
contributions_level
contributions_up_hierarchy

Review existing contributions code and add some tests and documentation.

Remove ternary operator on first line of _get_values_to_adjust function

The ternary operator is unnecessary here - the comparison on its own will do, since it already returns True or False.

precon/precon/rounding.py

Line 82 in 46752ea

asc = True if np.sign(no_of_adjustments) == -1 else False
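
A small sketch of the simplification, checking that both forms agree. The is_ascending wrapper is hypothetical and exists only to make the comparison testable; in the real function the result would just be assigned to asc.

```python
import numpy as np


def is_ascending(no_of_adjustments: int) -> bool:
    """Hypothetical wrapper: the comparison already yields a boolean."""
    # bool() converts numpy's boolean type to a plain Python bool.
    return bool(np.sign(no_of_adjustments) == -1)


for n in (-3, 0, 2):
    with_ternary = True if np.sign(n) == -1 else False
    assert is_ascending(n) == with_ternary
```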

Add a fillna(0) to the weights in the aggregation method to stop Zero Division bug

Still totally unsure whether this will solve the issue a user is experiencing, but in some adapted code the lines were:

    zeros_and_nans = indices.isna() | indices.eq(0)
    weights = weights.mask(zeros_and_nans, 0).fillna(0)

Consider implementing it on its own line with a comment explaining why that fill is necessary. Also find out what edge case it solves and write a test for it.

precon/precon/aggregation.py

Line 68 in cf0df3a

weights = weights.mask(indices.isna() | indices.eq(0), 0)
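
To show what the extra fill changes, here is a small sketch on made-up data. It only demonstrates the behaviour of mask and fillna, and makes no claim about whether this resolves the reported bug: the mask zeroes weights where the index is NaN or 0, but a weight that is itself NaN survives the mask and is only zeroed by fillna(0).

```python
import numpy as np
import pandas as pd

# Illustrative data: the last weight is NaN although its index is valid.
indices = pd.Series([100.0, np.nan, 0.0, 110.0])
weights = pd.Series([1.0, 2.0, 3.0, np.nan])

zeros_and_nans = indices.isna() | indices.eq(0)

# Current behaviour: weights are zeroed only where the index is NaN or 0.
masked = weights.mask(zeros_and_nans, 0)

# Proposed behaviour: any remaining NaN weights are also set to 0.
filled = masked.fillna(0)
```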

Change to applying _get_adjustments in round_and_adjust function

Change from the following:

    elif isinstance(obj, pd.core.frame.DataFrame):

        # Create an empty DataFrame to fill with adjustments
        adjustments = pd.DataFrame().reindex_like(obj)

        for index, row in iter_method(obj):
            # Create a selector based on the axis
            slice_ = axis_slice(index, axis)

            adjustments.loc[slice_] = _get_adjustments(row, decimals)

to this:

    elif isinstance(obj, pd.core.frame.DataFrame):

        adjustments = obj.apply(_get_adjustments, args=(decimals,), axis=axis)

This should also allow for the removal of:

    iter_dict = {
        0: pd.DataFrame.iterrows,
        1: pd.DataFrame.iteritems,
    }
    iter_method = iter_dict.get(axis)

Slimming the function right down.

While taking care of this, remember to also do the following:

- Ensure empty line at EOF
- Change the isinstance calls to drop the core.series./core.frame. prefixes (i.e. use pd.Series and pd.DataFrame)
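
A minimal sketch of the apply-based approach. The residuals function is a hypothetical stand-in for _get_adjustments, whose real logic is not reproduced here. Note that args must be a tuple, hence the trailing comma in (decimals,), and that the meaning of axis in apply should be checked against the axis convention of the existing loop.

```python
import pandas as pd


def residuals(row: pd.Series, decimals: int) -> pd.Series:
    """Hypothetical stand-in for _get_adjustments: rounding error per value."""
    return row.round(decimals) - row


df = pd.DataFrame({"a": [1.04, 2.06], "b": [3.14, 4.16]})
decimals = 1

# pd.DataFrame is the same class as pd.core.frame.DataFrame.
if isinstance(df, pd.DataFrame):
    # axis=1 passes each row to the function; the result keeps df's shape.
    adjustments = df.apply(residuals, args=(decimals,), axis=1)
```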

Add base_prices use_first option which resamples annually picking first.

Similar to Matt's implementation here:

    in_year_base = indices.resample('AS').first()

    # Align base indices to full time series values
    in_year_base = (
        in_year_base
        .reindex_like(indices, method='ffill')
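
A self-contained sketch of the same idea on made-up monthly data. It is an illustration under assumptions rather than the referenced implementation: it uses the 'YS' year-start alias (newer pandas versions prefer it over 'AS') and reindex with the original index in place of reindex_like.

```python
import pandas as pd

# Illustrative monthly index values from Jan 2018 to Mar 2019.
periods = pd.date_range("2018-01-01", periods=15, freq="MS")
indices = pd.Series([100.0 + i for i in range(15)], index=periods)

# Take the first value in each year...
in_year_base = indices.resample("YS").first()

# ...and align it to the full time series by filling forward.
in_year_base = in_year_base.reindex(indices.index, method="ffill")
```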

Modify get_base_prices to only fill within year

This might need some generalisation later on, but replace what is there for now. Maybe this function can move too, to index_methods? Move index_calculator there too?

precon/precon/imputation.py

Lines 143 to 152 in ea185fa

    
    def get_base_prices(
        prices: pd.DataFrame,
        base_period: int = 1,
        axis: pd._typing.Axis = 0,
        ffill: bool = True,
    ) -> pd.DataFrame:
        """Returns the prices at the base month in the same shape as prices.

        Default behaviour is to fill forward values, but can be changed to
        return NaN where not base_month by setting ffill=False.
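
One possible way to fill only within a year is to group by year before filling forward. The sketch below uses made-up data and assumes a DatetimeIndex on axis 0; it is not the actual get_base_prices code. When the 2019 base month is missing, a plain ffill carries the 2018 base price into 2019, whereas the grouped fill leaves 2019 as NaN.

```python
import numpy as np
import pandas as pd

periods = pd.date_range("2018-01-01", periods=15, freq="MS")
prices = pd.Series([10.0 + i for i in range(15)], index=periods)
prices.loc[pd.Timestamp("2019-01-01")] = np.nan  # 2019 base month missing

# Keep prices only at the base month (January here).
months = pd.Series(periods.month, index=periods)
base_prices = prices.where(months.eq(1))

# Plain forward fill crosses the year boundary.
filled_across_years = base_prices.ffill()

# Grouping by year restricts the forward fill to each year.
filled_within_year = base_prices.groupby(base_prices.index.year).ffill()
```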

Add .fillna(base_prices) in index_calculator function to cover NaNs from shift

precon/precon/pipelines.py

Line 80 in cf0df3a

base_prices = base_prices.shift(1, axis=axis)

Add a .fillna(base_prices) method to cover the NaNs created by the shift.

Be mindful that this is changing in impute_base_prices too, but it's covered there already with the .fillna(start_prices).
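
A short sketch of the effect on made-up data: the shift leaves NaNs in the first period, and filling from the unshifted frame restores those values.

```python
import pandas as pd

# Illustrative base prices for two items over three months.
base_prices = pd.DataFrame(
    {"a": [10.0, 11.0, 12.0], "b": [20.0, 21.0, 22.0]},
    index=pd.date_range("2019-01-01", periods=3, freq="MS"),
)
axis = 0

# Shifting by one period leaves the first row as NaN.
shifted = base_prices.shift(1, axis=axis)

# Fill those NaNs from the unshifted base prices.
filled = shifted.fillna(base_prices)
```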

dropna in jan_adjustment will remove all values in row

precon/precon/adjustments.py

Line 24 in 4e441a7

adjusted = adjusted.dropna()

The above line in the function means if passing in a dataframe with the following format

date	col1	col2
2019-01-01	101	NaN
...	...	...
2019-05-01	104	NaN
2019-06-01	103	100
...	...	...
2020-01-01	101	102

(i.e. the col2 time series starts later than col1), then jan_adjustment will drop the entire row for 2019-01-01.

Not sure on the correct behaviour, but anecdotally removing the dropna seems to work well.
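
The behaviour can be reproduced with a small sketch based on the table above. By default dropna removes every row containing at least one NaN. The how="all" variant is shown only as one possible alternative for comparison, not as the agreed fix.

```python
import numpy as np
import pandas as pd

# A cut-down version of the table above: col2 starts later than col1.
adjusted = pd.DataFrame(
    {
        "col1": [101.0, 104.0, 103.0, 101.0],
        "col2": [np.nan, np.nan, 100.0, 102.0],
    },
    index=pd.to_datetime(
        ["2019-01-01", "2019-05-01", "2019-06-01", "2020-01-01"]
    ),
)

# Default: drops any row with at least one NaN, losing col1's early values.
dropped_any = adjusted.dropna()

# Alternative for comparison: drop only rows where every value is NaN.
dropped_all = adjusted.dropna(how="all")
```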

Simple base price getting
Base price imputation
Already passed base prices

Consider a sensible way of implementing this - might need some tests first!

Add random index generator for test data

Create a generator to create random index data in a reproducible way.

Support the generation of hierarchical structure of indices.

Add pre-commit hooks for devs

I want to add some pre-commit hooks for developers.

Remove whitespace
Flake 8 linting
Check commit msg subject len
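
As an illustration of the last check, a commit message subject length test could be as simple as the sketch below. The function name and the 72-character limit are assumptions, not an agreed convention for this project, and the wiring into a hook is left out.

```python
def subject_too_long(commit_message: str, max_length: int = 72) -> bool:
    """Return True if the first line of the message exceeds max_length.

    Hypothetical helper; the default limit of 72 is an assumption.
    """
    lines = commit_message.splitlines()
    subject = lines[0] if lines else ""
    return len(subject) > max_length
```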

Bug: change to isin

There's a bug here, since base_period is a list rather than a single int. Change to isin() method.

precon/precon/imputation.py

Line 191 in 4e441a7

base_prices = prices.where(months.eq(base_period))
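
A small sketch of the proposed change on made-up data, assuming months is a Series of month numbers aligned with prices. With isin, base_period can be a list of months.

```python
import pandas as pd

periods = pd.date_range("2019-01-01", periods=12, freq="MS")
prices = pd.Series([float(i) for i in range(1, 13)], index=periods)

# Month number for each period, aligned with prices.
months = pd.Series(periods.month, index=periods)

# base_period as a list: keep prices in January and July only.
base_period = [1, 7]
base_prices = prices.where(months.isin(base_period))
```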

Add base_prices * adjustments if adjustments passed

precon/precon/pipelines.py

Line 77 in 823c0c1

base_prices = get_base_prices(prices, base_period, axis)

Add this after:

    if adjustments is not None:         
        base_prices = base_prices * adjustments

And update docstring.
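
In isolation the change behaves as in the sketch below. The adjust_base_prices wrapper is hypothetical and only stands in for the relevant lines of the pipeline function.

```python
from typing import Optional

import pandas as pd


def adjust_base_prices(
    base_prices: pd.DataFrame,
    adjustments: Optional[pd.DataFrame] = None,
) -> pd.DataFrame:
    """Hypothetical wrapper: scale base prices only if adjustments given."""
    if adjustments is not None:
        base_prices = base_prices * adjustments
    return base_prices


base = pd.DataFrame({"a": [10.0, 20.0]})
factors = pd.DataFrame({"a": [1.0, 1.5]})

unchanged = adjust_base_prices(base)
adjusted = adjust_base_prices(base, factors)
```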

Documentation

It would be useful to be able to view the docs for this project.

Currently, I think, you have to clone and build them yourself?

A solution would be to use GitHub pages to serve the docs as this works well with sphinx.
