
Comments (7)

epogrebnyak commented on July 22, 2024

The proper check for year-average rate accumulation is below. Thanks @zarak for spotting the 'mutation' error!

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""1999-01-31 81.7
1999-02-28 96.9
1999-03-31 106.0
1999-04-30 97.6
1999-05-31 100.2
1999-06-30 100.7
1999-07-31 100.0
1999-08-31 106.5
1999-09-30 100.5
1999-10-31 102.1
1999-11-30 100.5
1999-12-31 116.0
2000-01-31 83.3
2000-02-29 97.9
2000-03-31 105.9
2000-04-30 98.2
2000-05-31 99.6
2000-06-30 101.1
2000-07-31 101.4
2000-08-31 105.6
2000-09-30 100.1
2000-10-31 102.2
2000-11-30 101.4
2000-12-31 115.6"""), sep=' ', header=None, names=['date', 'X_rog'],
index_col='date', converters=dict(date=pd.to_datetime))

df = df / 100                # percent -> monthly growth factor
df = df.cumprod()            # accumulate factors into a level series
z = df.resample('A').sum()   # annual sums of the accumulated levels
rate = (z / z.shift() * 100).round(1).dropna()

assert rate.loc['2000'].iloc[0, 0] == 109.0
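
Why summing is enough here: both years contain exactly 12 monthly observations, so the ratio of annual sums equals the ratio of annual means. An equivalent way to write the same check, using annual means (a variant added for illustration, not part of the original snippet):

za = df.resample('A').mean()
rate2 = (za / za.shift() * 100).round(1).dropna()
assert rate.equals(rate2)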


epogrebnyak commented on July 22, 2024

Moved workfile to /issues/todo_df_check.py


bolivar1997 commented on July 22, 2024

9e9ce85


epogrebnyak commented on July 22, 2024

Some of my dfa yoy values vs the cumprod of dfm rog, for RETAIL_SALES:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""1999-01-31 81.7
1999-02-28 96.9
1999-03-31 106.0
1999-04-30 97.6
1999-05-31 100.2
1999-06-30 100.7
1999-07-31 100.0
1999-08-31 106.5
1999-09-30 100.5
1999-10-31 102.1
1999-11-30 100.5
1999-12-31 116.0
2000-01-31 83.3
2000-02-29 97.9
2000-03-31 105.9
2000-04-30 98.2
2000-05-31 99.6
2000-06-30 101.1
2000-07-31 101.4
2000-08-31 105.6
2000-09-30 100.1
2000-10-31 102.2
2000-11-30 101.4
2000-12-31 115.6"""), sep=' ', header=None, names=['date', 'X_rog'],
index_col='date', converters=dict(date=pd.to_datetime))

df = df / 100
df.cumprod()  # NB: cumprod() returns a new frame; df itself is unchanged -- the 'mutation' error
z = df.resample('A').sum()  # so this sums raw monthly rates, not accumulated levels
rate = z.iloc[1, 0] / z.iloc[0, 0]
# 1.0029784065524945

#dfa.RETAIL_SALES_yoy
#Out[41]: 
#1999-12-31     94.2
#2000-12-31    109.0
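
The root cause, as the corrected snippet above shows, is that cumprod() returns a new frame rather than mutating df in place, so the annual sums here are taken over raw monthly rates. A minimal illustration:

s = pd.Series([2.0, 3.0])
s.cumprod()                  # returns a new Series
print(s.tolist())            # [2.0, 3.0] -- s is unchanged
print(s.cumprod().tolist())  # [2.0, 6.0]
# the fix used in the first comment: df = df.cumprod()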


zarak commented on July 22, 2024

89b76da It may be preferable to use fillna(0) instead of dropna().

Consider the following aggregated dataframe:

In [208]: aggregate_rates_to_annual_average(df1)
Out[208]: 
            INDPRO_rog  INVESTMENT_rog  RETAIL_SALES_FOOD_rog  \
1999-12-31         NaN             NaN                    NaN   
2000-12-31         NaN      117.364097             107.405475   

            RETAIL_SALES_NONFOOD_rog  RETAIL_SALES_rog  WAGE_REAL_rog  
1999-12-31                       NaN               NaN            NaN  
2000-12-31                 110.50661        109.006228     120.181979  

Using dropna() on this returns an empty dataframe because, by default, every row containing a NaN is dropped. We could use dropna(how='all'):

In [212]: aggregate_rates_to_annual_average(df1).dropna(how='all')
Out[212]: 
            INDPRO_rog  INVESTMENT_rog  RETAIL_SALES_FOOD_rog  \
2000-12-31         NaN      117.364097             107.405475   

            RETAIL_SALES_NONFOOD_rog  RETAIL_SALES_rog  WAGE_REAL_rog  
2000-12-31                 110.50661        109.006228     120.181979  

but the remaining NaN value will evaluate to False against the threshold unless it is dropped too.

In [213]: aggregate_rates_to_annual_average(df1).dropna(how='all') < 150
Out[213]: 
           INDPRO_rog INVESTMENT_rog RETAIL_SALES_FOOD_rog  \
2000-12-31      False           True                  True   

           RETAIL_SALES_NONFOOD_rog RETAIL_SALES_rog WAGE_REAL_rog  
2000-12-31                     True             True          True  

Or is evaluating to False here the expected outcome?
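
To make the trade-off concrete, here is a small sketch on a toy frame (made-up values, not the repo's data) comparing the two options:

import numpy as np
import pandas as pd

t = pd.DataFrame({'a': [np.nan, 117.4], 'b': [np.nan, np.nan]},
                 index=pd.to_datetime(['1999-12-31', '2000-12-31']))
print(t.dropna())                 # empty: drops every row that has any NaN
print(t.dropna(how='all'))        # keeps the 2000 row, but 'b' stays NaN
print(t.dropna(how='all') < 150)  # NaN < 150 evaluates to False
print(t.fillna(0) < 150)          # NaN -> 0, which passes an upper-bound check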


epogrebnyak commented on July 22, 2024

Some takeaways for this issue.

We had a task to check incoming dfa, dfq, dfm. This task has the following stages:

  • df-level primitives, like the accum* functions
  • setting up test arguments for a single 'resolution' function
  • getting a result for that setup
  • optionally, calculating coverage

With this we did:
a) write and test the primitives
b) decide what to check and prepare the variables to be checked
c) set up a feed of tests digestible by a resolution function (sketched below)
d) run the feed of checks to get a pass/fail result for the test suite
e) learn the 'coverage' of the checks: what was not tested?
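
A schematic of stages (c) and (d); the checkpoint entries, the dataframes dict keyed by frequency, and the resolve() helper are illustrative names and values, not the repo's actual API:

EPSILON = 0.1  # illustrative tolerance

CHECKPOINTS = [
    # (frequency, variable, date, expected value) -- made-up entries
    ('a', 'RETAIL_SALES_yoy', '2000-12-31', 109.0),
    ('m', 'RETAIL_SALES_rog', '2000-12-31', 115.6),
]

def resolve(dataframes, freq, name, date, expected):
    # single resolution function: does the parsed value match the checkpoint?
    return abs(dataframes[freq].loc[date, name] - expected) < EPSILON

def run_checks(dataframes, checkpoints=CHECKPOINTS):
    results = {c: resolve(dataframes, *c) for c in checkpoints}
    return all(results.values()), results  # pass/fail plus per-check detail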

My lessons learned are:

  • primitives are testable with unit tests and simple input values
  • it is good when you arrive at a common formula for the primitives, like accum(df1) - df2 < epsilon (sketched below)
  • primitives should be clearly separated from the setup
  • sacrifices had to be made: we do not test everything
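
A sketch of such a primitive and the comparison formula; accum and close_enough are illustrative names here, not necessarily the repo's actual accum* functions:

def accum(df_rog):
    # accumulate monthly rates of growth (percent) into year-average yoy rates
    z = (df_rog / 100).cumprod().resample('A').sum()
    return (z / z.shift() * 100).dropna(how='all')

def close_enough(df1, df2, epsilon=0.1):
    # the common formula: abs(accum(df1) - df2) < epsilon, elementwise
    return ((accum(df1) - df2).abs() < epsilon).all().all()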

In #66 df_transform() separating the primitives from the transform job is also important, but setting the varnames was more definite: we know exactly which variables we transform.

@zarak, your comment is welcome!

We will let it rest for a while before integrating it into the parsing validation procedure.

