
Comments (7)

epogrebnyak commented on July 22, 2024

The proper check for year-average rate accumulation is below. Thanks @zarak for spotting the 'mutation' error!

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""1999-01-31 81.7
1999-02-28 96.9
1999-03-31 106.0
1999-04-30 97.6
1999-05-31 100.2
1999-06-30 100.7
1999-07-31 100.0
1999-08-31 106.5
1999-09-30 100.5
1999-10-31 102.1
1999-11-30 100.5
1999-12-31 116.0
2000-01-31 83.3
2000-02-29 97.9
2000-03-31 105.9
2000-04-30 98.2
2000-05-31 99.6
2000-06-30 101.1
2000-07-31 101.4
2000-08-31 105.6
2000-09-30 100.1
2000-10-31 102.2
2000-11-30 101.4
2000-12-31 115.6"""), sep=' ', header=None, names=['date', 'X_rog'],
index_col='date', converters=dict(date=pd.to_datetime))

df = df / 100                # percent -> monthly growth factor
df = df.cumprod()            # accumulate factors into a level series
z = df.resample('A').sum()   # annual sums of the accumulated levels
rate = (z / z.shift() * 100).round(1).dropna()

assert rate.loc['2000'].iloc[0, 0] == 109.0
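
Why summing is enough here: both years contain exactly 12 monthly observations, so the ratio of annual sums equals the ratio of annual means. An equivalent way to write the same check, using annual means (a variant added for illustration, not part of the original snippet):

za = df.resample('A').mean()
rate2 = (za / za.shift() * 100).round(1).dropna()
assert rate.equals(rate2)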


epogrebnyak commented on July 22, 2024

Moved workfile to /issues/todo_df_check.py


bolivar1997 commented on July 22, 2024

9e9ce85


epogrebnyak commented on July 22, 2024

Some of my dfa yoy values vs the cumprod of dfm rog, for RETAIL_SALES:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""1999-01-31 81.7
1999-02-28 96.9
1999-03-31 106.0
1999-04-30 97.6
1999-05-31 100.2
1999-06-30 100.7
1999-07-31 100.0
1999-08-31 106.5
1999-09-30 100.5
1999-10-31 102.1
1999-11-30 100.5
1999-12-31 116.0
2000-01-31 83.3
2000-02-29 97.9
2000-03-31 105.9
2000-04-30 98.2
2000-05-31 99.6
2000-06-30 101.1
2000-07-31 101.4
2000-08-31 105.6
2000-09-30 100.1
2000-10-31 102.2
2000-11-30 101.4
2000-12-31 115.6"""), sep=' ', header=None, names=['date', 'X_rog'],
index_col='date', converters=dict(date=pd.to_datetime))

df = df / 100
df.cumprod()  # NB: cumprod() returns a new frame; df itself is unchanged -- the 'mutation' error
z = df.resample('A').sum()  # so this sums raw monthly rates, not accumulated levels
rate = z.iloc[1, 0] / z.iloc[0, 0]
# 1.0029784065524945

#dfa.RETAIL_SALES_yoy
#Out[41]: 
#1999-12-31     94.2
#2000-12-31    109.0
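
The root cause, as the corrected snippet above shows, is that cumprod() returns a new frame rather than mutating df in place, so the annual sums here are taken over raw monthly rates. A minimal illustration:

s = pd.Series([2.0, 3.0])
s.cumprod()                  # returns a new Series
print(s.tolist())            # [2.0, 3.0] -- s is unchanged
print(s.cumprod().tolist())  # [2.0, 6.0]
# the fix used in the first comment: df = df.cumprod()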


zarak commented on July 22, 2024

89b76da It may be preferable to use fillna(0) instead of dropna().

Consider the following aggregated dataframe:

In [208]: aggregate_rates_to_annual_average(df1)
Out[208]: 
            INDPRO_rog  INVESTMENT_rog  RETAIL_SALES_FOOD_rog  \
1999-12-31         NaN             NaN                    NaN   
2000-12-31         NaN      117.364097             107.405475   

            RETAIL_SALES_NONFOOD_rog  RETAIL_SALES_rog  WAGE_REAL_rog  
1999-12-31                       NaN               NaN            NaN  
2000-12-31                 110.50661        109.006228     120.181979  

Using dropna() on this returns an empty dataframe because, by default, every row containing a NaN is dropped. We could use dropna(how='all'):

In [212]: aggregate_rates_to_annual_average(df1).dropna(how='all')
Out[212]: 
            INDPRO_rog  INVESTMENT_rog  RETAIL_SALES_FOOD_rog  \
2000-12-31         NaN      117.364097             107.405475   

            RETAIL_SALES_NONFOOD_rog  RETAIL_SALES_rog  WAGE_REAL_rog  
2000-12-31                 110.50661        109.006228     120.181979  

but the remaining NaN value will evaluate to False against the threshold unless it is dropped too.

In [213]: aggregate_rates_to_annual_average(df1).dropna(how='all') < 150
Out[213]: 
           INDPRO_rog INVESTMENT_rog RETAIL_SALES_FOOD_rog  \
2000-12-31      False           True                  True   

           RETAIL_SALES_NONFOOD_rog RETAIL_SALES_rog WAGE_REAL_rog  
2000-12-31                     True             True          True  

Or is evaluating to False here the expected outcome?
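
To make the trade-off concrete, here is a small sketch on a toy frame (made-up values, not the repo's data) comparing the two options:

import numpy as np
import pandas as pd

t = pd.DataFrame({'a': [np.nan, 117.4], 'b': [np.nan, np.nan]},
                 index=pd.to_datetime(['1999-12-31', '2000-12-31']))
print(t.dropna())                 # empty: drops every row that has any NaN
print(t.dropna(how='all'))        # keeps the 2000 row, but 'b' stays NaN
print(t.dropna(how='all') < 150)  # NaN < 150 evaluates to False
print(t.fillna(0) < 150)          # NaN -> 0, which passes an upper-bound check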


epogrebnyak commented on July 22, 2024

Some takeaways for this issue.

We had a task to check incoming dfa, dfq, dfm. This task has the following stages:

  • df-level primitives, like the accum* functions
  • setting up test arguments for a single 'resolution' function
  • getting a result for that setup
  • optionally, calculating coverage

With this we did:
a) write and test the primitives
b) decide what to check and prepare the variables to be checked
c) set up a feed of tests digestible by a resolution function (sketched below)
d) run the feed of checks to get a pass/fail result for the test suite
e) learn the 'coverage' of the checks: what was not tested?
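
A schematic of stages (c) and (d); the checkpoint entries, the dataframes dict keyed by frequency, and the resolve() helper are illustrative names and values, not the repo's actual API:

EPSILON = 0.1  # illustrative tolerance

CHECKPOINTS = [
    # (frequency, variable, date, expected value) -- made-up entries
    ('a', 'RETAIL_SALES_yoy', '2000-12-31', 109.0),
    ('m', 'RETAIL_SALES_rog', '2000-12-31', 115.6),
]

def resolve(dataframes, freq, name, date, expected):
    # single resolution function: does the parsed value match the checkpoint?
    return abs(dataframes[freq].loc[date, name] - expected) < EPSILON

def run_checks(dataframes, checkpoints=CHECKPOINTS):
    results = {c: resolve(dataframes, *c) for c in checkpoints}
    return all(results.values()), results  # pass/fail plus per-check detail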

My lessons learned are:

  • primitives are testable with unit tests and simple input values
  • it is good when you arrive at a common formula for the primitives, like accum(df1) - df2 < epsilon (sketched below)
  • primitives should be clearly separated from the setup
  • sacrifices had to be made: we do not test everything
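
A sketch of such a primitive and the comparison formula; accum and close_enough are illustrative names here, not necessarily the repo's actual accum* functions:

def accum(df_rog):
    # accumulate monthly rates of growth (percent) into year-average yoy rates
    z = (df_rog / 100).cumprod().resample('A').sum()
    return (z / z.shift() * 100).dropna(how='all')

def close_enough(df1, df2, epsilon=0.1):
    # the common formula: abs(accum(df1) - df2) < epsilon, elementwise
    return ((accum(df1) - df2).abs() < epsilon).all().all()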

In #66 df_transform() separating the primitives from the transform job is also important, but setting the varnames was more definite: we know exactly which variables we transform.

@zarak, your comment is welcome!

We will let it rest for a while before integrating it into the parsing validation procedure.

