
parsers's People

Contributors

andres-unt, bhumkong, epogrebnyak, eskarimov, jarovojtek, kkravchuk, muroslav2909, mwangikinuthia, perevedko, rub4ek


parsers's Issues

Evaluate scheduler options

Regarding the scheduler, there are two options: https://devcenter.heroku.com/articles/scheduler and https://devcenter.heroku.com/articles/clock-processes-python

  1. Plain Heroku Scheduler. Jobs run strictly every 10 minutes, every hour, or every day. In essence it is a heroku run ... command, in our case heroku run python <parser_name>. The question is whether to put all scheduler launch commands into one file, or to spread them across files and create as many scheduler entries.

  2. APScheduler. A clock-type process is created on Heroku that runs a file with the schedule describing when each parser should be triggered. The schedule is defined with the APScheduler Python library. The advantage is that the schedule can be configured more flexibly.
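A minimal sketch of what the clock process (option 2) could look like; the job bodies are placeholders, not the actual parser calls:

# clock.py - sketch of the APScheduler / clock-process option
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', hours=1)
def run_cbr_usd():
    # placeholder: run the CBR_USD parser and upload its datapoints
    print('running cbr-usd parser')

@sched.scheduled_job('cron', day=1, hour=6)
def run_rosstat_kep():
    # placeholder: monthly KEP refresh
    print('running rosstat-kep parser')

sched.start()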

new CBR_USD parser class

  • make a CBR_USD(Parser) class in definitions.py (sketched below)
  • add some new attributes to CBR_USD and RosstatKEP
  • (optionally) add mock output to CBR_USD in get_data()
  • generate two md tables for CBR_USD and RosstatKEP

Time: 1-2h.
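A minimal sketch of the class, assuming the Parser base class and the attribute/mock-output conventions used by RosstatKEP in definitions.py:

class CBR_USD(Parser):
    name = 'cbr-usd'
    does_what = 'Parse the official USD/RUB exchange rate from the CBR web service'
    freqs = 'd'
    all_varnames = ['USDRUR_CB']
    start_date = make_date('1992-01-01')
    source_url = 'http://www.cbr.ru/scripts/Root.asp?PrtId=SXML'

    def get_data(self):
        # mock output with the same fields the tests expect
        yield {'date': '2017-09-15', 'freq': 'd',
               'name': 'USDRUR_CB', 'value': 57.7}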

refactor inherited classes

# TODO: use parts of the code below if needed for validate_datapoint()
import datetime

import arrow

# assumed import path for the parser classes, by analogy with
# "from parsers.runner import CBR_USD" used elsewhere in this repo
from parsers.runner import RosstatKEP_Monthly, CBR_USD, BrentEIA


class Base_Test_Parser:

    def setup_method(self):
        # must be overloaded in subclasses
        self.parser = None
        self.items = None

    def test_get_data_members_are_length_4(self):
        for item in self.items:
            assert len(item) == 4

    def test_get_produces_data_of_correct_types(self):
        for item in self.items:
            assert isinstance(item['date'], str)
            assert isinstance(item['freq'], str)
            assert isinstance(item['name'], str)
            assert isinstance(item['value'], float)

    def test_get_data_item_date_in_valid_format(self):
        dates = (item['date'] for item in self.items)
        for date in dates:
            assert arrow.get(date)

    def test_get_data_item_date_in_valid_range(self):
        dates = (item['date'] for item in self.items)
        for date in dates:
            date = arrow.get(date).date()
            # EP: splitting to see what exactly fails
            assert self.parser.start <= date
            assert date <= datetime.date.today()

    # valid code and a good idea to check, but the implementation is too complex
    # for a base test class
    def test_get_data_produces_values_in_valid_range(self, items, min, max):
        for item in items:
            assert min < item['value'] < max

# ------------------------ end of datapoint validation


class TestRosstatKep(Base_Test_Parser):

    def setup_method(self):
        self.parser = RosstatKEP_Monthly()
        self.items = list(self.parser.sample())

    def test_get_data_produces_values_in_valid_range(self):
        cpi_data = [item for item in self.items
                    if item['name'] == 'CPI_rog']
        eur_data = [item for item in self.items
                    if item['name'] == 'RUR_EUR_eop']
        super(TestRosstatKep, self)\
            .test_get_data_produces_values_in_valid_range(cpi_data, 90, 110)
        super(TestRosstatKep, self)\
            .test_get_data_produces_values_in_valid_range(eur_data, 50, 80)

    def test_start_date_is_correct(self):
        assert self.parser.start == arrow.get('1999-01-31').date()

    def test_source_url_is_correct(self):
        assert self.parser.reference.get('source_url') == \
            ("http://www.gks.ru/wps/wcm/connect/"
             "rosstat_main/rosstat/ru/statistics/"
             "publications/catalog/doc_1140080765391")


class TestCBR_USD(Base_Test_Parser):

    def setup_method(self):
        self.parser = CBR_USD()
        self.items = list(self.parser.sample())

    def test_get_data_produces_values_in_valid_range(self):
        super(TestCBR_USD, self)\
            .test_get_data_produces_values_in_valid_range(self.items, 50, 70)

    def test_start_date_is_correct(self):
        assert self.parser.start == datetime.date(1992, 1, 1)

    def test_source_url_is_correct(self):
        assert self.parser.reference.get('source_url') == \
            ("http://www.cbr.ru/"
             "scripts/Root.asp?PrtId=SXML")

    def test_all_varnames_are_correct(self):
        assert self.parser.reference.get('varnames') == ['USDRUR_CB']


class TestBrentEIA(Base_Test_Parser):

    def setup_method(self):
        self.parser = BrentEIA()
        self.items = list(self.parser.sample())

    def test_get_data_produces_values_in_valid_range(self):
        super(TestBrentEIA, self)\
            .test_get_data_produces_values_in_valid_range(self.items, 20, 120)

    def test_start_date_is_correct(self):
        assert self.parser.start == datetime.date(1987, 5, 15)

    def test_source_url_is_correct(self):
        assert self.parser.reference.get('source_url') == \
            ("https://www.eia.gov/opendata/"
             "qb.php?category=241335")

    def test_all_varnames_are_correct(self):
        assert self.parser.reference.get('varnames') == ['BRENT']

The result of such refactoring is usually a set of parametrised tests.
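For illustration, the per-parser value-range checks above could collapse into a single parametrised test (a sketch, reusing the parser classes and bounds from the code above):

import pytest

@pytest.mark.parametrize('parser_cls, varname, lo, hi', [
    (RosstatKEP_Monthly, 'CPI_rog', 90, 110),
    (RosstatKEP_Monthly, 'RUR_EUR_eop', 50, 80),
    (CBR_USD, 'USDRUR_CB', 50, 70),
    (BrentEIA, 'BRENT', 20, 120),
])
def test_values_in_valid_range(parser_cls, varname, lo, hi):
    items = [x for x in parser_cls().sample() if x['name'] == varname]
    assert all(lo < x['value'] < hi for x in items)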

test fails locally, but not on Travis

================================== FAILURES ===================================
__________________ Test_make_date.test_on_none_returns_today __________________

self = <parsers.tests.test_helpers.Test_make_date object at 0x0000016A99BDA128>

    def test_on_none_returns_today(self):
>       assert DateHelper.make_date(None) == datetime.date.today()
E       AssertionError: assert datetime.date(2017, 9, 29) == datetime.date(2017, 9, 30)
E        +  where datetime.date(2017, 9, 29) = <function DateHelper.make_date at 0x0000016A9881ABF8>(None)
E        +    where <function DateHelper.make_date at 0x0000016A9881ABF8> = DateHelper.make_date
E        +  and   datetime.date(2017, 9, 30) = <built-in method today of type object at 0x00000000609F6720>()
E        +    where <built-in method today of type object at 0x00000000609F6720> = <class 'datetime.date'>.today
E        +      where <class 'datetime.date'> = datetime.date

tests\test_helpers.py:33: AssertionError
===================== 1 failed, 44 passed in 4.11 seconds =====================
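One plausible explanation (an assumption, not confirmed by the source): if DateHelper.make_date(None) falls back to a UTC "today" (e.g. via arrow), it can disagree with the local datetime.date.today() used in the test, depending on the machine's timezone and the time of day:

import datetime
import arrow

# these two values are not guaranteed to be equal on a machine whose local
# timezone is ahead of or behind UTC, which would reproduce the failure above
print(arrow.utcnow().date())
print(datetime.date.today())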

upload fails on large datasets

from parsers.runner import CBR_USD

# this will pass
assert CBR_USD('2017-01-01').upload()

# this will fail
assert CBR_USD().upload()
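One possible mitigation to explore (a sketch only; upload_datapoints is a hypothetical helper, the real uploader API is not shown here): send the full history in smaller chunks rather than in one request:

def upload_in_chunks(datapoints, chunk_size=1000):
    # split the full history into fixed-size batches and upload each one
    for i in range(0, len(datapoints), chunk_size):
        if not upload_datapoints(datapoints[i:i + chunk_size]):
            return False
    return True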

Change folder structure

  • parsers
  • parsers.runner.Parsers (collection)
  • parsers.getter.kep/brent/cbr_fx
  • use dependency injection (DI) in getters
  • markdown()

feature checklist / parser summary as json

There is a format to deliver information about parser development status for kep and cbr-usd.

I want to make a configuration dictionary with these parameters, like

desc = dict(name='rosstat-kep',
            freq='m',
            text='Parse sections of '
                 'Short-term Economic Indicators (KEP) '
                 'monthly Rosstat publication')

The converter function should map this dictionary to a markdown string.
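A minimal sketch of such a converter (the exact table layout is an assumption):

def as_markdown(desc):
    """Render a parser description dictionary as a two-column markdown table."""
    lines = ['| Parameter | Value |', '| --- | --- |']
    lines += ['| {} | {} |'.format(key, value) for key, value in desc.items()]
    return '\n'.join(lines)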

chain years in 'ust' parser

# IDEA: maybe add a validation decorator for all getter functions?
def get_ust_dict(start_date, downloader=util.fetch):
    """Return UST datapoints as a list of dictionaries, based on *start_date*."""
    year = make_year(start_date)
    url = make_url(year)
    content = downloader(url)
    return parse_xml(content)
# ERROR: given a start_date, this loads just one year, not all years from that date on
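A sketch of the fix, assuming make_year/make_url/parse_xml behave as above and that the current year should be included:

import datetime

def get_ust_dict(start_date, downloader=util.fetch):
    """Return UST datapoints for all years from *start_date* up to today."""
    result = []
    for year in range(make_year(start_date), datetime.date.today().year + 1):
        content = downloader(make_url(year))
        result.extend(parse_xml(content))
    return result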

make upload scenarios

Upload scenarios cover the batch of jobs the uploader should run. So far they are (see the sketch after this list):

  • kep (monthly)
  • ust, brent, cbr_fx (daily)
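A hypothetical sketch of those two scenarios, assuming each runner class exposes .upload() as in the CBR_USD example above (the daily class names other than CBR_USD are placeholders):

def monthly_scenario(start_date=None):
    # kep (monthly)
    assert RosstatKEP_Monthly(start_date).upload()

def daily_scenario(start_date=None):
    # ust, brent, cbr_fx (daily); UST and BrentEIA stand in for the real runner names
    for parser_cls in (UST, BrentEIA, CBR_USD):
        assert parser_cls(start_date).upload()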

ust: April 14 has 0 values

<entry><id>http://data.treasury.gov/Feed.svc/DailyTreasuryYieldCurveRateData(6832)</id><title type="text"/><updated>2017-11-08T21:11:43Z</updated><author><name/></author><link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6832)"/><category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"/><content type="application/xml"><m:properties><d:Id m:type="Edm.Int32">6832</d:Id><d:NEW_DATE m:type="Edm.DateTime">2017-04-14T00:00:00</d:NEW_DATE><d:BC_1MONTH m:type="Edm.Double">0</d:BC_1MONTH><d:BC_3MONTH m:type="Edm.Double">0</d:BC_3MONTH><d:BC_6MONTH m:type="Edm.Double">0</d:BC_6MONTH><d:BC_1YEAR m:type="Edm.Double">0</d:BC_1YEAR><d:BC_2YEAR m:type="Edm.Double">0</d:BC_2YEAR><d:BC_3YEAR m:type="Edm.Double">0</d:BC_3YEAR><d:BC_5YEAR m:type="Edm.Double">0</d:BC_5YEAR><d:BC_7YEAR m:type="Edm.Double">0</d:BC_7YEAR><d:BC_10YEAR m:type="Edm.Double">0</d:BC_10YEAR><d:BC_20YEAR m:type="Edm.Double">0</d:BC_20YEAR><d:BC_30YEAR m:type="Edm.Double">0</d:BC_30YEAR><d:BC_30YEARDISPLAY m:type="Edm.Double">0</d:BC_30YEARDISPLAY></m:properties></content></entry>

This day should be treated as empty: every maturity is reported as 0, and these zeros are not real observations.
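One possible guard when parsing the feed (a sketch; it assumes the parser yields one dictionary per maturity with a numeric 'value'):

def drop_zero_days(datapoints):
    # treat all-zero Treasury yields (as on 2017-04-14) as missing observations
    return [d for d in datapoints if d['value'] != 0]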

consider merging two layers of classes

Currently there is a class that handles input parameters and upload in runner.py, plus getter functions/classes
in the getters submodule. Probably they can be merged into one class.

Can add thin wrappers for string date handling + uploader functionality.

#TODO: merge ParserBase and getter.ust.Getter classes
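A rough sketch of what the merged class could look like, assuming the existing DateHelper and get_ust_dict; the upload helper name is a placeholder:

class UST(ParserBase):
    """Input-parameter handling, getter call and upload in one class."""

    def __init__(self, start_date=None):
        self.start_date = DateHelper.make_date(start_date)

    def get_data(self):
        return get_ust_dict(self.start_date)

    def upload(self):
        return upload_datapoints(list(self.get_data()))  # placeholder helper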

error reading XML with m:null="true"

<entry xmlns="http://www.w3.org/2005/Atom">
<id>http://data.treasury.gov/Feed.svc/DailyTreasuryYieldCurveRateData(3458)</id>
<title type="text"></title><updated>2017-11-07T14:01:54Z</updated>
<author><name /></author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(3458)" />
<category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" /><content type="application/xml"><m:properties xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"><d:Id m:type="Edm.Int32" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">3458</d:Id><d:NEW_DATE m:type="Edm.DateTime" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">2010-10-11T00:00:00</d:NEW_DATE><d:BC_1MONTH m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_3MONTH m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_6MONTH m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_1YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_2YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_3YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_5YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_7YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_10YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_20YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_30YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_30YEARDISPLAY m:type="Edm.Double" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">0</d:BC_30YEARDISPLAY></m:properties></content></entry>
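A sketch of tolerant reading of such a record with the standard library; the element names follow the feed above, but the project's actual parse_xml implementation is not shown here:

import xml.etree.ElementTree as ET

D = '{http://schemas.microsoft.com/ado/2007/08/dataservices}'
M = '{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}'

def read_rate(properties_element, tag):
    """Return the yield for *tag* (e.g. 'BC_1MONTH') or None if m:null="true"."""
    elem = properties_element.find(D + tag)
    if elem is None or elem.get(M + 'null') == 'true' or not elem.text:
        return None
    return float(elem.text)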

what information should we collect about the parsers?

Following issue #1, we have hand-made summary tables in markdown with parser information.

In definitions.py we have a way to keep parser information in a class and show it to the user as markdown. Some of the summary information about the parser is used as parameters for its invocation.

This issue will result in defining what information about the parsers is needed to run them and to present their summary to the user.

Current attributes are (from here):

class RosstatKEP(Parser):
    name = 'rosstat-kep'
    does_what = 'Parse sections of KEP Rosstat publication'
    freqs = 'aqm'    
    all_varnames = ['CPI_rog', 'RUR_EUR_eop']
    start_date = make_date('1999-01-31')

The issue should result in new parser summaries for all parsers listed here.

The information we specify for each parser class is used to a) bound the call to the parser (e.g. available variables, frequencies, date limits) and b) make the description more understandable.
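A hypothetical illustration of (a), using the class attributes to validate a call before running the parser:

def validate_call(parser_cls, freq, varnames, start_date):
    # reject requests that fall outside what the parser declares it can serve
    assert freq in parser_cls.freqs
    assert all(name in parser_cls.all_varnames for name in varnames)
    assert start_date >= parser_cls.start_date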

housekeeping

Finish the tests marked with FIXME.

Housekeeping:

  • add ust to runner.py
  • update readme.md with the newer tables and edit the markdown in the parsers list
  • #17?

Bigger issues:

  • runner.py: for a given date, must provide a dataset to write to the database
  • new parser: parse rosstat-isep from Word or PDF

need to cover getter.* modules with tests

There are no tests for:

  • brent.py
  • cbr_fx.py
  • kep.py

Ideally, tests should stay simple/readable and address real, not imaginary, risks in the program.
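For example, the downloader injection already used in the ust getter makes such tests straightforward (a sketch; SAMPLE_XML stands for a canned fixture):

def test_get_ust_dict_uses_injected_downloader():
    fake_downloader = lambda url: SAMPLE_XML  # no network access in the test
    items = get_ust_dict('2017-01-01', downloader=fake_downloader)
    assert isinstance(items, list)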

Also, it is best to keep in mind what kinds of tests there are:

  1. unit tests that control functionality and program structure
  2. tests that validate functionality results
  3. integration/end-to-end tests
  4. regression tests for bug fixes

Is everyone comfortable with these ideas/definitions, @Andres-Unt, @muroslav2909, @JaroVojtek?
