
parsers's People

Contributors

andres-unt, bhumkong, epogrebnyak, eskarimov, jarovojtek, kkravchuk, muroslav2909, mwangikinuthia, perevedko, rub4ek


parsers's Issues

Evaluate scheduler options

Regarding the scheduler, there are two options: https://devcenter.heroku.com/articles/scheduler and https://devcenter.heroku.com/articles/clock-processes-python

  1. Plain Heroku Scheduler. Jobs run strictly every 10 minutes, every hour, or every day. In essence it is a heroku run ... command, in our case heroku run python <parser_name>. The question is whether to put all scheduler launch commands into one file, or to spread them across files and create as many scheduler entries.

  2. APScheduler. A clock-type process is created on Heroku that runs a file with the schedule describing when each parser should be triggered. The schedule is defined with the APScheduler Python library. The advantage is that the schedule can be configured more flexibly.
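A minimal sketch of what the clock process (option 2) could look like; the job bodies are placeholders, not the actual parser calls:

# clock.py - sketch of the APScheduler / clock-process option
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', hours=1)
def run_cbr_usd():
    # placeholder: run the CBR_USD parser and upload its datapoints
    print('running cbr-usd parser')

@sched.scheduled_job('cron', day=1, hour=6)
def run_rosstat_kep():
    # placeholder: monthly KEP refresh
    print('running rosstat-kep parser')

sched.start()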

new CBR_USD parser class

  • make a CBR_USD(Parser) class in definitions.py (sketched below)
  • add some new attributes to CBR_USD and RosstatKEP
  • (optionally) add mock output to CBR_USD in get_data()
  • generate two md tables for CBR_USD and RosstatKEP

Time: 1-2h.
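A minimal sketch of the class, assuming the Parser base class and the attribute/mock-output conventions used by RosstatKEP in definitions.py:

class CBR_USD(Parser):
    name = 'cbr-usd'
    does_what = 'Parse the official USD/RUB exchange rate from the CBR web service'
    freqs = 'd'
    all_varnames = ['USDRUR_CB']
    start_date = make_date('1992-01-01')
    source_url = 'http://www.cbr.ru/scripts/Root.asp?PrtId=SXML'

    def get_data(self):
        # mock output with the same fields the tests expect
        yield {'date': '2017-09-15', 'freq': 'd',
               'name': 'USDRUR_CB', 'value': 57.7}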

refactor inherited classes

# TODO: use parts of the code below if needed for validate_datapoint()
import datetime

import arrow

# assumed import path for the parser classes, by analogy with
# "from parsers.runner import CBR_USD" used elsewhere in this repo
from parsers.runner import RosstatKEP_Monthly, CBR_USD, BrentEIA


class Base_Test_Parser:

    def setup_method(self):
        # must be overloaded in subclasses
        self.parser = None
        self.items = None

    def test_get_data_members_are_length_4(self):
        for item in self.items:
            assert len(item) == 4

    def test_get_produces_data_of_correct_types(self):
        for item in self.items:
            assert isinstance(item['date'], str)
            assert isinstance(item['freq'], str)
            assert isinstance(item['name'], str)
            assert isinstance(item['value'], float)

    def test_get_data_item_date_in_valid_format(self):
        dates = (item['date'] for item in self.items)
        for date in dates:
            assert arrow.get(date)

    def test_get_data_item_date_in_valid_range(self):
        dates = (item['date'] for item in self.items)
        for date in dates:
            date = arrow.get(date).date()
            # EP: splitting to see what exactly fails
            assert self.parser.start <= date
            assert date <= datetime.date.today()

    # valid code and a good idea to check, but the implementation is too complex
    # for a base test class
    def test_get_data_produces_values_in_valid_range(self, items, min, max):
        for item in items:
            assert min < item['value'] < max

# ------------------------ end of datapoint validation


class TestRosstatKep(Base_Test_Parser):

    def setup_method(self):
        self.parser = RosstatKEP_Monthly()
        self.items = list(self.parser.sample())

    def test_get_data_produces_values_in_valid_range(self):
        cpi_data = [item for item in self.items
                    if item['name'] == 'CPI_rog']
        eur_data = [item for item in self.items
                    if item['name'] == 'RUR_EUR_eop']
        super(TestRosstatKep, self)\
            .test_get_data_produces_values_in_valid_range(cpi_data, 90, 110)
        super(TestRosstatKep, self)\
            .test_get_data_produces_values_in_valid_range(eur_data, 50, 80)

    def test_start_date_is_correct(self):
        assert self.parser.start == arrow.get('1999-01-31').date()

    def test_source_url_is_correct(self):
        assert self.parser.reference.get('source_url') == \
            ("http://www.gks.ru/wps/wcm/connect/"
             "rosstat_main/rosstat/ru/statistics/"
             "publications/catalog/doc_1140080765391")


class TestCBR_USD(Base_Test_Parser):

    def setup_method(self):
        self.parser = CBR_USD()
        self.items = list(self.parser.sample())

    def test_get_data_produces_values_in_valid_range(self):
        super(TestCBR_USD, self)\
            .test_get_data_produces_values_in_valid_range(self.items, 50, 70)

    def test_start_date_is_correct(self):
        assert self.parser.start == datetime.date(1992, 1, 1)

    def test_source_url_is_correct(self):
        assert self.parser.reference.get('source_url') == \
            ("http://www.cbr.ru/"
             "scripts/Root.asp?PrtId=SXML")

    def test_all_varnames_are_correct(self):
        assert self.parser.reference.get('varnames') == ['USDRUR_CB']


class TestBrentEIA(Base_Test_Parser):

    def setup_method(self):
        self.parser = BrentEIA()
        self.items = list(self.parser.sample())

    def test_get_data_produces_values_in_valid_range(self):
        super(TestBrentEIA, self)\
            .test_get_data_produces_values_in_valid_range(self.items, 20, 120)

    def test_start_date_is_correct(self):
        assert self.parser.start == datetime.date(1987, 5, 15)

    def test_source_url_is_correct(self):
        assert self.parser.reference.get('source_url') == \
            ("https://www.eia.gov/opendata/"
             "qb.php?category=241335")

    def test_all_varnames_are_correct(self):
        assert self.parser.reference.get('varnames') == ['BRENT']

The result of such refactoring is usually a set of parametrised tests.
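For illustration, the per-parser value-range checks above could collapse into a single parametrised test (a sketch, reusing the parser classes and bounds from the code above):

import pytest

@pytest.mark.parametrize('parser_cls, varname, lo, hi', [
    (RosstatKEP_Monthly, 'CPI_rog', 90, 110),
    (RosstatKEP_Monthly, 'RUR_EUR_eop', 50, 80),
    (CBR_USD, 'USDRUR_CB', 50, 70),
    (BrentEIA, 'BRENT', 20, 120),
])
def test_values_in_valid_range(parser_cls, varname, lo, hi):
    items = [x for x in parser_cls().sample() if x['name'] == varname]
    assert all(lo < x['value'] < hi for x in items)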

test fails locally, but not on Travis

================================== FAILURES ===================================
__________________ Test_make_date.test_on_none_returns_today __________________

self = <parsers.tests.test_helpers.Test_make_date object at 0x0000016A99BDA128>

    def test_on_none_returns_today(self):
>       assert DateHelper.make_date(None) == datetime.date.today()
E       AssertionError: assert datetime.date(2017, 9, 29) == datetime.date(2017, 9, 30)
E        +  where datetime.date(2017, 9, 29) = <function DateHelper.make_date at 0x0000016A9881ABF8>(None)
E        +    where <function DateHelper.make_date at 0x0000016A9881ABF8> = DateHelper.make_date
E        +  and   datetime.date(2017, 9, 30) = <built-in method today of type object at 0x00000000609F6720>()
E        +    where <built-in method today of type object at 0x00000000609F6720> = <class 'datetime.date'>.today
E        +      where <class 'datetime.date'> = datetime.date

tests\test_helpers.py:33: AssertionError
===================== 1 failed, 44 passed in 4.11 seconds =====================
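One plausible explanation (an assumption, not confirmed by the source): if DateHelper.make_date(None) falls back to a UTC "today" (e.g. via arrow), it can disagree with the local datetime.date.today() used in the test, depending on the machine's timezone and the time of day:

import datetime
import arrow

# these two values are not guaranteed to be equal on a machine whose local
# timezone is ahead of or behind UTC, which would reproduce the failure above
print(arrow.utcnow().date())
print(datetime.date.today())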

upload fails on large datasets

from parsers.runner import CBR_USD

# this will pass
assert CBR_USD('2017-01-01').upload()

# this will fail
assert CBR_USD().upload()
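One possible mitigation to explore (a sketch only; upload_datapoints is a hypothetical helper, the real uploader API is not shown here): send the full history in smaller chunks rather than in one request:

def upload_in_chunks(datapoints, chunk_size=1000):
    # split the full history into fixed-size batches and upload each one
    for i in range(0, len(datapoints), chunk_size):
        if not upload_datapoints(datapoints[i:i + chunk_size]):
            return False
    return True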

Change folder structure

  • parsers
  • parsers.runner.Parsers (collection)
  • parsers.getter.kep/brent/cbr_fx
  • use dependency injection (DI) in getters
  • markdown()

feature checklist / parser summary as json

There is a format to deliver information about parser development status for kep and cbr-usd.

I want to make a configuration dictionary with these parameters, like

desc = dict(name='rosstat-kep',
            freq='m',
            text='Parse sections of '
                 'Short-term Economic Indicators (KEP) '
                 'monthly Rosstat publication')

The converter function should map this dictionary to a markdown string.
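A minimal sketch of such a converter (the exact table layout is an assumption):

def as_markdown(desc):
    """Render a parser description dictionary as a two-column markdown table."""
    lines = ['| Parameter | Value |', '| --- | --- |']
    lines += ['| {} | {} |'.format(key, value) for key, value in desc.items()]
    return '\n'.join(lines)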

chain years in 'ust' parser

# IDEA: maybe add a validation decorator for all getter functions?
def get_ust_dict(start_date, downloader=util.fetch):
    """Return UST datapoints as a list of dictionaries, based on *start_date*."""
    year = make_year(start_date)
    url = make_url(year)
    content = downloader(url)
    return parse_xml(content)
# ERROR: given a start_date, this loads just one year, not all years from that date on
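A sketch of the fix, assuming make_year/make_url/parse_xml behave as above and that the current year should be included:

import datetime

def get_ust_dict(start_date, downloader=util.fetch):
    """Return UST datapoints for all years from *start_date* up to today."""
    result = []
    for year in range(make_year(start_date), datetime.date.today().year + 1):
        content = downloader(make_url(year))
        result.extend(parse_xml(content))
    return result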

make upload scenarios

Upload scenarios cover the batch of jobs the uploader should run. So far they are (see the sketch after this list):

  • kep (monthly)
  • ust, brent, cbr_fx (daily)
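A hypothetical sketch of those two scenarios, assuming each runner class exposes .upload() as in the CBR_USD example above (the daily class names other than CBR_USD are placeholders):

def monthly_scenario(start_date=None):
    # kep (monthly)
    assert RosstatKEP_Monthly(start_date).upload()

def daily_scenario(start_date=None):
    # ust, brent, cbr_fx (daily); UST and BrentEIA stand in for the real runner names
    for parser_cls in (UST, BrentEIA, CBR_USD):
        assert parser_cls(start_date).upload()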

ust: April 14 has 0 values

<entry><id>http://data.treasury.gov/Feed.svc/DailyTreasuryYieldCurveRateData(6832)</id><title type="text"/><updated>2017-11-08T21:11:43Z</updated><author><name/></author><link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(6832)"/><category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme"/><content type="application/xml"><m:properties><d:Id m:type="Edm.Int32">6832</d:Id><d:NEW_DATE m:type="Edm.DateTime">2017-04-14T00:00:00</d:NEW_DATE><d:BC_1MONTH m:type="Edm.Double">0</d:BC_1MONTH><d:BC_3MONTH m:type="Edm.Double">0</d:BC_3MONTH><d:BC_6MONTH m:type="Edm.Double">0</d:BC_6MONTH><d:BC_1YEAR m:type="Edm.Double">0</d:BC_1YEAR><d:BC_2YEAR m:type="Edm.Double">0</d:BC_2YEAR><d:BC_3YEAR m:type="Edm.Double">0</d:BC_3YEAR><d:BC_5YEAR m:type="Edm.Double">0</d:BC_5YEAR><d:BC_7YEAR m:type="Edm.Double">0</d:BC_7YEAR><d:BC_10YEAR m:type="Edm.Double">0</d:BC_10YEAR><d:BC_20YEAR m:type="Edm.Double">0</d:BC_20YEAR><d:BC_30YEAR m:type="Edm.Double">0</d:BC_30YEAR><d:BC_30YEARDISPLAY m:type="Edm.Double">0</d:BC_30YEARDISPLAY></m:properties></content></entry>

This day should be treated as empty: every maturity is reported as 0, and these zeros are not real observations.
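One possible guard when parsing the feed (a sketch; it assumes the parser yields one dictionary per maturity with a numeric 'value'):

def drop_zero_days(datapoints):
    # treat all-zero Treasury yields (as on 2017-04-14) as missing observations
    return [d for d in datapoints if d['value'] != 0]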

consider merging two layers of classes

Currently there is a class that handles input parameters and upload in runner.py, plus getter functions/classes
in the getters submodule. Probably they can be merged into one class.

Can add thin wrappers for string date handling + uploader functionality.

#TODO: merge ParserBase and getter.ust.Getter classes
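A rough sketch of what the merged class could look like, assuming the existing DateHelper and get_ust_dict; the upload helper name is a placeholder:

class UST(ParserBase):
    """Input-parameter handling, getter call and upload in one class."""

    def __init__(self, start_date=None):
        self.start_date = DateHelper.make_date(start_date)

    def get_data(self):
        return get_ust_dict(self.start_date)

    def upload(self):
        return upload_datapoints(list(self.get_data()))  # placeholder helper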

error reading XML with m:null="true"

<entry xmlns="http://www.w3.org/2005/Atom">
<id>http://data.treasury.gov/Feed.svc/DailyTreasuryYieldCurveRateData(3458)</id>
<title type="text"></title><updated>2017-11-07T14:01:54Z</updated>
<author><name /></author>
<link rel="edit" title="DailyTreasuryYieldCurveRateDatum" href="DailyTreasuryYieldCurveRateData(3458)" />
<category term="TreasuryDataWarehouseModel.DailyTreasuryYieldCurveRateDatum" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" /><content type="application/xml"><m:properties xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"><d:Id m:type="Edm.Int32" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">3458</d:Id><d:NEW_DATE m:type="Edm.DateTime" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">2010-10-11T00:00:00</d:NEW_DATE><d:BC_1MONTH m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_3MONTH m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_6MONTH m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_1YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_2YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_3YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_5YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_7YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_10YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_20YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_30YEAR m:type="Edm.Double" m:null="true" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" /><d:BC_30YEARDISPLAY m:type="Edm.Double" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">0</d:BC_30YEARDISPLAY></m:properties></content></entry>
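A sketch of tolerant reading of such a record with the standard library; the element names follow the feed above, but the project's actual parse_xml implementation is not shown here:

import xml.etree.ElementTree as ET

D = '{http://schemas.microsoft.com/ado/2007/08/dataservices}'
M = '{http://schemas.microsoft.com/ado/2007/08/dataservices/metadata}'

def read_rate(properties_element, tag):
    """Return the yield for *tag* (e.g. 'BC_1MONTH') or None if m:null="true"."""
    elem = properties_element.find(D + tag)
    if elem is None or elem.get(M + 'null') == 'true' or not elem.text:
        return None
    return float(elem.text)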

what information should we collect about the parsers?

Following issue #1, we have hand-made summary tables in markdown with parser information.

In definitions.py we have a way to keep parser information in a class and show it to the user as markdown. Some of the summary information about the parser is used as parameters for its invocation.

This issue will result in defining what information about the parsers is needed to run them and to present their summary to the user.

Current attributes are (from here):

class RosstatKEP(Parser):
    name = 'rosstat-kep'
    does_what = 'Parse sections of KEP Rosstat publication'
    freqs = 'aqm'    
    all_varnames = ['CPI_rog', 'RUR_EUR_eop']
    start_date = make_date('1999-01-31')

The issue should result in new parser summaries for all parsers listed here.

The information we specify for each parser class is used to a) bound the call to the parser (e.g. available variables, frequencies, date limits) and b) make the description more understandable.
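A hypothetical illustration of (a), using the class attributes to validate a call before running the parser:

def validate_call(parser_cls, freq, varnames, start_date):
    # reject requests that fall outside what the parser declares it can serve
    assert freq in parser_cls.freqs
    assert all(name in parser_cls.all_varnames for name in varnames)
    assert start_date >= parser_cls.start_date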

housekeeping

Finish the tests marked with FIXME.

Housekeeping:

  • add ust to runner.py
  • update readme.md with the newer tables and edit the markdown in the parsers list
  • #17?

Bigger issues:

  • runner.py: for a given date, must provide a dataset to write to the database
  • new parser: parse rosstat-isep from Word or PDF

need to cover getter.* modules with tests

There are no tests for:

  • brent.py
  • cbr_fx.py
  • kep.py

Ideally, tests should stay simple/readable and address real, not imaginary, risks in the program.
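For example, the downloader injection already used in the ust getter makes such tests straightforward (a sketch; SAMPLE_XML stands for a canned fixture):

def test_get_ust_dict_uses_injected_downloader():
    fake_downloader = lambda url: SAMPLE_XML  # no network access in the test
    items = get_ust_dict('2017-01-01', downloader=fake_downloader)
    assert isinstance(items, list)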

Also, it is best to keep in mind what kinds of tests there are:

  1. unit tests that control functionality and program structure
  2. tests that validate functionality results
  3. integration/end-to-end tests
  4. regression tests for bug fixes

Is everyone comfortable with these ideas/definitions, @Andres-Unt, @muroslav2909, @JaroVojtek?
