budgetkey-data-pipelines's Issues

[Data] [Scraper] Manpower and Service Provider Registries Scraping Pipeline

Service Providers scraper

The Service Provider Registry is a list of suppliers the government may contract with for purchasing 'services' (e.g. cleaning, security).

The data is obtained by scraping a Ministry of Economy website: http://www.economy.gov.il/Employment/WorkRights/LicensesandPermits/Licenses/Pages/NewContractorLicense.aspx

  • Write a pipeline called moital_service_providers under budgetkey_data_pipelines/pipelines/entities
  • Pipeline steps should include:
    • Scraper for the data in the table, capturing all columns (a minimal sketch follows this list)
    • Setting proper types for columns
    • Dump to file (/var/datapackages/entities/moital_service_providers)
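
A minimal sketch of the scraping step, assuming the table is plain HTML reachable with requests + pyquery (the ASPX page may in practice need selenium, per the example referenced below; the selector and column names are placeholders):

    from datapackage_pipelines.wrapper import ingest, spew
    import requests
    from pyquery import PyQuery as pq

    URL = ('http://www.economy.gov.il/Employment/WorkRights/'
           'LicensesandPermits/Licenses/Pages/NewContractorLicense.aspx')

    def scrape():
        page = pq(requests.get(URL).text)
        for tr in page('table tr').items():                  # selector is a guess
            cells = [td.text() for td in tr('td').items()]
            if cells:
                # column names are placeholders; capture all real columns
                yield dict(zip(('license_id', 'name', 'status'), cells))

    parameters, datapackage, res_iter = ingest()
    datapackage['resources'].append({
        'name': 'moital_service_providers',
        'path': 'data/moital_service_providers.csv',
        'schema': {'fields': [{'name': 'license_id', 'type': 'string'},
                              {'name': 'name', 'type': 'string'},
                              {'name': 'status', 'type': 'string'}]},
    })
    spew(datapackage, [scrape()])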

Ref:
How to write a pipeline:
https://github.com/frictionlessdata/datapackage-pipelines
Legacy scraper folder: https://www.dropbox.com/sh/xan213ilqrdh43p/AAC1tSoTiH1rnMOue7pBJyhHa?dl=0
Legacy scraper is in processors/scrape_moital_contractors.py

Example of using selenium can be found under pipelines/entities/special/scraper.py
Example of a generic scraper can be found under pipelines/supports/criteria/scraper.py

exemptions scraper should run in a smart update mode, where it keeps updating until there are no more updates

(relevant after #13 is merged)

reproduction steps

  • run the exemptions pipeline
  • for each publisher, it:
    • downloads 1 page of the latest data at a time
    • updates the data via dump.to_sql

expected

  • all new or changed items for each publisher are picked up, however many pages that takes (see the sketch below)

actual

  • currently we limit the number of pages (at the moment it's 5 pages = 50 items)
  • we insert / update all 50 items, then move to the next publisher
  • if a publisher has more than 50 new items, not all of them will be inserted
  • if a publisher has only 1 new entry - it will still update all 50 entries
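
A sketch of that smart-update loop - keep paging while a page still contains rows that are new or changed relative to the DB (the paging method and field names are assumptions):

    import hashlib
    import json

    def hash_row(row):
        return hashlib.md5(json.dumps(row, sort_keys=True, default=str).encode()).hexdigest()

    def scrape_until_no_updates(scraper, existing):
        """existing: dict of publication_id -> row hash already stored in the DB."""
        page_num = 1
        while True:
            rows = scraper.get_page(page_num)          # hypothetical paging method
            fresh = [r for r in rows
                     if existing.get(r['publication_id']) != hash_row(r)]
            for row in fresh:
                yield row
            if not rows or not fresh:                  # a page with nothing new: stop
                return
            page_num += 1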

investigate error in geocode processor: 'result' object has no attribute 'provider'

geocode_entities: INFO    :no update needed: 512206020
geocode_entities: INFO    :no update needed: 512206046
geocode_entities: INFO    :no update needed: 512206079
geocode_entities: INFO    :no update needed: 512206301
geocode_entities: INFO    :no update needed: 512206350
geocode_entities: INFO    :no update needed: 512206384
geocode_entities: Traceback (most recent call last):
geocode_entities:   File "/budgetkey_data_pipelines/pipelines/entities/geocode_entities.py", line 246, in <module>
geocode_entities:     spew(datapackage, GeoCodeEntities(parameters, datapackage, resources).filter_resources())
geocode_entities:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/wrapper/wrapper.py", line 60, in spew
geocode_entities:     for rec in res:
geocode_entities:   File "/budgetkey_data_pipelines/pipelines/entities/geocode_entities.py", line 202, in filter_resource
geocode_entities:     row = self.get_row(entity_id, entity_location)
geocode_entities:   File "/budgetkey_data_pipelines/pipelines/entities/geocode_entities.py", line 223, in get_row
geocode_entities:     provider, geojson = location_row.provider, location_row.geojson
geocode_entities: AttributeError: 'result' object has no attribute 'provider'
load_sql_table: ERROR   :Output pipe disappeared!
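
The failing line suggests location_row is not always the expected record object. A defensive variant of the access in get_row, as a starting point for the investigation (a hypothetical guard, not the final fix - the real question is why 'provider' is missing):

    import logging  # assuming a module-level import

    provider = getattr(location_row, 'provider', None)
    geojson = getattr(location_row, 'geojson', None)
    if provider is None:
        logging.warning('location row missing provider for entity %s', entity_id)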

[Data] [Scraper] Scrape the support criteria documents

We'd like to have these documents in a tabular data file: http://www.justice.gov.il/Units/Tmihot/Pages/TestServies.aspx

  • Write a pipeline called criteria under budgetkey_data_pipelines/pipelines/supports
  • Pipeline steps should include:
    • Scraper for the data in the table, capturing all columns and the file URL
    • Setting proper types for columns
    • Dump to file (/var/datapackages/supports/criteria)
  • Bonus points (after above is finished):
    • Analyze the text in the title to extract year, whether this is an amendment, and the purpose of the support. Each one of those should go in a separate column.
    • Extract text from PDF and store it in a separate column (see the extraction sketch after the references below)
    • Extract the Budget Item from the PDF and store it in a separate column (it doesn't appear in all of the files)

You can use this existing pipeline for inspiration: https://github.com/OpenBudget/budgetkey-data-pipelines/blob/master/budgetkey_data_pipelines/pipelines/procurement/spending/collect_report_uris.py
(although keep the processor local and not in the processors directory)
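
For the PDF bonus items, one minimal way to get at the text, assuming pdfminer.six is acceptable and that budget items appear as 8-digit codes (both are assumptions):

    import re
    from pdfminer.high_level import extract_text

    def pdf_fields(path):
        text = extract_text(path)
        # budget item format (e.g. 20-38-01-02 or 20380102) is a guess; verify on real files
        match = re.search(r'\b\d{2}[.-]?\d{2}[.-]?\d{2}[.-]?\d{2}\b', text)
        return {'text': text, 'budget_item': match.group(0) if match else None}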

should refactor the exemption publishers mock processor - find a better solution

background

  • the add_publisher_urls_resource processor makes some HTTP requests
  • when running unit tests we don't want to make remote requests - both to keep them fast and because remote requests are not reliable
  • for that reason, I separated the code into a standalone class (ExemptionsPublisherScraper) which handles making the HTTP requests and returning the data
  • this class can then be unit-tested directly, without the pipelines framework

the problem

  • while the unit tests for the class cover most scenarios, we still need to test the add_publisher_urls_resource processor in the context of the pipelines framework
  • we need a way to call the processor with a mock=true parameter which causes it to return some mock data instead of making the HTTP requests

expected solution

  • a solution which separates the mock code from the library / processor code
  • add_publisher_urls_resource should not contain any code that is used only for mocks or for testing

actual solution

  • currently the mock code lives inside the add_publisher_urls_resource processor (a possible refactor is sketched below)
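
One shape the separation could take (the wiring below is an assumption; only ExemptionsPublisherScraper and add_publisher_urls_resource come from the issue): the processor receives the scraper as an argument, so tests inject a fake instead of passing a mock flag:

    # add_publisher_urls_resource.py
    from datapackage_pipelines.wrapper import ingest, spew

    def publisher_rows(scraper):
        # scraper is anything iterable over publisher URLs (duck typing)
        for url in scraper:
            yield {'url': url}

    def main(scraper_factory):
        parameters, datapackage, res_iter = ingest()
        spew(datapackage, [publisher_rows(scraper_factory(parameters))])

    if __name__ == '__main__':
        # hypothetical import path
        from budgetkey_data_pipelines.pipelines.procurement.tenders.exemptions_scraper \
            import ExemptionsPublisherScraper
        main(lambda params: ExemptionsPublisherScraper(**params))

A unit test can then call publisher_rows(iter(['http://example.com/1'])) directly - no HTTP, and no mock-only code in the processor file.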

Social Map: Create Active Areas per organization

Based on the Active in Localities field (i.e. city, settlement) and the Central Bureau of Statistics keys, add to each organization a multi-value field of relevant areas.

Each city/settlement belongs to a single area; each area consists of multiple cities/settlements.
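
A sketch of the derivation, assuming the CBS keys yield a locality-to-area lookup table (field names invented):

    def active_areas(org, locality_to_area):
        """locality_to_area: dict built from the CBS keys, mapping locality name -> area."""
        areas = {locality_to_area[loc]
                 for loc in org.get('active_localities', [])
                 if loc in locality_to_area}
        return sorted(areas)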

[Data] [Legal] FOI request the list of public good companies

@shevyk

Later:

  • Get the list of public good companies from the Companies Registrar using the suffix (חל״צ)
  • Make sure that the lists match
  • If so, continue to scrape data on the public good companies from GuideStar as well
  • Combine the GuideStar details with the Companies Registrar details.

[Data] [Scraper] Calcalist manning announcements

See this page:
www.calcalist.co.il/local/home/0,7340,L-3789,00.html

We want to scrape the data here, similar to what we're doing with the TheMarker page:
https://github.com/OpenBudget/budgetkey-data-pipelines/blob/master/budgetkey_data_pipelines/pipelines/people/appointments/media/pipeline-spec.yaml
(although scraping mechanism will probably be different)

Step 1: Scrape the data and store it individually, using the same schema as the TheMarker source
Step 2: Clean duplicates between the two sources.
Step 3: Make sure that each item in the result contains an indication (or link) to its sources

[Data][Scraper] Cooperatives scraper generates duplicate key

http://next.obudget.org/pipelines/#anchor-FAILED-entities-cooperatives-cooperatives

dump.to_sql: Traceback (most recent call last):
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1159, in _execute_context
dump.to_sql:     context)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 467, in do_executemany
dump.to_sql:     cursor.executemany(statement, parameters)
dump.to_sql: psycopg2.IntegrityError: duplicate key value violates unique constraint "cooperatives_pkey"
dump.to_sql: DETAIL:  Key (id)=(570000018) already exists.
dump.to_sql: The above exception was the direct cause of the following exception:
dump.to_sql: Traceback (most recent call last):
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/dump/to_sql.py", line 134, in <module>
dump.to_sql:     SQLDumper()()
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/lib/dump/dumper_base.py", line 33, in __call__
dump.to_sql:     self.stats)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/wrapper/wrapper.py", line 60, in spew
dump.to_sql:     for rec in res:
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/lib/dump/dumper_base.py", line 69, in hasher
dump.to_sql:     for row in resource:
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/lib/dump/dumper_base.py", line 59, in row_counter
dump.to_sql:     for row in resource:
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/jsontableschema_sql/writer.py", line 57, in write
dump.to_sql:     for wr in self.__insert():
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/jsontableschema_sql/writer.py", line 75, in __insert
dump.to_sql:     statement.execute(self.__buffer)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/base.py", line 386, in execute
dump.to_sql:     return e._execute_clauseelement(self, multiparams, params)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
dump.to_sql:     compiled_sql, distilled_params
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
dump.to_sql:     context)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
dump.to_sql:     exc_info
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
dump.to_sql:     reraise(type(exception), exception, tb=exc_tb, cause=cause)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 186, in reraise
dump.to_sql:     raise value.with_traceback(tb)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1159, in _execute_context
dump.to_sql:     context)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 467, in do_executemany
dump.to_sql:     cursor.executemany(statement, parameters)
dump.to_sql: sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "cooperatives_pkey"
dump.to_sql: DETAIL:  Key (id)=(570000018) already exists.
dump.to_sql:  [SQL: 'INSERT INTO cooperatives (id, name, registration_date, phone, primary_type_id, primary_type, secondary_type_id, secondary_type, legal_status_id, legal_status, last_status_date, type, municipality_id, municipality, inspector, address) VALUES (%(id)s, %(name)s, %(registration_date)s, %(phone)s, %(primary_type_id)s, %(primary_type)s, %(secondary_type_id)s, %(secondary_type)s, %(legal_status_id)s, %(legal_status)s, %(last_status_date)s, %(type)s, %(municipality_id)s, %(municipality)s, %(inspector)s, %(address)s)'] [parameters: ({'id': '570000018', 'name': 'נחלת ישראל רמה אגודה שתופית בע"מ (במחיקה)', 'registration_date': datetime.datetime(1921, 2, 6, 0, 0), 'phone': '', 'primary_type_id': '43', 'primary_type': 'שיכון', 'secondary_type_id': '61', 'secondary_type': 'שיכון', 'legal_status_id': '23', 'legal_status': 'הודעה שניה על מחיקה', 'last_status_date': datetime.datetime(1970, 1, 29, 12, 0), 'type': 'התאחדות האיכרים', 'municipality_id': None, 'municipality': '', 'inspector': 'צוק חיים', 'address': 'דופקר ++++'}, {'id': '570000026', 'name': 'הלואה וחסכון ירושלים אגודה שיתופית בע"מ', 'registration_date': datetime.datetime(1921, 3, 24, 10, 25), 'phone': '02-6234432', 'primary_type_id': '44', 'primary_type': 'אשראי וחסכון', 'secondary_type_id': '63', 'secondary_type': 'אשראי', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1921, 3, 24, 8, 13, 46), 'type': '', 'municipality_id': '3000', 'municipality': 'ירושלים', 'inspector': 'בן-חמו יוסף', 'address': 'קרן היסוד 41  ירושלים 94188 ת.ד:  2575'}, {'id': '570000034', 'name': 'הלואה וחיסכון זכרון יעקב אגודה הדדית בע"מ', 'registration_date': datetime.datetime(1921, 10, 16, 0, 0), 'phone': '', 'primary_type_id': '44', 'primary_type': 'אשראי וחסכון', 'secondary_type_id': '63', 'secondary_type': 'אשראי', 'legal_status_id': '26', 'legal_status': 'אגודה מוזגה - מבוטלת', 'last_status_date': datetime.datetime(1977, 1, 29, 12, 0), 'type': '', 'municipality_id': '9300', 'municipality': 'זכרון יעקב', 'inspector': 'חליל יעקב', 'address': 'הנציב       זכרון יעקב     מיקוד ת.ד:  8'}, {'id': '570000042', 'name': 'קיבוץ איילת השחר', 'registration_date': datetime.datetime(1921, 12, 19, 12, 52), 'phone': '04-6932111', 'primary_type_id': '40', 'primary_type': 'חקלאות', 'secondary_type_id': '74', 'secondary_type': 'קיבוץ מתחדש', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1921, 12, 19, 13, 27, 17), 'type': 'תנועה קבוצית מאוחדת  תק"מ', 'municipality_id': '77', 'municipality': 'איילת השחר', 'inspector': 'מור טל', 'address': 'ד.נ. 
גליל עליון   איילת השחר 12200'}, {'id': '570000059', 'name': 'אגודה שיתופית לעזרה הדדית ברחובות בע"מ', 'registration_date': datetime.datetime(1921, 12, 19, 0, 0), 'phone': '', 'primary_type_id': '44', 'primary_type': 'אשראי וחסכון', 'secondary_type_id': '63', 'secondary_type': 'אשראי', 'legal_status_id': '25', 'legal_status': 'אגודה בוטלה  לאחר פירוק', 'last_status_date': datetime.datetime(1987, 5, 21, 12, 0), 'type': '', 'municipality_id': '8400', 'municipality': 'רחובות', 'inspector': 'יהודה משה', 'address': 'רחובות   רחובות'}, {'id': '570000067', 'name': 'הכפר העברי - אגודה הדדית בע"מ', 'registration_date': datetime.datetime(1922, 2, 2, 0, 0), 'phone': '02-5700756', 'primary_type_id': '40', 'primary_type': 'חקלאות', 'secondary_type_id': '54', 'secondary_type': 'אגודה חקלאית כללית', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1922, 2, 2, 12, 0), 'type': '', 'municipality_id': '3000', 'municipality': 'ירושלים', 'inspector': 'בן-חמו יוסף', 'address': 'נמצא אצל אבנאור שמואל, אביזהר, אחוזת בית הכ 8 כניסה: 247 ירושלים 96267'}, {'id': '570000075', 'name': 'החקלאית אג"ש לבטוח ולשרותים וטרינריים למקנה בישראל בעמ', 'registration_date': datetime.datetime(1922, 4, 11, 0, 0), 'phone': '04-6279600', 'primary_type_id': '40', 'primary_type': 'חקלאות', 'secondary_type_id': '57', 'secondary_type': 'ביטוח חקלאי', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1922, 4, 11, 12, 0), 'type': '', 'municipality_id': '1167', 'municipality': 'קיסריה', 'inspector': 'שרעבי מזל', 'address': 'הברקת 20  קיסריה 38900 ת.ד:  3039'}, {'id': '570000083', 'name': 'קופת מלוה חקלאית לגמ"ח אגודה שיתופית פתח תקוה בע"מ', 'registration_date': datetime.datetime(1922, 6, 29, 0, 0), 'phone': '', 'primary_type_id': '45', 'primary_type': 'תגמולים פנסיה ועזה"ד', 'secondary_type_id': '63', 'secondary_type': 'אשראי', 'legal_status_id': '25', 'legal_status': 'אגודה בוטלה  לאחר פירוק', 'last_status_date': datetime.datetime(1992, 12, 14, 12, 0), 'type': '', 'municipality_id': '7900', 'municipality': 'פתח תקוה', 'inspector': '', 'address': 'מונטיפיורי   14  פתח תקוה 49364'}  ... displaying 10 of 1001 total bound parameter sets ...  {'id': '570010009', 'name': 'הכורם הצעיר אגודה שיתופית חקלאית להספקת מים ברחובות בע"מ', 'registration_date': datetime.datetime(1950, 10, 23, 0, 0), 'phone': '08-9461817', 'primary_type_id': '40', 'primary_type': 'חקלאות', 'secondary_type_id': '56', 'secondary_type': 'אספקת מים', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1950, 10, 23, 12, 0), 'type': '', 'municipality_id': '8400', 'municipality': 'רחובות', 'inspector': 'בר-נתן יונה', 'address': 'הרצל 143  רחובות 76266'}, {'id': '570010017', 'name': 'מעונות עובדי קופת חולים ב\' בגבעתיים אגודה שיתופית בע"מ', 'registration_date': datetime.datetime(1950, 10, 23, 0, 0), 'phone': '03-6250814', 'primary_type_id': '43', 'primary_type': 'שיכון', 'secondary_type_id': '61', 'secondary_type': 'שיכון', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1950, 10, 23, 12, 0), 'type': '', 'municipality_id': '6300', 'municipality': 'גבעתיים', 'inspector': 'בר-נתן יונה', 'address': 'נמצא אצל בוריס בריק, מצולות ים 14  גבעתיים 53486'})]
(sink): /usr/local/lib/python3.6/site-packages/jsontableschema/model.py:48: UserWarning: Class models.SchemaModel is deprecated [v0.7-v1)
(sink):   warnings.warn(message, UserWarning)
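
Whatever the scraper bug turns out to be (the duplicate id suggests a page is being read twice), a guard step before dump.to_sql that drops repeated ids would keep the pipeline from failing - a sketch, not necessarily the right fix:

    def unique_by_id(rows):
        seen = set()
        for row in rows:
            if row['id'] not in seen:      # keep the first occurrence, drop repeats
                seen.add(row['id'])
                yield row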

Ottoman Associations Scraping

The list of Ottoman Associations can be found here: https://foi.gov.il/he/node/1908

You need to write a pipeline that converts the Excel file into a simple tabular resource with 3 columns:

  • id
  • name
  • address

The pipeline should be named 'ottoman-associations' and be located under entities/
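
A minimal sketch of the conversion, assuming the Excel sheet has the three columns in that order with one header row (openpyxl assumed; any tabular loader would do):

    from openpyxl import load_workbook

    def ottoman_rows(path):
        sheet = load_workbook(path, read_only=True).active
        for row in sheet.iter_rows(min_row=2, values_only=True):   # skip the header row
            yield dict(zip(('id', 'name', 'address'), row))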

[Data] [Enrichment] Geocode addresses of entities

Add a processor that does some geocoding for addresses of entities.

  • Make sure requests are cached locally, so that multiple requests for the same address are resolved without hitting the network (a caching sketch follows this list)
  • Collect statistics on misses
  • Employ heuristics to improve the geolocation success rate without causing false positives
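
A sketch of the caching requirement, assuming geopy's Nominatim as the geocoder (the provider choice is open; the cache path is illustrative):

    import shelve
    from geopy.geocoders import Nominatim

    geocoder = Nominatim(user_agent='budgetkey-data-pipelines')

    def geocode(address, cache_path='/var/geocode-cache'):
        with shelve.open(cache_path) as cache:
            if address not in cache:                 # miss: hit the network once
                location = geocoder.geocode(address)
                cache[address] = ((location.latitude, location.longitude)
                                  if location else None)
            return cache[address]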

investigate discrepancy in tender scraper row counts

Logged number of rows from pipeline

scraper-office

add_publisher_urls_resource: INFO :Processed 10362 rows
dump.to_path: INFO :Processed 10362 rows

scraper-exemptions

add_publisher_urls_resource: INFO :Processed 94878 rows
dump.to_path: INFO :Processed 94878 rows

scraper-central

add_central_urls_resource: INFO :Processed 123 rows
dump.to_path: INFO :Processed 123 rows

count of rows in DB:

TENDER_TYPE   COUNT
office        10,377
exemptions    94,898
central       1

[Data] [Quality] Debug problematic spending reports

There are about 300 spending reports coming from the government, each with a slightly different format, and there's a pipeline that's supposed to consolidate them into one (it's everything in procurement/spending).

But - it doesn’t always succeed and there are errors in the process:

  • sometimes because the original data is shit
  • sometimes because our processor misses some bits

there’s this report which is generated during the process: http://data.obudget.org/queries/1153/source

At the moment there are 59 problematic reports:

  • some are okay, but only a few rows are bad.
  • some are completely erroneous

The task is to:

  • Go over each of the bad reports
  • Understand what the problem is and either:
    a. fix the processor
    b. decide that the data is shit (unrecoverable)
    c. fix the data manually and add a 'manual' source to the pipeline

Also - record the decisions somehow (maybe incorporate them in the pipeline) so that in the future we can query only the reports which have not yet been analysed.

[Data] [Quality] Research entity name resolution misses

We try to find the proper entity in the official registries when all we have is a name (e.g. in the supports dataset, in exemptions, and sometimes with contract spending).

Most of the time we succeed, but not always.

This task is to understand why we miss and improve the detection rate.
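
Name normalization is usually the first thing to check. A sketch of the kind of cleanup that tends to raise the hit rate (the suffix list is illustrative, not the repo's current logic):

    import re

    LEGAL_SUFFIXES = ('בע"מ', '(חל"צ)', '(ע"ר)')    # legal-form suffixes; list is illustrative

    def normalize_name(name):
        for suffix in LEGAL_SUFFIXES:
            name = name.replace(suffix, ' ')
        name = re.sub(r'[^\w\s]', ' ', name)         # drop punctuation and quote marks
        return re.sub(r'\s+', ' ', name).strip()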

pulling code from the remote (OpenBudget) master branch into my origin master branch should not cause a merge commit

look at this merge commit - OriHoch@28bb9d8

it happened when I pulled the latest changes from the OpenBudget master branch into my master branch

my master branch didn't have any local changes, so it should have been a fast-forward without a merge commit

the merge commit is for these 2 commits:

  • cf8095d - this commit is on the OpenBudget master branch
  • f2899e6 - this commit is not on the OpenBudget master branch

however, there is a commit with the same title and timestamp which is on the OpenBudget master branch:
2dcd32e

not sure exactly what's going on here - a commit with the same title and timestamp but a different hash is what a rebase produces, which would also explain why the pull could not fast-forward

@akariv - is it possible you are editing history?

should add exemptions (פטור ממכרז) to pipelines (based on the legacy tenders/exemption_scraper)

Exemption scraper

Exemptions are records for planned purchases of the government that are exempt from a tender process.

The website to scrape is quite slow and buggy, so we want to scrape frequently and only fetch the most recent records (for example, scrape every few hours and get a day's worth of records).
(For initialisation there should also be a one-time-run pipeline that scrapes the entire website.)
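
One way to read "a day's worth", assuming the site returns records newest-first with a publication date (both assumptions):

    from datetime import datetime, timedelta

    def recent_records(pages, days=1):
        """pages: iterator of result pages, newest records first (an assumption)."""
        cutoff = datetime.now() - timedelta(days=days)
        for page in pages:
            fresh = [r for r in page if r['publication_date'] >= cutoff]
            for record in fresh:
                yield record
            if len(fresh) < len(page):         # we crossed the cutoff: stop paging
                return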

Data is written to a DB table (exemptions) in update mode, so that records with the same publication id are updated - data is appended and not replaced.

Scraped data includes the basic list of fields for each exemption record, as well as the list of documents for the record (which should be saved as a JSON object).

There might be issues with scraping from non-Israeli IPs - talk to me if you encounter such issues.


Ref:
Legacy scraper folder: https://www.dropbox.com/sh/xan213ilqrdh43p/AAC1tSoTiH1rnMOue7pBJyhHa?dl=0
Legacy scraper is in tenders/exemption_scraper.py

tenders exemptions should parse signed documents

reproduction

expected

  • should get the document

actual

notes

Guidestar scraper: Add association endpoints

Build a pipeline to scrape the following data from the GuideStar association item page:

  • Number of workers (integer)

  • Number of volunteers (integer)
    (check this link for a "No report" placeholder text example)

  • Activity location(s) (strings array)

  • Annual turnover (integer/float)

  • Salaries (+ names) - up to 5 positions (position (string), amount (integer/float))

[Data][Processing] Analyze support criteria titles to extract extra properties

We scrape a list of support criteria documents (מבחני תמיכה).

We want to analyze the titles of these documents (which usually follow a very similar template) and extract properties such as:

  • Funding period
  • Amendment yes/no
  • Purpose of the support
  • etc.

Tasks:

  • Take the list of titles from the query http://data.obudget.org/queries/1203/source
  • Write some Python code to analyze the titles and extract the different attributes out of them (a starting sketch follows this list)
  • Incorporate your code into the pipeline at supports/criteria
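
A starting-point sketch for the extraction - the patterns below are guesses read off the sample titles that follow, not a complete solution:

    import re

    def analyze_title(title):
        year_match = re.search(r'לשנת(?: התקציב)? (\d{4})', title)   # "for the (budget) year NNNN"
        return {
            'is_amendment': title.startswith('תיקון'),   # amendment titles open with "תיקון"/"תיקונים"
            'year': int(year_match.group(1)) if year_match else None,
            # the purpose usually follows the ministry clause; extracting it cleanly
            # needs more patterns than shown here
        }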

Sample of such titles:

  • תיקון למבחנים של משרד הבריאות לצורך תמיכה בבתי חולים ציבוריים במצוקה
  • מבחנים לצורך תמיכה של משרד החינוך העוסקים בהפעלת מתנדבים במערכת החינוך
  • מבחנים לצורך תמיכה של משרד החינוך בפעילות שוטפת של כפרי סטודנטים
  • מבחנים לחלוקת כספים לצורך תמיכה של המשרד להגנת הסביבה במוסדות ציבורי העוסקים בטיפול דחוף בקופים הנמצאים בסכנת קיום
  • תיקון למבחנים לחלוקת כספי תמיכות של משרד הרווחה והשירותים החברתיים לשיפורי מיגון ובינוי לצורך מיגון מסגרות רווחה חוץ–ביתיות
  • מבחנים לתמיכה של משרד התרבות והספורט במוסדות המשמשים כארגון גג בתחום התרבות
  • מבחנים לחלוקת כספי תמיכות של משרד התרבות והספורט במוסדות ציבור המקיימים פסטיבלים בתחום אמנויות הבמה
  • מבחנים לתמיכה של משרד התרבות והספורט למוסדות ציבור בתחום יוצרי מחול עצמאיים, פרויקטים בתחום המחול ומרכזי מחול
  • תיקון למבחנים לחלוקת כספי תמיכות של משרד התרבות והספורט במוסדות תרבות העוסקים בהוראת האמנויות
  • מבחנים למתן תמיכות של משרד העלייה והקליטה למוסדות ציבור המבצעים פעולות לקליטה בקהילה
  • תיקון למבחנים לחלוקת כספי תמיכות של משרד התרבות והספורט למוסדות ציבור בתחום המוזאונים המוכרים
  • תיקון למבחנים לחלוקת כספי תמיכה של משרד התרבות והספורט בספריות המיועדות לעיוורים ולליקווי ראייה
  • תיקון למבחנים לתמיכה של משרד התרבות והספורט בתחרויות מוסיקה בין–לאומיות
  • תיקון למבחנים של משרד הבריאות לצורך תמיכה בהוצאות הניתוחים והאשפוז של קופות החולים המבצעות תכנית לקיצור תורים לשנת התקציב 2016
  • תיקון למבחנים למתן תמיכות של משרד הרווחה והשירותים החברתיים למוסדות ציבור
  • תיקון למבחנים למתן תמיכות של משרד הרווחה והשירותים החברתיים למוסדות ציבור המפעילים תכניות למתנדבים השייכים לאוכלוסיות הרווחה לצורך שיקומם
  • תיקונים למבחנים לחלוקת כספים לצורך תמיכה של משרד החינוך במוסדות תורניים -לימוד ופעולות
  • מבחנים לצורך תמיכה של משרד החינוך בקונסרבטוריונים
  • מבחן תמיכה של משרד החינוך במוסדות ציבור העוסקים בתחום מורשת כוחות המגן והמחתרות שפעלו בתקופה שקדמה להקמת המדינה
  • מחנים למתן תמיכות של משרד התרבות והספורט במכוני מחקר תורניים בעלי חשיבות לאומית

[Data] [Infra] procurement-tenders-exemptions should update existing items (periodically or by other logic)

reproduction

  • procurement-tenders-exemptions pipeline runs daily and inserts new exemption (after #13 is merged)
  • after a week an exemption's details are updated

expected

  • the updated exemption's details should be updated

actual

  • only new exemptions are inserted (existence is determined by ID)

notes

  • should consider some logic for how / when to update
  • @akariv suggested a solution based on the exemption status:
    • This will allow more correct behaviours in the future, such as periodically re-scraping records that are still "in progress" (i.e. data might still be changing) and not scraping records that have been concluded.

  • a solution which doesn't rely on the underlying data (sketched below):
    • have a last-scrape-date column in the DB
    • on every daily run (or a different, maybe hourly, run) - update a limited number of exemptions, starting with those scraped longest ago
    • on every updated exemption the last scrape date is updated
    • that way, it will slowly go over all exemptions and update them
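
A sketch of that last-scrape-date rotation (table and column names, the env var, and the batch size are all assumptions):

    import os
    from sqlalchemy import create_engine

    BATCH_SIZE = 100                                       # exemptions refreshed per run
    engine = create_engine(os.environ['DPP_DB_ENGINE'])    # hypothetical env var

    stale = engine.execute(
        'SELECT publication_id FROM exemptions '
        'ORDER BY last_scrape_date ASC NULLS FIRST '
        'LIMIT %(n)s', {'n': BATCH_SIZE})
    for (publication_id,) in stale:
        # re-scrape this exemption, then set its last_scrape_date to now()
        pass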

Monetary Change transaction grouping

Monetary changes are records of single changes to 3rd-level budget items (a.k.a. "6-digit" items).
During the year, the national budget is modified, and these records represent the individual modifications (some of these changes are brought to the Knesset Finance Committee for approval, some are not).

We download the data and store it in a DB (see here: http://data.obudget.org/queries/1061/source)

The pipeline for the downloading and processing the data is here: https://github.com/OpenBudget/budgetkey-data-pipelines/tree/master/budgetkey_data_pipelines/pipelines/budget/national/changes/original

All changes are part of a request - a request is uniquely identified by the (leading_item, req_code) fields. Requests that are brought to the committee for approval also have a non-zero committee_id.

Some of these requests are part of a larger transaction - for example, when moving funds from one ministry to another, we'll see a request for moving funds from ministry A to the general reserve and another request for moving the same amount from the general reserve to ministry B.

The goal of this issue is to assign a 'transaction id' to all changes - the same transaction id should be assigned to all changes that belong to the same transaction.

You should create a pipeline and processors in budget/national/changes/processed/
The pipeline should be called 'transactions' and should generate a 3-column file (a naive grouping sketch follows the column list):

  • leading_item
  • req_code
  • transaction_id
    with one row per transaction_id.
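
A deliberately naive grouping heuristic, just to make the output shape concrete - the legacy algorithm in tenders/extract_change_groups.py is the reference and likely does something smarter, such as actually matching flows through the general reserve (field names here are assumptions):

    from collections import defaultdict

    def assign_transaction_ids(changes):
        """changes: dicts with leading_item, req_code, date and net_amount fields."""
        groups = defaultdict(set)
        for change in changes:
            # naive rule: same date + same absolute amount => same transaction
            key = (change['date'], abs(change['net_amount']))
            groups[key].add((change['leading_item'], change['req_code']))
        for transaction_id, requests in enumerate(groups.values(), start=1):
            for leading_item, req_code in sorted(requests):
                yield {'leading_item': leading_item,
                       'req_code': req_code,
                       'transaction_id': transaction_id}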

Ref:
Legacy processing code folder: https://www.dropbox.com/sh/xan213ilqrdh43p/AAC1tSoTiH1rnMOue7pBJyhHa?dl=0
Legacy algorithm is in tenders/extract_change_groups.py

should ensure same version of datapackage-pipelines dependency runs on all environments

preconditions

  • a new datapackage-pipelines version was published to pypi
  • but the Dockerfile was not updated - so the image contains the previous version
  • however, the code in budgetkey-data-pipelines relies on the latest version and breaks on the previous datapackage-pipelines version

reproduction steps

  • run the budgetkey-data-pipelines docker image

expected

  • docker should run with the latest pypi version, which the code was tested against

actual

  • the code fails because it runs with the previous datapackage-pipelines version from docker
  • we pull the latest docker image, which has the previous datapackage-pipelines version:
    • FROM frictionlessdata/datapackage-pipelines:latest
  • budgetkey-data-pipelines is installed like this:
    • RUN sudo pip install -e /
  • setup.py doesn't contain the datapackage-pipelines dependency, so pip won't upgrade it (no --upgrade param)
  • we end up with the previous datapackage-pipelines version (fix sketched below)
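
The straightforward fix is to declare and pin the dependency in setup.py, so pip resolves the same version in every environment (the version number below is illustrative):

    # setup.py (excerpt)
    from setuptools import setup, find_packages

    setup(
        name='budgetkey-data-pipelines',
        packages=find_packages(),
        install_requires=[
            'datapackage-pipelines==1.0.8',   # exact pin is illustrative; the point is pinning
        ],
    )

Pinning the base image to a version tag instead of :latest (if tagged images are published) would stop the Docker layer from drifting as well.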

Social Map save past data

Update the data structure to include past data (an archive) per organization. We need to keep an archive of past data (at year resolution) for the following:

  • annual turnover
  • number of workers
  • number of volunteers
  • Nihul Takin
  • Salaries

We can run the archiving process in real time (whenever a specific entry is updated) or once a month. It would be better to keep a whole snapshot of all relevant entries at a certain point in time, to avoid misunderstandings and adjustments in data interpretation.
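
One possible shape for the per-year archive (field names and values are illustrative):

    def archive_snapshot(org, year, snapshot):
        """Attach a per-year snapshot to the organization record."""
        org.setdefault('history', {})[str(year)] = snapshot

    archive_snapshot(org={}, year=2016, snapshot={
        'annual_turnover': 1200000,   # values illustrative
        'workers': 14,
        'volunteers': 40,
        'nihul_takin': True,          # ניהול תקין (proper management) certificate held
        'salaries': [],               # up to 5 (position, amount) pairs
    })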

Government Company Registrar

http://mof.gov.il/GCA/CompaniesInformation/Pages/default.aspx
http://mof.gov.il/GCA/CompaniesInformation/Pages/CompanyInfo.aspx?k=483

  • Build a pipeline for scraping the list of government owned companies: name, details page URL
  • Build a dependent pipeline to scrape the detail pages: any information that can be extracted from there, e.g.:
    • purpose,
    • officers,
    • directors,
    • etc.
  • Build a dependent pipeline that reads the activity summary PDFs and extracts the text within them - especially the company's registration id, ownership in other companies, etc.
