budgetkey-data-pipelines's Issues

[Data] [Scraper] Manpower and Service Provider Registries Scraping Pipeline

Service Providers scraper

The Service Provider Registry is a list of suppliers the government may contract with for purchasing 'services' (e.g. cleaning, security).

The data is obtained by scraping a Ministry of Economy website: http://www.economy.gov.il/Employment/WorkRights/LicensesandPermits/Licenses/Pages/NewContractorLicense.aspx

  • Write a pipeline called moital_service_providers under budgetkey_data_pipelines/pipelines/entities
  • Pipeline steps should include:
    • Scraper for the data in the table, capturing all columns (a minimal sketch follows this list)
    • Setting proper types for columns
    • Dump to file (/var/datapackages/entities/moital_service_providers)
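
A minimal sketch of the scraping step, assuming the table is plain HTML reachable with requests + pyquery (the ASPX page may in practice need selenium, per the example referenced below; the selector and column names are placeholders):

    from datapackage_pipelines.wrapper import ingest, spew
    import requests
    from pyquery import PyQuery as pq

    URL = ('http://www.economy.gov.il/Employment/WorkRights/'
           'LicensesandPermits/Licenses/Pages/NewContractorLicense.aspx')

    def scrape():
        page = pq(requests.get(URL).text)
        for tr in page('table tr').items():                  # selector is a guess
            cells = [td.text() for td in tr('td').items()]
            if cells:
                # column names are placeholders; capture all real columns
                yield dict(zip(('license_id', 'name', 'status'), cells))

    parameters, datapackage, res_iter = ingest()
    datapackage['resources'].append({
        'name': 'moital_service_providers',
        'path': 'data/moital_service_providers.csv',
        'schema': {'fields': [{'name': 'license_id', 'type': 'string'},
                              {'name': 'name', 'type': 'string'},
                              {'name': 'status', 'type': 'string'}]},
    })
    spew(datapackage, [scrape()])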

Ref:
How to write a pipeline:
https://github.com/frictionlessdata/datapackage-pipelines
Legacy scraper folder: https://www.dropbox.com/sh/xan213ilqrdh43p/AAC1tSoTiH1rnMOue7pBJyhHa?dl=0
Legacy scraper is in processors/scrape_moital_contractors.py

Example of using selenium can be found under pipelines/entities/special/scraper.py
Example of a generic scraper can be found under pipelines/supports/criteria/scraper.py

exemptions scraper should run in a smart update mode, where it keeps updating until there are no more updates

(relevant after #13 is merged)

reproduction steps

  • run the exemptions pipeline
  • for each publisher, it:
    • downloads 1 page of the latest data at a time
    • updates the data via dump.to_sql

expected

  • all new or changed items for each publisher are picked up, however many pages that takes (see the sketch below)

actual

  • currently we limit the number of pages (at the moment it's 5 pages = 50 items)
  • we insert / update all 50 items, then move to the next publisher
  • if a publisher has more than 50 new items, not all of them will be inserted
  • if a publisher has only 1 new entry - it will still update all 50 entries
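
A sketch of that smart-update loop - keep paging while a page still contains rows that are new or changed relative to the DB (the paging method and field names are assumptions):

    import hashlib
    import json

    def hash_row(row):
        return hashlib.md5(json.dumps(row, sort_keys=True, default=str).encode()).hexdigest()

    def scrape_until_no_updates(scraper, existing):
        """existing: dict of publication_id -> row hash already stored in the DB."""
        page_num = 1
        while True:
            rows = scraper.get_page(page_num)          # hypothetical paging method
            fresh = [r for r in rows
                     if existing.get(r['publication_id']) != hash_row(r)]
            for row in fresh:
                yield row
            if not rows or not fresh:                  # a page with nothing new: stop
                return
            page_num += 1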

investigate error in geocode processor: 'result' object has no attribute 'provider'

geocode_entities: INFO    :no update needed: 512206020
geocode_entities: INFO    :no update needed: 512206046
geocode_entities: INFO    :no update needed: 512206079
geocode_entities: INFO    :no update needed: 512206301
geocode_entities: INFO    :no update needed: 512206350
geocode_entities: INFO    :no update needed: 512206384
geocode_entities: Traceback (most recent call last):
geocode_entities:   File "/budgetkey_data_pipelines/pipelines/entities/geocode_entities.py", line 246, in <module>
geocode_entities:     spew(datapackage, GeoCodeEntities(parameters, datapackage, resources).filter_resources())
geocode_entities:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/wrapper/wrapper.py", line 60, in spew
geocode_entities:     for rec in res:
geocode_entities:   File "/budgetkey_data_pipelines/pipelines/entities/geocode_entities.py", line 202, in filter_resource
geocode_entities:     row = self.get_row(entity_id, entity_location)
geocode_entities:   File "/budgetkey_data_pipelines/pipelines/entities/geocode_entities.py", line 223, in get_row
geocode_entities:     provider, geojson = location_row.provider, location_row.geojson
geocode_entities: AttributeError: 'result' object has no attribute 'provider'
load_sql_table: ERROR   :Output pipe disappeared!
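
The failing line suggests location_row is not always the expected record object. A defensive variant of the access in get_row, as a starting point for the investigation (a hypothetical guard, not the final fix - the real question is why 'provider' is missing):

    import logging  # assuming a module-level import

    provider = getattr(location_row, 'provider', None)
    geojson = getattr(location_row, 'geojson', None)
    if provider is None:
        logging.warning('location row missing provider for entity %s', entity_id)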

[Data] [Scraper] Scrape the support criteria documents

We'd like to have these documents in a tabular data file: http://www.justice.gov.il/Units/Tmihot/Pages/TestServies.aspx

  • Write a pipeline called criteria under budgetkey_data_pipelines/pipelines/supports
  • Pipeline steps should include:
    • Scraper for the data in the table, capturing all columns and the file URL
    • Setting proper types for columns
    • Dump to file (/var/datapackages/supports/criteria)
  • Bonus points (after above is finished):
    • Analyze the text in the title to extract year, whether this is an amendment, and the purpose of the support. Each one of those should go in a separate column.
    • Extract text from PDF and store it in a separate column (see the extraction sketch after the references below)
    • Extract the Budget Item from the PDF and store it in a separate column (it doesn't appear in all of the files)

You can use this existing pipeline for inspiration: https://github.com/OpenBudget/budgetkey-data-pipelines/blob/master/budgetkey_data_pipelines/pipelines/procurement/spending/collect_report_uris.py
(although keep the processor local and not in the processors directory)
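
For the PDF bonus items, one minimal way to get at the text, assuming pdfminer.six is acceptable and that budget items appear as 8-digit codes (both are assumptions):

    import re
    from pdfminer.high_level import extract_text

    def pdf_fields(path):
        text = extract_text(path)
        # budget item format (e.g. 20-38-01-02 or 20380102) is a guess; verify on real files
        match = re.search(r'\b\d{2}[.-]?\d{2}[.-]?\d{2}[.-]?\d{2}\b', text)
        return {'text': text, 'budget_item': match.group(0) if match else None}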

should refactor the exemption publishers mock processor - find a better solution

background

  • the add_publisher_urls_resource processor makes some HTTP requests
  • when running unit tests we don't want to make remote requests - both to keep them fast and because remote requests are not reliable
  • for that reason, I separated the code into a standalone class (ExemptionsPublisherScraper) which handles making the HTTP requests and returning the data
  • this class can then be unit-tested directly, without the pipelines framework

the problem

  • while the unit tests for the class cover most scenarios, we still need to test the add_publisher_urls_resource processor in the context of the pipelines framework
  • we need a way to call the processor with a mock=true parameter which causes it to return some mock data instead of making the HTTP requests

expected solution

  • a solution which separates the mock code from the library / processor code
  • add_publisher_urls_resource should not contain any code that is used only for mocks or for testing

actual solution

  • currently the mock code lives inside the add_publisher_urls_resource processor (a possible refactor is sketched below)
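
One shape the separation could take (the wiring below is an assumption; only ExemptionsPublisherScraper and add_publisher_urls_resource come from the issue): the processor receives the scraper as an argument, so tests inject a fake instead of passing a mock flag:

    # add_publisher_urls_resource.py
    from datapackage_pipelines.wrapper import ingest, spew

    def publisher_rows(scraper):
        # scraper is anything iterable over publisher URLs (duck typing)
        for url in scraper:
            yield {'url': url}

    def main(scraper_factory):
        parameters, datapackage, res_iter = ingest()
        spew(datapackage, [publisher_rows(scraper_factory(parameters))])

    if __name__ == '__main__':
        # hypothetical import path
        from budgetkey_data_pipelines.pipelines.procurement.tenders.exemptions_scraper \
            import ExemptionsPublisherScraper
        main(lambda params: ExemptionsPublisherScraper(**params))

A unit test can then call publisher_rows(iter(['http://example.com/1'])) directly - no HTTP, and no mock-only code in the processor file.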

Social Map: Create Active Areas per organization

Based on the Active in Localities field (i.e. city, settlement) and the Central Bureau of Statistics keys, add to each organization a multi-value field of relevant areas.

Each city/settlement belongs to a single area; each area consists of multiple cities/settlements.
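
A sketch of the derivation, assuming the CBS keys yield a locality-to-area lookup table (field names invented):

    def active_areas(org, locality_to_area):
        """locality_to_area: dict built from the CBS keys, mapping locality name -> area."""
        areas = {locality_to_area[loc]
                 for loc in org.get('active_localities', [])
                 if loc in locality_to_area}
        return sorted(areas)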

[Data] [Legal] FOI request the list of public good companies

@shevyk

Later:

  • Get the list of public good companies from the Companies Registrar using the suffix (חל״צ)
  • Make sure that the lists match
  • If so, continue to scrape data on the public good companies from GuideStar as well
  • Combine the GuideStar details with the Companies Registrar details.

[Data] [Scraper] Calcalist manning announcements

See this page:
www.calcalist.co.il/local/home/0,7340,L-3789,00.html

We want to scrape the data here, similar to what we're doing with the TheMarker page:
https://github.com/OpenBudget/budgetkey-data-pipelines/blob/master/budgetkey_data_pipelines/pipelines/people/appointments/media/pipeline-spec.yaml
(although scraping mechanism will probably be different)

Step 1: Scrape the data and store it individually, using the same schema as the TheMarker source
Step 2: Clean duplicates between the two sources.
Step 3: Make sure that each item in the result contains an indication (or link) to its sources

[Data][Scraper] Cooperatives scraper generates duplicate key

http://next.obudget.org/pipelines/#anchor-FAILED-entities-cooperatives-cooperatives

dump.to_sql: Traceback (most recent call last):
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1159, in _execute_context
dump.to_sql:     context)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 467, in do_executemany
dump.to_sql:     cursor.executemany(statement, parameters)
dump.to_sql: psycopg2.IntegrityError: duplicate key value violates unique constraint "cooperatives_pkey"
dump.to_sql: DETAIL:  Key (id)=(570000018) already exists.
dump.to_sql: The above exception was the direct cause of the following exception:
dump.to_sql: Traceback (most recent call last):
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/specs/../lib/dump/to_sql.py", line 134, in <module>
dump.to_sql:     SQLDumper()()
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/lib/dump/dumper_base.py", line 33, in __call__
dump.to_sql:     self.stats)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/wrapper/wrapper.py", line 60, in spew
dump.to_sql:     for rec in res:
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/lib/dump/dumper_base.py", line 69, in hasher
dump.to_sql:     for row in resource:
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/datapackage_pipelines/lib/dump/dumper_base.py", line 59, in row_counter
dump.to_sql:     for row in resource:
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/jsontableschema_sql/writer.py", line 57, in write
dump.to_sql:     for wr in self.__insert():
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/jsontableschema_sql/writer.py", line 75, in __insert
dump.to_sql:     statement.execute(self.__buffer)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/sql/base.py", line 386, in execute
dump.to_sql:     return e._execute_clauseelement(self, multiparams, params)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
dump.to_sql:     compiled_sql, distilled_params
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
dump.to_sql:     context)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
dump.to_sql:     exc_info
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
dump.to_sql:     reraise(type(exception), exception, tb=exc_tb, cause=cause)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 186, in reraise
dump.to_sql:     raise value.with_traceback(tb)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1159, in _execute_context
dump.to_sql:     context)
dump.to_sql:   File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 467, in do_executemany
dump.to_sql:     cursor.executemany(statement, parameters)
dump.to_sql: sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "cooperatives_pkey"
dump.to_sql: DETAIL:  Key (id)=(570000018) already exists.
dump.to_sql:  [SQL: 'INSERT INTO cooperatives (id, name, registration_date, phone, primary_type_id, primary_type, secondary_type_id, secondary_type, legal_status_id, legal_status, last_status_date, type, municipality_id, municipality, inspector, address) VALUES (%(id)s, %(name)s, %(registration_date)s, %(phone)s, %(primary_type_id)s, %(primary_type)s, %(secondary_type_id)s, %(secondary_type)s, %(legal_status_id)s, %(legal_status)s, %(last_status_date)s, %(type)s, %(municipality_id)s, %(municipality)s, %(inspector)s, %(address)s)'] [parameters: ({'id': '570000018', 'name': 'נחלת ישראל רמה אגודה שתופית בע"מ (במחיקה)', 'registration_date': datetime.datetime(1921, 2, 6, 0, 0), 'phone': '', 'primary_type_id': '43', 'primary_type': 'שיכון', 'secondary_type_id': '61', 'secondary_type': 'שיכון', 'legal_status_id': '23', 'legal_status': 'הודעה שניה על מחיקה', 'last_status_date': datetime.datetime(1970, 1, 29, 12, 0), 'type': 'התאחדות האיכרים', 'municipality_id': None, 'municipality': '', 'inspector': 'צוק חיים', 'address': 'דופקר ++++'}, {'id': '570000026', 'name': 'הלואה וחסכון ירושלים אגודה שיתופית בע"מ', 'registration_date': datetime.datetime(1921, 3, 24, 10, 25), 'phone': '02-6234432', 'primary_type_id': '44', 'primary_type': 'אשראי וחסכון', 'secondary_type_id': '63', 'secondary_type': 'אשראי', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1921, 3, 24, 8, 13, 46), 'type': '', 'municipality_id': '3000', 'municipality': 'ירושלים', 'inspector': 'בן-חמו יוסף', 'address': 'קרן היסוד 41  ירושלים 94188 ת.ד:  2575'}, {'id': '570000034', 'name': 'הלואה וחיסכון זכרון יעקב אגודה הדדית בע"מ', 'registration_date': datetime.datetime(1921, 10, 16, 0, 0), 'phone': '', 'primary_type_id': '44', 'primary_type': 'אשראי וחסכון', 'secondary_type_id': '63', 'secondary_type': 'אשראי', 'legal_status_id': '26', 'legal_status': 'אגודה מוזגה - מבוטלת', 'last_status_date': datetime.datetime(1977, 1, 29, 12, 0), 'type': '', 'municipality_id': '9300', 'municipality': 'זכרון יעקב', 'inspector': 'חליל יעקב', 'address': 'הנציב       זכרון יעקב     מיקוד ת.ד:  8'}, {'id': '570000042', 'name': 'קיבוץ איילת השחר', 'registration_date': datetime.datetime(1921, 12, 19, 12, 52), 'phone': '04-6932111', 'primary_type_id': '40', 'primary_type': 'חקלאות', 'secondary_type_id': '74', 'secondary_type': 'קיבוץ מתחדש', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1921, 12, 19, 13, 27, 17), 'type': 'תנועה קבוצית מאוחדת  תק"מ', 'municipality_id': '77', 'municipality': 'איילת השחר', 'inspector': 'מור טל', 'address': 'ד.נ. 
גליל עליון   איילת השחר 12200'}, {'id': '570000059', 'name': 'אגודה שיתופית לעזרה הדדית ברחובות בע"מ', 'registration_date': datetime.datetime(1921, 12, 19, 0, 0), 'phone': '', 'primary_type_id': '44', 'primary_type': 'אשראי וחסכון', 'secondary_type_id': '63', 'secondary_type': 'אשראי', 'legal_status_id': '25', 'legal_status': 'אגודה בוטלה  לאחר פירוק', 'last_status_date': datetime.datetime(1987, 5, 21, 12, 0), 'type': '', 'municipality_id': '8400', 'municipality': 'רחובות', 'inspector': 'יהודה משה', 'address': 'רחובות   רחובות'}, {'id': '570000067', 'name': 'הכפר העברי - אגודה הדדית בע"מ', 'registration_date': datetime.datetime(1922, 2, 2, 0, 0), 'phone': '02-5700756', 'primary_type_id': '40', 'primary_type': 'חקלאות', 'secondary_type_id': '54', 'secondary_type': 'אגודה חקלאית כללית', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1922, 2, 2, 12, 0), 'type': '', 'municipality_id': '3000', 'municipality': 'ירושלים', 'inspector': 'בן-חמו יוסף', 'address': 'נמצא אצל אבנאור שמואל, אביזהר, אחוזת בית הכ 8 כניסה: 247 ירושלים 96267'}, {'id': '570000075', 'name': 'החקלאית אג"ש לבטוח ולשרותים וטרינריים למקנה בישראל בעמ', 'registration_date': datetime.datetime(1922, 4, 11, 0, 0), 'phone': '04-6279600', 'primary_type_id': '40', 'primary_type': 'חקלאות', 'secondary_type_id': '57', 'secondary_type': 'ביטוח חקלאי', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1922, 4, 11, 12, 0), 'type': '', 'municipality_id': '1167', 'municipality': 'קיסריה', 'inspector': 'שרעבי מזל', 'address': 'הברקת 20  קיסריה 38900 ת.ד:  3039'}, {'id': '570000083', 'name': 'קופת מלוה חקלאית לגמ"ח אגודה שיתופית פתח תקוה בע"מ', 'registration_date': datetime.datetime(1922, 6, 29, 0, 0), 'phone': '', 'primary_type_id': '45', 'primary_type': 'תגמולים פנסיה ועזה"ד', 'secondary_type_id': '63', 'secondary_type': 'אשראי', 'legal_status_id': '25', 'legal_status': 'אגודה בוטלה  לאחר פירוק', 'last_status_date': datetime.datetime(1992, 12, 14, 12, 0), 'type': '', 'municipality_id': '7900', 'municipality': 'פתח תקוה', 'inspector': '', 'address': 'מונטיפיורי   14  פתח תקוה 49364'}  ... displaying 10 of 1001 total bound parameter sets ...  {'id': '570010009', 'name': 'הכורם הצעיר אגודה שיתופית חקלאית להספקת מים ברחובות בע"מ', 'registration_date': datetime.datetime(1950, 10, 23, 0, 0), 'phone': '08-9461817', 'primary_type_id': '40', 'primary_type': 'חקלאות', 'secondary_type_id': '56', 'secondary_type': 'אספקת מים', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1950, 10, 23, 12, 0), 'type': '', 'municipality_id': '8400', 'municipality': 'רחובות', 'inspector': 'בר-נתן יונה', 'address': 'הרצל 143  רחובות 76266'}, {'id': '570010017', 'name': 'מעונות עובדי קופת חולים ב\' בגבעתיים אגודה שיתופית בע"מ', 'registration_date': datetime.datetime(1950, 10, 23, 0, 0), 'phone': '03-6250814', 'primary_type_id': '43', 'primary_type': 'שיכון', 'secondary_type_id': '61', 'secondary_type': 'שיכון', 'legal_status_id': '10', 'legal_status': 'אגודה פעילה', 'last_status_date': datetime.datetime(1950, 10, 23, 12, 0), 'type': '', 'municipality_id': '6300', 'municipality': 'גבעתיים', 'inspector': 'בר-נתן יונה', 'address': 'נמצא אצל בוריס בריק, מצולות ים 14  גבעתיים 53486'})]
(sink): /usr/local/lib/python3.6/site-packages/jsontableschema/model.py:48: UserWarning: Class models.SchemaModel is deprecated [v0.7-v1)
(sink):   warnings.warn(message, UserWarning)
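
Whatever the scraper bug turns out to be (the duplicate id suggests a page is being read twice), a guard step before dump.to_sql that drops repeated ids would keep the pipeline from failing - a sketch, not necessarily the right fix:

    def unique_by_id(rows):
        seen = set()
        for row in rows:
            if row['id'] not in seen:      # keep the first occurrence, drop repeats
                seen.add(row['id'])
                yield row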

Ottoman Associations Scraping

The list of Ottoman Associations can be found here: https://foi.gov.il/he/node/1908

You need to write a pipeline that converts the Excel file into a simple tabular resource with 3 columns:

  • id
  • name
  • address

The pipeline should be named 'ottoman-associations' and be located under entities/
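
A minimal sketch of the conversion, assuming the Excel sheet has the three columns in that order with one header row (openpyxl assumed; any tabular loader would do):

    from openpyxl import load_workbook

    def ottoman_rows(path):
        sheet = load_workbook(path, read_only=True).active
        for row in sheet.iter_rows(min_row=2, values_only=True):   # skip the header row
            yield dict(zip(('id', 'name', 'address'), row))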

[Data] [Enrichment] Geocode addresses of entities

Add a processor that does some geocoding for addresses of entities.

  • Make sure requests are cached locally, so that multiple requests for the same address are resolved without hitting the network (a caching sketch follows this list)
  • Collect statistics on misses
  • Employ heuristics to improve the geolocation success rate without causing false positives
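
A sketch of the caching requirement, assuming geopy's Nominatim as the geocoder (the provider choice is open; the cache path is illustrative):

    import shelve
    from geopy.geocoders import Nominatim

    geocoder = Nominatim(user_agent='budgetkey-data-pipelines')

    def geocode(address, cache_path='/var/geocode-cache'):
        with shelve.open(cache_path) as cache:
            if address not in cache:                 # miss: hit the network once
                location = geocoder.geocode(address)
                cache[address] = ((location.latitude, location.longitude)
                                  if location else None)
            return cache[address]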

investigate discrepancy in tender scraper row counts

Logged number of rows from pipeline

scraper-office

add_publisher_urls_resource: INFO :Processed 10362 rows
dump.to_path: INFO :Processed 10362 rows

scraper-exemptions

add_publisher_urls_resource: INFO :Processed 94878 rows
dump.to_path: INFO :Processed 94878 rows

scraper-central

add_central_urls_resource: INFO :Processed 123 rows
dump.to_path: INFO :Processed 123 rows

count of rows in DB:

TENDER_TYPE   COUNT
office        10,377
exemptions    94,898
central       1

[Data] [Quality] Debug problematic spending reports

There are about 300 spending reports coming from the government, each with a slightly different format, and there's a pipeline that's supposed to consolidate them into one (it's everything in procurement/spending).

But - it doesn’t always succeed and there are errors in the process:

  • sometimes because the original data is shit
  • sometimes because our processor misses some bits

there’s this report which is generated during the process: http://data.obudget.org/queries/1153/source

At the moment there are 59 problematic reports:

  • some are okay, but only a few rows are bad.
  • some are completely erroneous

The task is to:

  • Go over each of the bad reports
  • Understand what the problem is and either:
    a. fix the processor
    b. decide that the data is shit (unrecoverable)
    c. fix the data manually and add a 'manual' source to the pipeline

Also - record the decisions somehow (maybe incorporate them in the pipeline) so that in the future we can query only the reports which have not yet been analysed.

[Data] [Quality] Research entity name resolution misses

We try to find the proper entity in the official registries when all we have is a name (e.g. in the supports dataset, in exemptions, and sometimes with contract spending).

Most of the time we succeed, but not always.

This task is to understand why we miss and improve the detection rate.
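
Name normalization is usually the first thing to check. A sketch of the kind of cleanup that tends to raise the hit rate (the suffix list is illustrative, not the repo's current logic):

    import re

    LEGAL_SUFFIXES = ('בע"מ', '(חל"צ)', '(ע"ר)')    # legal-form suffixes; list is illustrative

    def normalize_name(name):
        for suffix in LEGAL_SUFFIXES:
            name = name.replace(suffix, ' ')
        name = re.sub(r'[^\w\s]', ' ', name)         # drop punctuation and quote marks
        return re.sub(r'\s+', ' ', name).strip()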

pulling code from the remote (OpenBudget) master branch into my origin master branch should not cause a merge commit

look at this merge commit - OriHoch@28bb9d8

it happened when I pulled the latest changes from the OpenBudget master branch into my master branch

my master branch didn't have any local changes, so it should have been a fast-forward without a merge commit

the merge commit is for these 2 commits:

  • cf8095d - this commit is on the OpenBudget master branch
  • f2899e6 - this commit is not on the OpenBudget master branch

however, there is a commit with the same title and timestamp which is on the OpenBudget master branch:
2dcd32e

not sure exactly what's going on here - a commit with the same title and timestamp but a different hash is what a rebase produces, which would also explain why the pull could not fast-forward

@akariv - is it possible you are editing history?

should add exemptions (פטור ממכרז) to pipelines (based on the legacy tenders/exemption_scraper)

Exemption scraper

Exemptions are records for planned purchases of the government that are exempt from a tender process.

The website to scrape is quite slow and buggy, so we want to scrape frequently and only fetch the most recent records (for example, scrape every few hours and get a day's worth of records).
(For initialisation there should also be a one-time-run pipeline that scrapes the entire website.)
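
One way to read "a day's worth", assuming the site returns records newest-first with a publication date (both assumptions):

    from datetime import datetime, timedelta

    def recent_records(pages, days=1):
        """pages: iterator of result pages, newest records first (an assumption)."""
        cutoff = datetime.now() - timedelta(days=days)
        for page in pages:
            fresh = [r for r in page if r['publication_date'] >= cutoff]
            for record in fresh:
                yield record
            if len(fresh) < len(page):         # we crossed the cutoff: stop paging
                return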

Data is written to a DB table (exemptions) in update mode, so that records with the same publication id are updated - data is appended and not replaced.

Scraped data includes the basic list of fields for each exemption record, as well as the list of documents for the record (which should be saved as a JSON object).

There might be issues with scraping from non-Israeli IPs - talk to me if you encounter such issues.


Ref:
Legacy scraper folder: https://www.dropbox.com/sh/xan213ilqrdh43p/AAC1tSoTiH1rnMOue7pBJyhHa?dl=0
Legacy scraper is in tenders/exemption_scraper.py

tenders exemptions should parse signed documents

reproduction

expected

  • should get the document

actual

notes

Guidestar scraper: Add association endpoints

Build a pipeline to scrape the following data from the GuideStar association item page:

  • Number of workers (integer)

  • Number of volunteers (integer)
    (check this link for a "No report" placeholder text example)

  • Activity location(s) (strings array)

  • Annual turnover (integer/float)

  • Salaries (+ names) - up to 5 positions (position (string), amount (integer/float))

[Data][Processing] Analyze support criteria titles to extract extra properties

We scrape a list of support criteria documents (מבחני תמיכה).

We want to analyze the titles of these documents (which usually follow a very similar template) and extract properties such as:

  • Funding period
  • Amendment yes/no
  • Purpose of the support
  • etc.

Tasks:

  • Take the list of titles from the query http://data.obudget.org/queries/1203/source
  • Write some Python code to analyze the titles and extract the different attributes out of them (a starting sketch follows this list)
  • Incorporate your code into the pipeline at supports/criteria
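
A starting-point sketch for the extraction - the patterns below are guesses read off the sample titles that follow, not a complete solution:

    import re

    def analyze_title(title):
        year_match = re.search(r'לשנת(?: התקציב)? (\d{4})', title)   # "for the (budget) year NNNN"
        return {
            'is_amendment': title.startswith('תיקון'),   # amendment titles open with "תיקון"/"תיקונים"
            'year': int(year_match.group(1)) if year_match else None,
            # the purpose usually follows the ministry clause; extracting it cleanly
            # needs more patterns than shown here
        }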

Sample of such titles:

  • תיקון למבחנים של משרד הבריאות לצורך תמיכה בבתי חולים ציבוריים במצוקה
  • מבחנים לצורך תמיכה של משרד החינוך העוסקים בהפעלת מתנדבים במערכת החינוך
  • מבחנים לצורך תמיכה של משרד החינוך בפעילות שוטפת של כפרי סטודנטים
  • מבחנים לחלוקת כספים לצורך תמיכה של המשרד להגנת הסביבה במוסדות ציבורי העוסקים בטיפול דחוף בקופים הנמצאים בסכנת קיום
  • תיקון למבחנים לחלוקת כספי תמיכות של משרד הרווחה והשירותים החברתיים לשיפורי מיגון ובינוי לצורך מיגון מסגרות רווחה חוץ–ביתיות
  • מבחנים לתמיכה של משרד התרבות והספורט במוסדות המשמשים כארגון גג בתחום התרבות
  • מבחנים לחלוקת כספי תמיכות של משרד התרבות והספורט במוסדות ציבור המקיימים פסטיבלים בתחום אמנויות הבמה
  • מבחנים לתמיכה של משרד התרבות והספורט למוסדות ציבור בתחום יוצרי מחול עצמאיים, פרויקטים בתחום המחול ומרכזי מחול
  • תיקון למבחנים לחלוקת כספי תמיכות של משרד התרבות והספורט במוסדות תרבות העוסקים בהוראת האמנויות
  • מבחנים למתן תמיכות של משרד העלייה והקליטה למוסדות ציבור המבצעים פעולות לקליטה בקהילה
  • תיקון למבחנים לחלוקת כספי תמיכות של משרד התרבות והספורט למוסדות ציבור בתחום המוזאונים המוכרים
  • תיקון למבחנים לחלוקת כספי תמיכה של משרד התרבות והספורט בספריות המיועדות לעיוורים ולליקווי ראייה
  • תיקון למבחנים לתמיכה של משרד התרבות והספורט בתחרויות מוסיקה בין–לאומיות
  • תיקון למבחנים של משרד הבריאות לצורך תמיכה בהוצאות הניתוחים והאשפוז של קופות החולים המבצעות תכנית לקיצור תורים לשנת התקציב 2016
  • תיקון למבחנים למתן תמיכות של משרד הרווחה והשירותים החברתיים למוסדות ציבור
  • תיקון למבחנים למתן תמיכות של משרד הרווחה והשירותים החברתיים למוסדות ציבור המפעילים תכניות למתנדבים השייכים לאוכלוסיות הרווחה לצורך שיקומם
  • תיקונים למבחנים לחלוקת כספים לצורך תמיכה של משרד החינוך במוסדות תורניים -לימוד ופעולות
  • מבחנים לצורך תמיכה של משרד החינוך בקונסרבטוריונים
  • מבחן תמיכה של משרד החינוך במוסדות ציבור העוסקים בתחום מורשת כוחות המגן והמחתרות שפעלו בתקופה שקדמה להקמת המדינה
  • מחנים למתן תמיכות של משרד התרבות והספורט במכוני מחקר תורניים בעלי חשיבות לאומית

[Data] [Infra] procurement-tenders-exemptions should update existing items (periodically or by other logic)

reproduction

  • procurement-tenders-exemptions pipeline runs daily and inserts new exemption (after #13 is merged)
  • after a week an exemption's details are updated

expected

  • the updated exemption's details should be updated

actual

  • only new exemptions are inserted (existence is determined by ID)

notes

  • should consider some logic for how / when to update
  • @akariv suggested a solution based on the exemption status:
    • This will allow more correct behaviours in the future, such as periodically re-scraping records that are still "in progress" (i.e. data might still be changing) and not scraping records that have been concluded.

  • a solution which doesn't rely on the underlying data (sketched below):
    • have a last-scrape-date column in the DB
    • on every daily run (or a different, maybe hourly, run) - update a limited number of exemptions, starting with those scraped longest ago
    • on every updated exemption the last scrape date is updated
    • that way, it will slowly go over all exemptions and update them
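
A sketch of that last-scrape-date rotation (table and column names, the env var, and the batch size are all assumptions):

    import os
    from sqlalchemy import create_engine

    BATCH_SIZE = 100                                       # exemptions refreshed per run
    engine = create_engine(os.environ['DPP_DB_ENGINE'])    # hypothetical env var

    stale = engine.execute(
        'SELECT publication_id FROM exemptions '
        'ORDER BY last_scrape_date ASC NULLS FIRST '
        'LIMIT %(n)s', {'n': BATCH_SIZE})
    for (publication_id,) in stale:
        # re-scrape this exemption, then set its last_scrape_date to now()
        pass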

Monetary Change transaction grouping

Monetary changes are records of single changes to 3rd-level budget items (a.k.a. "6-digit" items).
During the year, the national budget is modified, and these records represent the individual modifications (some of these changes are brought to the Knesset Finance Committee for approval, some are not).

We download the data and store it in a DB (see here: http://data.obudget.org/queries/1061/source)

The pipeline for the downloading and processing the data is here: https://github.com/OpenBudget/budgetkey-data-pipelines/tree/master/budgetkey_data_pipelines/pipelines/budget/national/changes/original

All changes are part of a request - a request is uniquely identified by the (leading_item, req_code) fields. Requests that are brought to the committee for approval also have a non-zero committee_id.

Some of these requests are part of a larger transaction - for example, when moving funds from one ministry to another, we'll see a request for moving funds from ministry A to the general reserve and another request for moving the same amount from the general reserve to ministry B.

The goal of this issue is to assign a 'transaction id' to all changes - the same transaction id should be assigned to all changes that belong to the same transaction.

You should create a pipeline and processors in budget/national/changes/processed/
The pipeline should be called 'transactions' and should generate a 3-column file (a naive grouping sketch follows the column list):

  • leading_item
  • req_code
  • transaction_id
    with one row per transaction_id.
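
A deliberately naive grouping heuristic, just to make the output shape concrete - the legacy algorithm in tenders/extract_change_groups.py is the reference and likely does something smarter, such as actually matching flows through the general reserve (field names here are assumptions):

    from collections import defaultdict

    def assign_transaction_ids(changes):
        """changes: dicts with leading_item, req_code, date and net_amount fields."""
        groups = defaultdict(set)
        for change in changes:
            # naive rule: same date + same absolute amount => same transaction
            key = (change['date'], abs(change['net_amount']))
            groups[key].add((change['leading_item'], change['req_code']))
        for transaction_id, requests in enumerate(groups.values(), start=1):
            for leading_item, req_code in sorted(requests):
                yield {'leading_item': leading_item,
                       'req_code': req_code,
                       'transaction_id': transaction_id}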

Ref:
Legacy processing code folder: https://www.dropbox.com/sh/xan213ilqrdh43p/AAC1tSoTiH1rnMOue7pBJyhHa?dl=0
Legacy algorithm is in tenders/extract_change_groups.py

should ensure same version of datapackage-pipelines dependency runs on all environments

preconditions

  • a new datapackage-pipelines version was published to pypi
  • but the Dockerfile was not updated - so the image contains the previous version
  • however, the code in budgetkey-data-pipelines relies on the latest version and breaks on the previous datapackage-pipelines version

reproduction steps

  • run the budgetkey-data-pipelines docker image

expected

  • docker should run with the latest pypi version, which the code was tested against

actual

  • the code fails because it runs with the previous datapackage-pipelines version from docker
  • we pull the latest docker image, which has the previous datapackage-pipelines version:
    • FROM frictionlessdata/datapackage-pipelines:latest
  • budgetkey-data-pipelines is installed like this:
    • RUN sudo pip install -e /
  • setup.py doesn't contain the datapackage-pipelines dependency, so pip won't upgrade it (no --upgrade param)
  • we end up with the previous datapackage-pipelines version (fix sketched below)
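
The straightforward fix is to declare and pin the dependency in setup.py, so pip resolves the same version in every environment (the version number below is illustrative):

    # setup.py (excerpt)
    from setuptools import setup, find_packages

    setup(
        name='budgetkey-data-pipelines',
        packages=find_packages(),
        install_requires=[
            'datapackage-pipelines==1.0.8',   # exact pin is illustrative; the point is pinning
        ],
    )

Pinning the base image to a version tag instead of :latest (if tagged images are published) would stop the Docker layer from drifting as well.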

Social Map save past data

Update the data structure to include past data (an archive) per organization. We need to keep an archive of past data (at year resolution) for the following:

  • annual turnover
  • number of workers
  • number of volunteers
  • Nihul Takin
  • Salaries

We can run the archiving process in real time (whenever a specific entry is updated) or once a month. It would be better to keep a whole snapshot of all relevant entries at a certain point in time, to avoid misunderstandings and adjustments in data interpretation.
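
One possible shape for the per-year archive (field names and values are illustrative):

    def archive_snapshot(org, year, snapshot):
        """Attach a per-year snapshot to the organization record."""
        org.setdefault('history', {})[str(year)] = snapshot

    archive_snapshot(org={}, year=2016, snapshot={
        'annual_turnover': 1200000,   # values illustrative
        'workers': 14,
        'volunteers': 40,
        'nihul_takin': True,          # ניהול תקין (proper management) certificate held
        'salaries': [],               # up to 5 (position, amount) pairs
    })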

Government Company Registrar

http://mof.gov.il/GCA/CompaniesInformation/Pages/default.aspx
http://mof.gov.il/GCA/CompaniesInformation/Pages/CompanyInfo.aspx?k=483

  • Build a pipeline for scraping the list of government owned companies: name, details page URL
  • Build a dependent pipeline to scrape the detail pages: any information that can be extracted from there, e.g.:
    • purpose,
    • officers,
    • directors,
    • etc.
  • Build a dependent pipeline that reads the activity summary PDFs and extracts the text within them - especially the company's registration id, ownership in other companies, etc.
