sledilnik / data

Collecting and organising COVID-19 data for Slovenia as they come in from various sources

Home Page: https://covid-19.sledilnik.org/en/data

License: GNU Affero General Public License v3.0

Shell 0.19% Python 2.26% Perl 0.15% HTML 97.40%

covid19-data slovenia slovenija covid19-tracker covid-19-tracker covid-data-project covid-dataset covid-19 covid19

data's Introduction

Slovenia COVID-19 Data Collection - Sledilnik.org

Visualized at COVID-19 Sledilnik Home Page

Collecting and organising data as they come in from various sources.

This repository is for organising our collaboration better: wikis, issues etc.

Python 3.8+ is required to run scripts in this repo.

Vaccination update depends on py-cepimose with a specific subset of

How to run scripts

In this folder run:

python3 -m venv venv    # or: virtualenv -p python3.8 venv
source venv/bin/activate
pip install -r requirements.txt
export COVID_DATA_PATH=<the location of the COVID-DATA folder>
python update.py    # or: python transform/nijz_daily.py, python transform/nijz_weekly.py, ...

Updating data

Most GitHub workflows are scheduled to run periodically and can also be triggered manually on the Actions page.

Changelog

2020-04-28

stats.csv: rename cases.active.todate to cases.active issue #11

2020-04-26

stats.csv: added tests.regular.* and tests.ns-apr20.* to separate tests for National Survey April 2020
stats.csv: changed tests.positive.* to report positive actual tests (slightly higher than cases.confirmed.*)

2020-04-25

dict-municipality.csv: fixed region for Gornja Radgona (was lj - is ms now)

2020-04-20

dict-age-groups.csv: age groups with population (total, male, female)

2020-04-18

dict-retirement_homes.csv: added tax-id for each retirement home

data's Issues

EPI: XLS to GDocs

On 15 October, NIJZ switched from a DOC report to an XLS report.

The data is currently entered manually into GDocs, into the following tables:

Podatki: total number of confirmed cases (tb1)
EPI: by age group - confirmed infected (tb4), deceased (tb6)
Kraji: confirmed infected by municipality (tb2)
Umrli:Kraji: deceased by municipality (tb6)

A script could add these fields to GDocs and copy the additional formulas.

For more information: @lukarenko, @kesma01 or @matejmeglic

Sewage.csv: remove sparse rows

As the NIB data is very sparse, we should filter out all rows that contain no data.
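
A minimal sketch of such a filter, assuming that "zero data" means rows whose value columns are all empty and that rows are read as strings (for example with csv.DictReader). The column names in the usage note are hypothetical, not the real Sewage.csv headers.

```python
def drop_empty_rows(rows, key="date"):
    """Keep only rows that have at least one non-empty value besides the key column.

    Values are expected to be strings or None, as produced by csv.DictReader.
    """
    return [
        row
        for row in rows
        if any((value or "").strip() for name, value in row.items() if name != key)
    ]
```

For example, given one row with a measurement in a made-up column `sewage.plant-a` and one row where every value column is empty, only the first row is kept.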

NIJZ XLS: age-confirmed.csv, age-deceased.csv

From the NIJZ XLS, process the age data from Tabela 4 into a new age-confirmed.csv with the following columns:

age.female.0-4.todate | age.female.5-14.todate | age.female.15-24.todate | age.female.25-34.todate | age.female.35-44.todate | age.female.45-54.todate | age.female.55-64.todate | age.female.65-74.todate | age.female.75-84.todate | age.female.85+.todate | age.female.todate 

age.male.0-4.todate | age.male.5-14.todate | age.male.15-24.todate | age.male.25-34.todate | age.male.35-44.todate | age.male.45-54.todate | age.male.55-64.todate | age.male.65-74.todate | age.male.75-84.todate | age.male.85+.todate | age.male.todate

age.unknown.0-4.todate | age.unknown.5-14.todate | age.unknown.15-24.todate | age.unknown.25-34.todate | age.unknown.35-44.todate | age.unknown.45-54.todate | age.unknown.55-64.todate | age.unknown.65-74.todate | age.unknown.75-84.todate | age.unknown.85+.todate | age.unknown.todate

age.0-4.todate | age.5-14.todate | age.15-24.todate | age.25-34.todate | age.35-44.todate | age.45-54.todate | age.55-64.todate | age.65-74.todate | age.75-84.todate | age.85+.todate | age.todate
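
The column list above follows a regular pattern, so a transform script would not need to hard-code all 44 names. A minimal sketch (the function and constant names are illustrative, not taken from the repository):

```python
AGE_GROUPS = ["0-4", "5-14", "15-24", "25-34", "35-44",
              "45-54", "55-64", "65-74", "75-84", "85+"]
SEXES = ["female", "male", "unknown"]


def age_columns():
    """Build the age.* column names in the order listed in this issue."""
    columns = []
    # Three per-sex blocks first, then the overall "age" block.
    for prefix in [f"age.{sex}" for sex in SEXES] + ["age"]:
        columns += [f"{prefix}.{group}.todate" for group in AGE_GROUPS]
        columns.append(f"{prefix}.todate")  # block total
    return columns
```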

After processing, we should also append a copy of the last row to age-deceased.csv (similar to what we do for deceased-region.csv in transform/region.py), so that both CSV files end on the same day.

municipality.csv: first column does not have "date" header

I have added it manually, but please fix the export.

Windows: nijz_daily.py fails on Windows

@AuroraBode is using Windows to run nijz_daily. The script fails with a KeyError on ajdovščina, so the suspected cause is an encoding issue.

INFO:C:\Users\Delo\Documents\GitHub\data\transform\nijz_daily.py:SOURCE_FILE: C:\Users\Delo\Documents\COVID-DATA\EPI\dnevni_prikazi20210225.xlsx
Traceback (most recent call last):
  File "C:\Users\Delo\Documents\GitHub\data\transform\nijz_daily.py", line 62, in <module>
    df = df.rename(mapper=get_municipality_header, axis='columns')  # transform of municipality names
  File "C:\Users\Delo\Documents\GitHub\data\venv\lib\site-packages\pandas\util\_decorators.py", line 309, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Delo\Documents\GitHub\data\venv\lib\site-packages\pandas\core\frame.py", line 4300, in rename
    return super().rename(
  File "C:\Users\Delo\Documents\GitHub\data\venv\lib\site-packages\pandas\core\generic.py", line 947, in rename
    new_index = ax._transform_index(f, level)
  File "C:\Users\Delo\Documents\GitHub\data\venv\lib\site-packages\pandas\core\indexes\base.py", line 4836, in _transform_index
    items = [func(x) for x in self]
  File "C:\Users\Delo\Documents\GitHub\data\venv\lib\site-packages\pandas\core\indexes\base.py", line 4836, in <listcomp>
    items = [func(x) for x in self]
  File "C:\Users\Delo\Documents\GitHub\data\transform\nijz_daily.py", line 52, in get_municipality_header
    region = municipalities[m]['region']
KeyError: 'ajdovščina'
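
If the suspicion is right, one plausible mechanism is that a lookup file is opened without an explicit encoding, so Windows falls back to its locale code page and the keys no longer match. The sketch below only illustrates that effect. The file name, the `id` column and the sample row are assumptions, and it is not known how nijz_daily.py actually loads its dictionary.

```python
import csv
import os
import tempfile


def load_municipalities(path, encoding="utf-8"):
    """Read a CSV with 'id' and 'region' columns into a dict keyed by id."""
    with open(path, newline="", encoding=encoding) as f:
        return {row["id"]: row for row in csv.DictReader(f)}


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file, written as UTF-8.
    path = os.path.join(tmp, "dict-municipality-sample.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write("id,region\najdovščina,ng\n")

    good = load_municipalities(path)                    # explicit UTF-8
    bad = load_municipalities(path, encoding="cp1250")  # a Windows code page

print("ajdovščina" in good)  # key is found
print("ajdovščina" in bad)   # key is mangled, so the lookup would raise KeyError
```

Passing `encoding="utf-8"` to every `open()` call makes the behaviour the same on Windows, Linux and macOS.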

NIJZ weekly: Deceased DSO/other/total by day into rh-deceased.csv

We should export NIJZ weekly deceased data from Tabela 3 into deceased-type.csv in the form:

date
deceased.rhoccupant.todate (calculate totals)
deceased.other.todate (calculate totals)
deceased.todate (calculate totals)
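
The "calculate totals" step can be sketched as a running sum over daily counts. This assumes Tabela 3 yields one row per day, sorted by date, with daily counts in two columns that are named here `deceased.rhoccupant` and `deceased.other` purely for illustration, and that the overall total is the sum of the two.

```python
from itertools import accumulate


def add_todate_totals(daily_rows):
    """Turn daily counts into cumulative .todate columns (rows sorted by date)."""
    rh = accumulate(row["deceased.rhoccupant"] for row in daily_rows)
    other = accumulate(row["deceased.other"] for row in daily_rows)
    return [
        {
            "date": row["date"],
            "deceased.rhoccupant.todate": r,
            "deceased.other.todate": o,
            "deceased.todate": r + o,  # assumption: total = occupants + other
        }
        for row, r, o in zip(daily_rows, rh, other)
    ]
```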

nijz_daily.py: vaccination.csv support

We should merge Tabela 2 from the NIJZ report (1st and 2nd dose) with the daily delivered data from the Vaccination GSheet.

This should replace manual copying of daily vaccination data from XLS to GSheet.

My suggestion is to:

move delivered data to E:Delivered
pull E:Delivered from GSheet + Tabela 2 (daily vaccination data)
calculate .todate and .used.todate fields

Daily NIJZ to tests-cases.csv (daily update from morning partial update)

This is a continuation of #60. It happens around 13:00-14:00, when the daily NIJZ report becomes available.
As described there, the morning processing adds tests.* and placeholders for cases.* in stats.csv.

When we get the NIJZ daily report, we can modify tests-cases.csv by doing the following:

copy Tabela 1 column N to cases.confirmed.todate and recalculate the cases.* formulas
use Tabela 5 column B to calculate the to-date numbers for cases.rh.occupant.confirmed.todate and recalculate the cases.* formulas

test_list_xlsx: one file is missing

def test_list_xlsx():
    actual = list_xlsx(dir=DATA_DIR)
    expected = [
        'health_centers_tests/data/HOS/Bolnišnice COVID 12052020.xlsx',
        'health_centers_tests/data/HOS/2020-04/Bolnišnice COVID 30042020.xlsx'
    ]  # this should be length of 3
    for a, e in zip(actual, expected):
        assert a.endswith(e)
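
Note that `zip` stops at the shorter of the two lists, so the test above passes even when a file is missing. A minimal sketch of a stricter comparison helper (the helper name is hypothetical), which works on the Python 3.8 this repository requires:

```python
def assert_paths_match(actual, expected):
    """Fail if the lists differ in length or any path has the wrong suffix."""
    assert len(actual) == len(expected), (
        f"expected {len(expected)} files, found {len(actual)}"
    )
    for a, e in zip(actual, expected):
        assert a.endswith(e), f"{a!r} does not end with {e!r}"
```

On Python 3.10 and later, `zip(actual, expected, strict=True)` gives a similar guarantee by raising ValueError on a length mismatch.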

Consolidate naming of NIJZ files

NIJZ does not name its XLS files consistently: the most obvious difference is that they sometimes use _ and sometimes do not. The nijz_{daily|weekly}.py script then processes the wrong file because of the order of files in the COVID-DATA/EPI folder.

Currently we work around this manually by renaming files to fix the sort order and re-running the script.

Thread where discussion started:
https://sledilnik.slack.com/archives/C01DV090C0G/p1611147960016200
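
One way to avoid the manual renaming would be to pick the newest file by the date embedded in its name instead of by alphabetical order. The sketch below assumes every EPI file name contains an 8-digit YYYYMMDD date, as in dnevni_prikazi20210225.xlsx from the traceback in the Windows issue. The variant with an extra underscore is a made-up example, and the function names are hypothetical.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"(\d{8})")


def report_date(filename):
    """Extract the YYYYMMDD date embedded in a report file name."""
    match = DATE_RE.search(filename)
    if match is None:
        raise ValueError(f"no date found in {filename!r}")
    return datetime.strptime(match.group(1), "%Y%m%d").date()


def latest_report(filenames):
    """Return the file with the most recent embedded date."""
    return max(filenames, key=report_date)


files = ["dnevni_prikazi20210225.xlsx", "dnevni_prikazi_20210224.xlsx"]
print(max(files))            # plain string order picks the older file
print(latest_report(files))  # date order picks the newer file
```

Files that use another date layout would need their own pattern.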

Dockerization

Windows: update.py corrupted čšž in safety_measures.csv

@AuroraBode is using Windows to do daily updates. There seems to be an encoding issue: when run on a Windows machine, the update.py script corrupts safety_measures.csv on export.

3561eda

Tests to stats.csv morning update

Every morning around 9:00 we get only two numbers from NIJZ:

tests.regular.performed
tests.regular.positive

This data is entered into the Tests GSheet, where we keep historical per-lab test data. The GSheet is exported to lab-tests.csv via update.py.

This data is also used to fill stats.csv via the legacy Podatki GSheet.

We should introduce a new tests-cases.csv, which is made of:

old legacy data from stats-legacy.csv (which should be a static file from now on)
newly created data from lab-tests.csv (tests.regular.performed, tests.regular.positive); all other fields are calculated
cases.confirmed.todate, which is simply calculated as the previous day's value + tests.regular.positive; all other fields are calculated
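
The running calculation of cases.confirmed.todate can be sketched as follows. The function name and the `start_total` parameter are illustrative. `start_total` stands for the last cumulative value carried over from stats-legacy.csv, and rows are assumed to be sorted by date.

```python
def fill_confirmed_todate(rows, start_total=0):
    """Add cases.confirmed.todate = previous day's total + tests.regular.positive."""
    total = start_total
    filled = []
    for row in rows:
        total += row["tests.regular.positive"]
        # Copy the row so the input list is left unchanged.
        filled.append({**row, "cases.confirmed.todate": total})
    return filled
```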

How often are the CSV files on GitHub updated?

It would make sense for them to be updated automatically several times a day, so that they stay in sync with Google Docs.

Otherwise, thank you for this project. We use your data at https://ustavimokorono.si/.

HC: missing 30.4. for SB CE and NM

There is no data for some days:

before 13.4.
13, 15, 17, 18, 23, 25, 30.4.

CSV rename

region-confirmed.csv (change name from regions-cases.csv)
region-active.csv (change name from active-regions.csv)
region-deceased (as is)
region-cases.csv = join all three region tables (not municipalities cases as is (?))
municipality-confirmed.csv (currently regions.csv)
municipality-active.csv (currently inside municipality.csv)
municipality-deceased-legacy.csv (currently inside municipality.csv, no changes after 12.12)
municipality-cases.csv = join all three municipality tables (like now, but censor deceased after 12.12)

HOS: XLS to GDocs

We receive the HOS report daily in XLS format.
Since 1 October it has been collected through an application, so the XLS format has also become stable, as there are no more manual entries.

The data is currently entered manually into the GDocs HOS table, and it indirectly fills the Pacienti table, which is used for the export to patients.csv.

Idea:

daily HOS XLS processing that adds a new row to the HOS table (possibly also a new row with formulas in Pacienti)
someone still has to review the result, because of discrepancies in the admitted/discharged numbers and the reconciliation with the ICU report/table

More information about HOS: @lukarenko or Maja Založnik

age-deceased.csv from NIJZ weekly report

The NIJZ weekly deceased (umrli) XLSX report has deceased data by date for regions and ages.

We should generate age-deceased.csv from Tabela 5.

We can generate regions-deceased.csv from Tabela 4.

Implement support for OurWorldInData

The data can be fetched from their Github repository: https://github.com/owid/covid-19-data/tree/master/public/data - it is available both in CSV and JSON formats.

The existing use cases we need covered (@joahim please verify this):

Get data for all dates for a specific list of countries.
Get data starting from a specific date, for a specific list of countries.
Get data starting from a specific date, for all countries.

Countries are specified by their ISO codes.

Data columns we currently need:

"date"
"iso_code"
"new_cases"
"new_cases_per_million"
"total_cases"
"total_cases_per_million"
"total_deaths"
"total_deaths_per_million"

The format of the request should be something simple; there is no need for a generic query mechanism. I would suggest offering the following parameters (whether in a query string or input JSON, I don't know):

List of country ISO codes. If not specified, data for all countries is returned.
Starting and ending dates (both are optional).
Output format (requested by @joahim): JSON or CSV.

Also, please leave an option for the null properties in JSON to be skipped (not right now, since our parser currently doesn't support it - I think).
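
A minimal sketch of the filtering described above, operating on rows that have already been loaded (for example with csv.DictReader or json.load). The function names and the `skip_nulls` flag are hypothetical, and no claim is made about how the actual service should expose these parameters.

```python
import json
from datetime import date

COLUMNS = [
    "date", "iso_code", "new_cases", "new_cases_per_million",
    "total_cases", "total_cases_per_million",
    "total_deaths", "total_deaths_per_million",
]


def select_rows(rows, iso_codes=None, start=None, end=None):
    """Filter rows by ISO codes and an optional inclusive date range."""
    wanted = set(iso_codes) if iso_codes else None  # None means all countries
    selected = []
    for row in rows:
        day = date.fromisoformat(row["date"])
        if wanted is not None and row["iso_code"] not in wanted:
            continue
        if start is not None and day < start:
            continue
        if end is not None and day > end:
            continue
        # Keep only the columns listed in this issue.
        selected.append({name: row.get(name) for name in COLUMNS})
    return selected


def to_json(rows, skip_nulls=False):
    """Serialise rows, optionally dropping properties whose value is None."""
    if skip_nulls:
        rows = [{k: v for k, v in row.items() if v is not None} for row in rows]
    return json.dumps(rows)
```

The three use cases map to `select_rows(rows, iso_codes=[...])`, `select_rows(rows, iso_codes=[...], start=...)` and `select_rows(rows, start=...)`.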

cc: @joahim, @MihaMarkic, @lukarenko, @stefanb

stats.csv: rename cases.active.todate to cases.active

We have now added estimated recovered, closed (recovered+deceased) and active cases.

cases.confirmed.todate
cases.recovered.todate = cases.confirmed.todate(today) - cases.confirmed.todate(-21days)
cases.closed.todate = cases.recovered.todate + state.deceased.todate
cases.active = cases.confirmed.todate - cases.closed.todate

As cases.active is the current (today's) state, we should remove the incorrect .todate suffix.

Note: as these fields were not used much before, we do not expect any breakage.

Automation of daily stats.csv processing with NIJZ daily report

stats.csv is currently exported directly from old GSheet

We update stats.csv three times during the day as the data becomes available:

1. LAB data (9:00 update)
The data is entered in Tests GSheet and exported to lab-tests.csv with update.py script

After export, we need to add the previous day's data to stats.csv (the data is added to the existing row):

tests data: tests.* columns D-M (N-R are not used anymore and left empty)
active cases: cases.* columns S-AC (S value (active cases) is calculated as S(day-1)+Positive(day))

2. HOS + ICU + deaths (10:30 update)
Patients data is collected in Patients GSheet and is exported to patients.csv via update.py script.

After export, we need to add the current day to stats.csv (a new row is added):

patients data: state.* columns AL-AS

3. EPI data (final update - around 13:30)
New NIJZ report in XLS is computer generated and ready to be parsed automatically and converted to individual .CSV files.

The following data needs to be extracted:

Municipalities confirmed: transform/regions.py script creates regions.csv (confirmed), active-regions.csv and deceased-regions.csv (just placeholder, as NIJZ does not provide this data anymore)
Region data: need new transform script - see #49
Age data: need new transform script - see #50
DSO data: need new transform script - TBD

When we have all of the above data, we can merge these CSV files into stats.csv:

region.* from new regions-cases.csv
age.* from new age-confirmed.csv
deceased.* from new age-deceased.csv
cases.* from new cases.csv
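
The final merge can be sketched as a join on the date column. This assumes each CSV has already been loaded into a list of dicts with a `date` key, and that column names do not clash between files. The function name is hypothetical.

```python
def merge_by_date(*tables):
    """Merge several tables (lists of row dicts) into one row per date."""
    merged = {}
    for table in tables:
        for row in table:
            # Create the row for this date on first sight, then add columns.
            merged.setdefault(row["date"], {"date": row["date"]}).update(row)
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return [merged[day] for day in sorted(merged)]
```

Dates missing from one of the inputs simply lack those columns in the merged row, so a CSV writer would need a default value for them (for example `restval=""` in csv.DictWriter).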

NIJZ XLS: regions-cases.csv

From the NIJZ XLS, process the regions data from Tabela 3 into a new regions-cases.csv with the following columns:

region.lj.todate
region.ce.todate
region.mb.todate
region.ms.todate
region.kr.todate
region.nm.todate
region.za.todate
region.sg.todate
region.po.todate
region.ng.todate
region.kp.todate
region.kk.todate
region.foreign.todate
region.unknown.todate
region.todate

stats-weekly.csv automated processing with NIJZ weekly report

NIJZ publishes two weekly reports on Monday: okuzeni (infected) and umrli (deceased).

We should automate processing of XLSX into stats-weekly.csv.

Currently we have GSheet for weekly data.

Existing data can be processed from:

week | date | date.to

obvious

week.confirmed | week.rhoccupant

umrli - Tabela 1

week.investigated

okuzeni - Tabela 1 (Skupaj)

week.healthcare

okuzeni - Tabela 4

week.src.import | week.src.import-related | week.src.local | week.src.unknown

okuzeni - Tabela 1

okuzeni - Tabela 2

week.from.<country>

okuzeni - Tabela 3 - just needs to be rotated
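
"Rotated" here presumably means transposed, so that rows become columns. A minimal sketch, assuming the table is available as a list of equal-length rows; the actual layout of Tabela 3 is not described in this issue.

```python
def rotate(table):
    """Transpose a table given as a list of equal-length rows."""
    return [list(column) for column in zip(*table)]
```

With one input row per country and one column per week, the result has one row per week and one column per country, which matches the week.from.<country> layout.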

Additionally, we can add further data:

week.deceased | week.deceased.rhoccupant

umrli - Tabela 2