
ogd_qualitychecks's Introduction

CSV Quality Check with Frictionless

This Streamlit app enables users to perform quality checks on CSV files of SFOE OGD publications using the Frictionless framework. The app allows easy validation of CSV files against defined schemas (extracted from the corresponding datapackage.json files hosted on uvek-gis), ensuring data integrity and adherence to the specified formats.

You can find the app here

Features

  • File Upload: Users can upload CSV files for validation directly through the Streamlit interface.
  • Schema-based Validation: Leveraging Frictionless, the app validates uploaded CSV files against predefined schemas or inferred data structures.
  • Error Handling: Error messages are provided in case of validation issues, aiding users in understanding and rectifying data problems.

Files

  • app.py: This is the main Python file containing the Streamlit application code.
  • mapping.py: This file contains the mapping from file name to OGD number. Note: not every SFOE OGD publication has a datapackage.
  • requirements.txt: Lists the Python dependencies required for running the Streamlit app.
  • .streamlit/config.toml: This is the Streamlit configuration file. It allows customization of Streamlit's behavior, including server settings, theming, and other preferences.

Problems

The structure of the code is as follows: the CSV content is loaded into a pd.DataFrame and validated with frictionless.validate using the corresponding schema from the datapackage.json. While testing we found that even if the CSV is valid, it is reported as invalid when all of its columns have dtype int64. If, however, the dtype of any one column is changed to float, the file validates as expected.
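A minimal sketch of the dtype situation described above (the CSV content and column names are made up for illustration; the frictionless.validate call itself is omitted):

```python
import io

import pandas as pd

# A small CSV whose columns pandas infers entirely as int64 --
# the case that was wrongly reported as invalid during testing.
csv_content = "year,count\n2020,10\n2021,12\n"
df = pd.read_csv(io.StringIO(csv_content))
print(df.dtypes.astype(str).tolist())  # ['int64', 'int64']

# Observed workaround, not a real fix: cast one column to float
# before handing the frame to the validator.
df["count"] = df["count"].astype("float64")
print(df.dtypes.astype(str).tolist())  # ['int64', 'float64']
```

The cast is only the workaround observed during testing; the underlying int64 issue in the validation path is still open.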

Attempts to improve it:

  • Store the CSV file "locally" under the path file_name.csv and access & use it with the datapackage. Failed: "Source for frictionless.validate() is empty"
  • Save the CSV file in a temporary directory and use the datapackage. Failed: "frictionless.validate() cannot access source from unsecure origin"
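For reference, the write step of the temporary-directory attempt can be sketched as follows (bytes and file name are placeholders; the frictionless.validate call is omitted because it rejected the temporary path as an untrusted origin):

```python
import os
import tempfile

# Placeholder for the bytes received from the Streamlit uploader.
uploaded_bytes = b"year,count\n2020,10\n"

# Write the upload into a temporary directory so file-based tools
# can read it. This is the step that succeeded; the subsequent
# validation against this path is what failed.
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "upload.csv")
with open(csv_path, "wb") as f:
    f.write(uploaded_bytes)
```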

ogd_qualitychecks's People

Contributors

aresssera, actions-user, afoletti, lucs20, nrohrbach, sfoe-matthiasgalus

Watchers

Martin H.

ogd_qualitychecks's Issues

Step 7: automate

Goal: have the process run automatically, without manual triggering.

I see two options here: one very easy, the other more efficient but more complex.

Easy one: configure the GitHub action to run twice a day (at 12:00 and 22:00, for example)
Complex one: make the GitHub action monitor the staging folder and trigger whenever something changes. --> Please note that I am not even sure this is possible
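The easy option could look like this hypothetical workflow fragment (file name is a placeholder; note that GitHub's cron schedules run in UTC, so the hours may need shifting):

```yaml
# .github/workflows/quality-check.yml (hypothetical name)
on:
  workflow_dispatch:         # keep the existing manual trigger
  schedule:
    - cron: "0 12,22 * * *"  # 12:00 and 22:00 UTC, every day
```

For the complex option, the staging folder would have to emit an event that GitHub can receive (e.g. a webhook hitting a repository_dispatch trigger); plain Actions cannot natively watch an external folder, which matches the doubt expressed above.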

Step 4: Test the data

Goal: using the official frictionless GitHub action (look around in the frictionless documentation or take my repo as an example), use the datapackage to test the data in the staging area.
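If the official action turns out not to fit, the same check can be scripted with the frictionless CLI inside a plain workflow; a hypothetical step sketch (action versions and file locations are assumptions):

```yaml
# Hypothetical job steps; adjust paths to wherever the
# datapackage.json for the dataset under test lives.
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v5
    with:
      python-version: "3.11"
  - run: pip install frictionless
  - run: frictionless validate datapackage.json
```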

Documentation

This has no "step" since it should be an ongoing task.

Goal: to have the code and the process properly documented.
Deliverables are:

  • meaningful comments in the code
  • A meaningful readme in the repo
  • Badges in the opendata.swiss metadata entries, for show!

Step 6: feedback

Goal: to send meaningful feedback via email to the right people

If everything is OK and the data is published, you send an email just mentioning the data and that it was published.
If the data is NOK, you send a mail including the frictionless report link, telling the responsible person that they should check and fix the CSV.

For the time being, the process can send a mail to you (please parametrize the address, but you can set the value to something fixed for now).
It would however be nice to think about the next step. Namely: how can we get the right email addresses? Is that possible at all?
You can open an issue for this one if you want.
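A possible shape for the two mails described above, as a hypothetical helper that only builds the message (recipient, sender, and report link are parameters, not fixed project values; actual sending via smtplib is left out until a mail server is configured):

```python
from email.message import EmailMessage

def build_feedback_mail(recipient, dataset, ok, report_link=None,
                        sender="noreply@example.org"):
    """Build the OK / NOK feedback mail for one dataset (sketch)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    if ok:
        msg["Subject"] = f"{dataset}: published"
        msg.set_content(f"The dataset {dataset} was validated and published.")
    else:
        msg["Subject"] = f"{dataset}: validation failed"
        msg.set_content(
            f"Validation of {dataset} failed. Please check and fix the CSV.\n"
            f"Frictionless report: {report_link}"
        )
    return msg
```

Once an SMTP server is available, `smtplib.SMTP(...).send_message(msg)` would deliver the result.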

Step 5: Move the data

Goal: move the data to the production folder, or simply delete it from staging, depending on the test results

If the test is OK, you move the data from staging to the correct production folder.
If the test is NOK, you store the result link somewhere and you delete the incorrect file from staging
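A hypothetical sketch of this move-or-delete step (the paths and the report-link store are placeholders; on the real server these would be the staging and production folders):

```python
import shutil
from pathlib import Path

def handle_test_result(csv_path, production_dir, ok, report_links):
    """Move a validated CSV to production, or drop it and keep the
    report link for the failure mail (sketch)."""
    csv_path = Path(csv_path)
    if ok:
        Path(production_dir).mkdir(parents=True, exist_ok=True)
        shutil.move(str(csv_path), str(Path(production_dir) / csv_path.name))
    else:
        # Placeholder: the real frictionless report link goes here.
        report_links[csv_path.name] = "<frictionless report link>"
        csv_path.unlink()
```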

Step 2: extract OGD-ID

Goal: from the CSV in the list, extract the ID from the filename.

The OGD CSVs have an "ogdxxx_" prefix, where "xxx" is a number of 1-3 digits. This is the internal OGD ID, and it is also used as the name of the final folder where the data is stored for publication.
Example: https://www.uvek-gis.admin.ch/BFE/ogd/6/ for ogd6_kev-bezueger.csv
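Following the naming rule above, the extraction can be sketched as a small regex helper (function name is mine, not from the repo):

```python
import re

def extract_ogd_id(filename):
    """Return the 1-3 digit OGD ID from an "ogdxxx_" prefix,
    or None if the filename has no such prefix."""
    match = re.match(r"ogd(\d{1,3})_", filename)
    return int(match.group(1)) if match else None

print(extract_ogd_id("ogd6_kev-bezueger.csv"))  # 6
print(extract_ogd_id("readme.txt"))             # None
```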

Step 0: setup

Goal: have a properly set up work environment and development process

What I would like is to use this very small project as an exercise for you to use Git(hub) with proper commits, branches, pull requests (PR), code reviews and merges. Feel free to ask for details and/or read a guide or two about the mechanics.

The process as I foresee it (open for discussion in this ticket):

  1. You handle the tickets one after the other, either following my numbering or proposing a new order if you find it better
  2. For each ticket, you write your code and commit it regularly on a meaningfully named branch using meaningful commit titles (short) and comments (longer).
  3. Once happy with the results and you consider the ticket to be solved, you post a PR and assign it to me, mentioning the ticket it will solve (use a proper github link)
  4. I (and possibly Martin) review the code, make suggestions or comments
  5. Once the code is OK, you do a very short (5 minutes) live demo, demonstrating the feature
  6. The code is accepted (by me and/or Martin) and then merged into master (by yourself)

Deliverables for this ticket:

  • your comments/suggestions about the process. This should be good for you.
  • a GitHub action you can develop on, where the various steps can be executed. The action should be triggered manually for the time being

Step 3: Associate Data and datapackage

Goal: to use the right datapackage.json for testing the right data

(Almost) every OGD publication has a corresponding datapackage.json in its publication folder. This datapackage is used to test the formal quality of the published data.
Using the information from the previous steps, you need to select the right datapackage for the data you are currently testing.
Please note that the datapackage contains a "path" attribute with the URL of the CSV to be tested:
"path": "https://www.uvek-gis.admin.ch/BFE/ogd/6/ogd6_kev-bezueger.csv",
This URL is however NOT what you want (you want to test the CSV in the staging folder). You therefore need to either store a temporary copy of the datapackage with a modified path, or find a way to modify the path at runtime.
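The "temporary copy with a modified path" option can be sketched as a small rewrite of each resource's "path" before validation (function name and staging URL are assumptions for illustration):

```python
import copy

def retarget_datapackage(datapackage, staging_base):
    """Return a copy of the datapackage dict whose resource paths
    point at the staging folder instead of production (sketch)."""
    dp = copy.deepcopy(datapackage)
    for resource in dp.get("resources", []):
        filename = resource["path"].rsplit("/", 1)[-1]
        resource["path"] = f"{staging_base.rstrip('/')}/{filename}"
    return dp

dp = {"resources": [{"path":
      "https://www.uvek-gis.admin.ch/BFE/ogd/6/ogd6_kev-bezueger.csv"}]}
staged = retarget_datapackage(dp, "https://example.org/staging/")
print(staged["resources"][0]["path"])
# https://example.org/staging/ogd6_kev-bezueger.csv
```

Working on a deep copy keeps the original datapackage untouched, so the production path is still available after the staging test.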
