bcgov / nr-rfc-climate-obs

Transition of the existing climate observations data pipeline to enable running off prem

License: Apache License 2.0

Python 41.02% Dockerfile 1.28% R 3.52% Smarty 0.51% Shell 0.02% Jupyter Notebook 52.78% Batchfile 0.81% PowerShell 0.07%
Topics: lwrs, mof, nrids, rfc

nr-rfc-climate-obs's Introduction

Lifecycle: Experimental

nr-rfc-climate-obs

Work related to re-building the climate observations data pipeline so that it can run in a variety of different environments.

The objective is to reuse as much of the existing R code as possible, with minimal changes.

The general architectural pivot is moving away from using Shared File and Print (SFP) to using object storage for persistence of artifacts created by the different data acquisition / processing scripts.


nr-rfc-climate-obs's People

Contributors

chunchaokuo, derekroberts, frantarkenton, kysiemens, renovate[bot], repo-mountie[bot]


nr-rfc-climate-obs's Issues

Modify permissions of jenkins service account

In order to sync data from the River Forecast Centre SFP drive using Jenkins, the service account that the Jenkins jobs run under needs to be modified so that it can read from that drive.

I have sent an email to Scott Sharp requesting this change.

Definition of Done

  • Jenkins jobs can run and sync the climate obs spreadsheet from the file share to object storage.

Let's use common phrasing

TL;DR 🏎️

Teams are encouraged to favour modern inclusive phrasing both in their communication as well as in any source checked into their repositories. You'll find a table at the end of this text with preferred phrasing to socialize with your team.

Words Matter

We're aligning our development community to favour inclusive phrasing for common technical expressions. There is a table below that outlines the phrases that are being retired along with the preferred alternatives.

During your team scrum, technical meetings, documentation, the code you write, etc. use the inclusive phrasing from the table below. That's it - it really is that easy.

For the curious mind, the Public Service Agency (PSA) has published a guide describing how Words Matter in our daily communication. It's an insightful read and a good reminder to be curious and open minded.

What about the master branch?

The word "master" is not inherently bad or non-inclusive. For example, people get a master's degree, become a master of their craft, or master a skill. It's generally when the word "master" is used alongside the word "slave" that it becomes non-inclusive.

Some teams choose to use the word main for the default branch of a repo as opposed to the more commonly used master branch. While it's not required or recommended, your team is empowered to do what works for them. If you do rename the master branch consider using main so that we have consistency among the repos within our organization.

Preferred Phrasing

Non-Inclusive => Inclusive
Whitelist => Allowlist
Blacklist => Denylist
Master / Slave => Leader / Follower; Primary / Standby; etc
Grandfathered => Legacy status
Sanity check => Quick check; Confidence check; etc
Dummy value => Placeholder value; Sample value; etc

Pro Tip 🤓

This list is not comprehensive. If you're aware of other outdated nomenclature please create an issue (PR preferred) with your suggestion.

Deploy Fire Weather Stations Pipeline to OpenShift

We recently became aware that the bcgov residency in GitHub only has access to 20 GitHub Actions runners. As the load increases this will have impacts on the timing of jobs. For this reason we are looking to pivot to running the pipeline in OpenShift.

This ticket will see the Python script that does the data acquisition and uploading to object storage, as well as the R code that transforms the XML files to CSV, run in OpenShift.

DoD:

  • helm chart that deploys the kubernetes cron jobs
  • Investigate whether the helm chart can be configured so that the two jobs have dependencies on one another, i.e. the download process runs and then calls the R script job.

To be completed, but not as part of this ticket:

  • Kibana dashboard / query that makes it easy to identify whether a job succeeded or not

relates to: #21

weatherTimestamp not getting calculated correctly by main_fwx

The fire weather station timestamps are being recorded as 2023112721, 2023112722, 2023112799. The one that is 2023112799 should be 2023112723.

Figure out what's going on with this and correct the script, then retroactively generate new data.
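A minimal sketch of deriving the timestamp so the final hour of a day renders as 23 rather than a sentinel like 99; the function name and formatting are illustrative, not the current main_fwx logic.

```python
from datetime import datetime


def weather_timestamp(obs_time: datetime) -> str:
    """Format an observation time as YYYYMMDDHH with hours 00-23.

    Sketch only: however main_fwx indexes hours internally, the final
    hour of the day should render as 23, never as a sentinel like 99.
    """
    return obs_time.strftime("%Y%m%d%H")


# Example: the last hourly observation on 2023-11-27
print(weather_timestamp(datetime(2023, 11, 27, 23)))  # -> 2023112723
```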

Create GHA to ingest MPOML Data

This will be a simple pipeline that pulls the data from the federal government datamart to a location in object storage that emulates the location used in the on-prem pipeline.

The timing for now will be the same as the schedule used on prem.

Fix the ZXS script

The ingestion spreadsheet is going to look for a specific date for the ZXS data. Currently the ZXS data acquisition code is only creating a single file and then overwriting that file.

What ZXS script should do:

  • create a folder in object storage with dated files, e.g. 'ZXS//ObsTephi_12_CZXS-2014-01-21.csv' (see the sketch below)
  • ditto for local storage of the data.
  • Don't think it's necessary to maintain the current file with just ObsTephi_12_CZXS.csv in the ZXS directory.
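A small sketch of building dated object keys, assuming the naming pattern from the example path above; the helper name and prefix handling are illustrative.

```python
from datetime import date


def zxs_object_key(obs_date: date, prefix: str = "ZXS") -> str:
    """Build a dated key so each day's ZXS pull is preserved instead of
    overwriting a single ObsTephi_12_CZXS.csv file (naming pattern assumed
    from the example path in this ticket)."""
    return f"{prefix}/ObsTephi_12_CZXS-{obs_date.isoformat()}.csv"


print(zxs_object_key(date(2014, 1, 21)))  # -> ZXS/ObsTephi_12_CZXS-2014-01-21.csv
```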

Write Climate Obs directly to the XL SS

Overview

The idea with this ticket is to move the logic that is currently contained in the XL spreadsheet import macro and the R climate_obs app to Python, so that we can download the data and write it into the spreadsheet all in one go.

Steps

For all the various input data sets:

  1. ASP - Automated Snow Pillows
  2. ECCC - Environment Canada weather stations
  3. ZXS - Vertical temperature profiles
  4. F_WX - BC Wildfire Weather Stations

For each dataset:

  1. iterate over the various stations
  2. Look up the station's name against the column header from the METADATA tab in the XL spreadsheet.
  3. Copy the data for the current date / time into the ALL_DATA sheet (sketched below).
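A rough openpyxl sketch of steps 2 and 3, assuming the METADATA tab maps station names to ALL_DATA column headers in its first two columns and that ALL_DATA has its headers in row 1; the real tab layouts in the climate_obs spreadsheet may differ.

```python
import openpyxl


def write_observations(xlsx_path, observations):
    """observations: dict of station name -> value for the current date/time."""
    wb = openpyxl.load_workbook(xlsx_path)
    meta = wb["METADATA"]
    all_data = wb["ALL_DATA"]

    # station name (col A) -> ALL_DATA column header (col B): assumed layout
    station_to_header = {
        row[0].value: row[1].value
        for row in meta.iter_rows(min_row=2, max_col=2)
        if row[0].value
    }

    # header text -> column index in ALL_DATA (headers assumed in row 1)
    header_to_col = {cell.value: cell.column for cell in all_data[1] if cell.value}

    target_row = all_data.max_row + 1  # append a new row for this date/time
    for station, value in observations.items():
        col = header_to_col.get(station_to_header.get(station))
        if col:
            all_data.cell(row=target_row, column=col, value=value)

    wb.save(xlsx_path)
```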

Missing Data
How this is going to work hasn't been finalized at this time

  • One idea is to reuse the logic from the R data ingestion script, which looks at stations that are close by, calculates the relative temperature differences between the two stations, and then uses that information to interpolate the missing values.
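A minimal sketch of that nearby-station idea, assuming the two hourly series share a timestamp index; the real R logic may select and weight reference stations differently.

```python
import pandas as pd


def fill_from_neighbour(target: pd.Series, neighbour: pd.Series) -> pd.Series:
    """Fill gaps in `target` using `neighbour` plus their mean difference
    over the hours where both stations reported."""
    both = pd.concat([target, neighbour], axis=1, keys=["target", "neighbour"])
    overlap = both.dropna()
    offset = (overlap["target"] - overlap["neighbour"]).mean()
    return target.fillna(neighbour + offset)
```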

Add missing topics

TL;DR

Topics greatly improve the discoverability of repos; please add the short code from the table below to the topics of your repo so that ministries can use GitHub's search to find out what repos belong to them and other visitors can find useful content (and reuse it!).

Why Topic

In short order we'll add our 800th repo. This large number clearly demonstrates the success of using GitHub and our Open Source initiative. This huge success means it's critical that we work to make our content as discoverable as possible. Through discoverability, we promote code reuse across a large decentralized organization like the Government of British Columbia as well as allow ministries to find the repos they own.

What to do

Below is a table of abbreviations, a.k.a. short codes, for each ministry; they're the ones used in all @gov.bc.ca email addresses. Please add the short codes of the ministry or organization that "owns" this repo as a topic.

(screenshot: add a topic)

That's it, you're done!!!

How to use

Once topics are added, you can use them in GitHub's search. For example, enter something like org:bcgov topic:citz to find all the repos that belong to Citizens' Services. You can refine this search by adding key words specific to a subject you're interested in. To learn more about searching through repos check out GitHub's doc on searching.

Pro Tip 🤓

  • If your org is not in the list below, or the table contains errors, please create an issue here.

  • While you're doing this, add additional topics that would help someone searching for "something". These can be the language used javascript or R; something like opendata or data for data only repos; or any other key words that are useful.

  • Add a meaningful description to your repo. This is hugely valuable to people looking through our repositories.

  • If your application is live, add the production URL.

Ministry Short Codes

Short Code Organization Name
AEST Advanced Education, Skills & Training
AGRI Agriculture
ALC Agriculture Land Commission
AG Attorney General
MCF Children & Family Development
CITZ Citizens' Services
DBC Destination BC
EMBC Emergency Management BC
EAO Environmental Assessment Office
EDUC Education
EMPR Energy, Mines & Petroleum Resources
ENV Environment & Climate Change Strategy
FIN Finance
FLNR Forests, Lands, Natural Resource Operations & Rural Development
HLTH Health
IRR Indigenous Relations & Reconciliation
JEDC Jobs, Economic Development & Competitiveness
LBR Labour Policy & Legislation
LDB BC Liquor Distribution Branch
MMHA Mental Health & Addictions
MAH Municipal Affairs & Housing
BCPC Pension Corporation
PSA Public Service Agency
PSSG Public Safety and Solicitor General
SDPR Social Development & Poverty Reduction
TCA Tourism, Arts & Culture
TRAN Transportation & Infrastructure
WLRS Water, Land and Resource Stewardship

NOTE See an error or omission? Please create an issue here to get it remedied.

Determine how jobs will be linked

There will be 4 separate jobs that need to be run before the R based data ingestion script can be run.

Need to determine how to deal with convergent jobs.

Some options:

  • Implement individual jobs as DAGs in an Airflow pipeline, with the R based data ingestion getting triggered when the 4 data prep jobs are complete
  • Evaluate if a trigger can be setup using GHA that will only progress if all 4 jobs are complete.
  • Staggered cron jobs (possibly as an initial option, but really want to avoid this option)
  • Other???

Integrate Running of COFFEE Model

This model could likely be defined inside of a GHA

Need to determine how the triggering will take place.

Would pull the dependent data down to the local file system, then figure out a way to trigger the Excel macro that initiates the COFFEE model (a pywin32 sketch follows below).

When complete, push the data back up to object storage.
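A hedged sketch of driving the Excel macro with pywin32 (already listed in pyproject.toml); the workbook path and macro name are placeholders, not the actual COFFEE model names.

```python
import win32com.client


def run_coffee_macro(workbook_path: str, macro_name: str = "RunCoffeeModel") -> None:
    """Open the workbook and run the macro that initiates the model.

    `macro_name` is an assumption; substitute the real macro defined in
    the COFFEE spreadsheet.
    """
    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = False
    try:
        wb = excel.Workbooks.Open(workbook_path)
        excel.Application.Run(macro_name)
        wb.Save()
        wb.Close()
    finally:
        excel.Quit()
```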

Document the Jenkins object storage sync process

A Jenkins process has been set up to sync data between object storage and our on prem SFP. The Jenkins pipeline config is located here:
https://github.com/bcgov/nr-rfc-climate-obs/blob/main/scripts/bat/ostore_push.jenkins

Jenkins job config is here: https://apps.nrs.gov.bc.ca/int/cron/job/RIVER_FORECAST_CENTER/job/climate_obs_rsync

Should document somewhere (maybe the RFC wiki) how this works. Could be part of a larger ticket where we document where the various GitHub based flows originate.

Acquire ECCC Hourly Data

Create a script that will pull the following information on an hourly basis.

Source of data:
https://hpfx.collab.science.gc.ca/20231101/WXO-DD/observations/swob-ml/20231101/

Data Acquisition

  • need to figure out which weather stations we want to keep and which we do not
    • do a bounding box check (in future, create a buffered BC polygon that we can query)
    • For each weather station grab the lat/long and determine if it is in our area of interest (see the sketch after this list).
    • if so then proceed to processing
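A minimal bounding-box check; the coordinates below are an illustrative box roughly covering BC, not an agreed-on boundary.

```python
# Illustrative bounding box roughly around BC (assumption, not a spec value)
BC_BBOX = {"min_lat": 48.0, "max_lat": 60.0, "min_lon": -139.5, "max_lon": -114.0}


def in_area_of_interest(lat: float, lon: float, bbox: dict = BC_BBOX) -> bool:
    """Return True if the station's coordinates fall inside the box."""
    return (
        bbox["min_lat"] <= lat <= bbox["max_lat"]
        and bbox["min_lon"] <= lon <= bbox["max_lon"]
    )
```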

Processing

  • Use the stations listed in the climate_obs spreadsheet to get the station list

  • pull down the station data for the current hour (note hours in the file names use UTC)

  • Extract the following properties from the individual XML files (parsing sketch after this list):

    • pcpn_amt_pst1hr
    • avg_air_temp_pst1hr
  • If a new day is detected then create a new file, otherwise pull the existing file from object store, update it, and re-push (make sure we are not creating new versions)

  • create 2 different input files one for temperature and another for precip.

    • PC.csv
    • TA.csv
  • format of the files / columns:

    • date
    • climate stations (listed along the x axis like the PC.csv ASP data)
    • actual data (either precip. or temperature depending on which file is being created)
  • Script would run hourly when the data is available

  • Would pull the data down and update it, and then repost (make sure we are not creating a new version in object storage when the file is updated)

  • Need to set up a sync process that will ensure the data that exists in object store also exists on the on prem server.

    • on prem file path: Z:\MPOML\HOURLY (sewer)
    • object store path: RFC_DATA/ECC_HOURLY/
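A hedged sketch of pulling the two properties out of a SWOB-ML file; it matches on the name/value attributes rather than the exact namespace layout, which should be verified against real files from the datamart.

```python
import xml.etree.ElementTree as ET

WANTED = {"pcpn_amt_pst1hr", "avg_air_temp_pst1hr"}


def extract_obs(xml_path: str) -> dict:
    """Return {property name: value} for the wanted properties.

    Walks every element and checks the "name"/"value" attributes so the
    sketch does not depend on namespace prefixes (assumed structure).
    """
    values = {}
    root = ET.parse(xml_path).getroot()
    for elem in root.iter():
        name = elem.attrib.get("name")
        if name in WANTED:
            values[name] = elem.attrib.get("value")
    return values
```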

Secondary:

  • listen to the message queue for the specific data we want and trigger the github action

Convert ZXS Job from Kubernetes job to GHA

Background

When this work started we were running into issues related to a shortage of GitHub Actions runners. At that time only 20 runners were available to GitHub Actions running in the bcgov org. Recently that number has been increased to 500. Deploying data acquisition jobs as GHAs is significantly easier than as Kubernetes jobs. For that reason we are pivoting most jobs back to run as GHAs.

Task

Modify the existing ZXS job to run as a GHA.

Transition from cron based schedule to Event based Triggering

Transition the schedule based triggers for the 4 input datasets to the climate obs pipeline to an event based system using an AMQP listener. Re-use the work that was created in the CMC Grib download repo.

Input datasets / pipelines:

  • Automated Snow Pillow Data (ASP)
  • MPOML data for today/yesterday
  • BC Wildfire weather stations
  • ZXS temperature data

Address ECCC version data in Object Storage

Background

The ECCC script pulls hourly data from the federal government's datamart, does some reformatting, and ultimately creates the files in the object storage bucket under the following directory: RFC_DATA/ECCC/hourly/csv

The script is currently running every hour. Each time it runs it creates a new version in object storage.

Task

Modify the ECCC code and update it so that there can only ever be two versions. If there are more than two versions, the oldest ones are auto-deleted.

The best place to implement this is the upstream nr-objectstore-util lib. Configure it so that there is an argument for the put operations that defines the maximum number of versions you want to maintain. If not populated it doesn't do anything and just creates a new version; however, if you specify an argument of version=2 then it will delete any versions that are older than the 2 newest ones.
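This is not the nr-objectstore-util API, just a sketch of the proposed behaviour against any S3-compatible store using boto3 directly; the `versions` argument name mirrors the proposal above.

```python
import boto3


def put_with_version_limit(client, bucket: str, key: str, body: bytes,
                           versions: int | None = None) -> None:
    """Upload the object, then prune versions beyond the newest `versions`."""
    client.put_object(Bucket=bucket, Key=key, Body=body)
    if not versions:
        return  # default: behave as today and simply add a new version

    resp = client.list_object_versions(Bucket=bucket, Prefix=key)
    matching = [v for v in resp.get("Versions", []) if v["Key"] == key]
    matching.sort(key=lambda v: v["LastModified"], reverse=True)
    for old in matching[versions:]:
        client.delete_object(Bucket=bucket, Key=key, VersionId=old["VersionId"])


# Usage sketch: keep only the two newest versions of the hourly CSV
# client = boto3.client("s3", endpoint_url="https://<ostore-endpoint>")
# put_with_version_limit(client, "rfc-bucket", "RFC_DATA/ECCC/hourly/csv/TA.csv",
#                        b"...", versions=2)
```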

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Location: renovate.json
Error type: Invalid JSON (parsing failed)
Message: Syntax error: expecting end of expression or separator near ]
"igno

Configure automatic run of XL based climate obs data prep

An R based Shiny app has been created to review / fix the climate observations data. Need to:

  • identify the code used to prep the XL spreadsheet
  • Modify the paths
  • Containerize
  • Integrate the running of this code as a GHA.

There is a significant amount of work to accomplish this task. Another task will also be defined to port the existing R based Shiny app to run on OpenShift.

Climate Obs - re-create the Quality Control Process

Background

The R script that populates the climate_obs spreadsheet has logic in it to do automated quality control / assurance.
@KYSIEMENS is familiar with this process.

The long term plan is to include this logic in the Python code that populates the climate obs spreadsheet.

Extract the Object Store Sync Code

Code has been created to support this work that syncs data to object storage. This is a common operation for many of the RFC jobs. As such it makes sense to extract this code into its own lib.

This ticket describes the work involved with extracting the code from this repo and putting it into https://github.com/bcgov/nr-objectstore-util.

Definition of Done:

  • Code that does the object store sync is moved to the nr-objectstore-util repo
  • All jobs defined in the climate obs repo continue to work after the code has been moved
  • Documentation is created in nr-objectstore-util that explains how to set up the three way sync between remote / local / ostore

Fix issue with the fwx download / reprocess

Spent a bunch of time getting the original R script to run as a container. Unfortunately this has not resulted in a working script. It is outputting some data, but not all the data.

Need to resolve why the data output from the reformat process does not include all the fields.

Configuration of the api fields being imported

Currently the field configuration is part of the API class. We want to create a configuration object that can be used to describe how the data is going to be transformed when it is downloaded from the BC Wildfire API.
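One hypothetical shape for such a configuration object; the field names and the per-field transform idea are illustrative, not the current API class.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FieldConfig:
    source_name: str                                 # field name returned by the API
    output_name: str                                 # column name used downstream
    transform: Callable[[Any], Any] = lambda v: v    # optional per-field conversion


@dataclass
class ApiImportConfig:
    fields: list[FieldConfig] = field(default_factory=list)


# Illustrative usage with made-up field names
config = ApiImportConfig(fields=[
    FieldConfig("temperature", "TA", float),
    FieldConfig("precipitation", "PC", float),
])
```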

Feat: Add Data Q/A check

The data that was downloaded on 2023-11-25 does not contain all the records for all the different timestamps.

Either the data wasn't available when the script ran, or the script was manually triggered due to a failure.

Current Logic:

  • script determines what day it is
  • Based on the current date calculates the "end_date" as the current date, with the time amended to be 9am
  • Calculates the local file path and the object store file path
  • Checks to see if the file that we are about to create exists in object storage; if it does not then it gets created, and if it already exists nothing happens.

Enhancement:

  • add logic to the download process to verify that there are 24 records for each weather station (see the sketch below)
  • if < 24 records then raise an error indicating that the data available for the current date is incomplete
  • Modify the github action to detect job success/failure
  • Add notifications for failed jobs.
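A sketch of the proposed completeness check, assuming the downloaded data lands in a pandas DataFrame with a station identifier column; the column name is illustrative.

```python
import pandas as pd


def check_hourly_completeness(df: pd.DataFrame, station_col: str = "station_id") -> None:
    """Raise if any station has fewer than 24 hourly records for the day."""
    counts = df.groupby(station_col).size()
    incomplete = counts[counts < 24]
    if not incomplete.empty:
        raise ValueError(
            "Incomplete data for the current date; stations with < 24 records: "
            + ", ".join(str(s) for s in incomplete.index)
        )
```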

Create GHA to ingest the Wildfire weather stations

This work will integrate the downloading and preparation of the wildfire data. Attempt to containerize and use the existing R scripts to accomplish this.

Ideally modify them so that they use environment variables to define file locations used.

Add a process at the end that uploads the data to object storage.

When the process is triggered it should check to identify whether the data already exists in object storage and, if so, halt processing.

Add Notifications for failures

We are starting to have a lot of jobs running as GitHub Actions. We require some kind of notification system to notify us when the jobs fail.

This should be relatively easy to do. Ideally it involves going through either the repository dispatch jobs or the cron triggered jobs and adding steps that issue notifications.

There are a couple options for these notifications:

  • email based (easy, but manual config, manual subscribe)
  • integrate with a teams channel that anyone could subscribe to.

Should also go through all the various other repos that have critical jobs associated with them and add notifications there as well.

Create Dashboard for success of Jobs

As we move more and more jobs into OpenSearch, we need to be able to create an OpenSearch query that will report on the last 5 to 10 runs and identify whether they were successful or not.

Create an OpenSearch query that reports on the status of the firedata_pipe and the zxs_pipe.

If time allows, create a Kibana dashboard.

climate_obs automated snow pillows (ASP) source data

We are currently pulling the ASP data from:
https://www.env.gov.bc.ca/wsd/data_searches/snow/asws/data/

Data is pulled daily from that location. It looks like the data is replaced on that site every hour?? Not 100% sure on that.

Data is also replicated to the datamart to:
https://hpfx.collab.science.gc.ca/20230830/WXO-DD/observations/swob-ml/partners/bc-env-snow/20230830/1a01p/

The origin of the data is the Aquarius database.
LOTS of redundancy here; thinking long term we should try to pull the data directly from the Aquarius API. Can work with water staff to configure that. Access to the API currently does not exist, but based on conversations it sounds like it's something that could be set up.

Extract the 3 way sync into a PyPI installable lib

A common operation that takes place with data used by hydrological models is what I refer to as a 3 way sync.

  • Data exists remotely with a data provider (most often the federal government datamart)
  • Local copy is created for rapid access / processing
  • The data gets persisted in object storage.

If a process should crash, it looks to object storage and will preferentially pull the data from object storage vs the original remote (sketched below).
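A minimal sketch of that retrieval preference; the callables stand in for whatever object-store and download functions the module actually provides.

```python
import os
from typing import Callable


def fetch(
    local_path: str,
    ostore_key: str,
    remote_url: str,
    ostore_exists: Callable[[str], bool],
    ostore_get: Callable[[str, str], None],
    ostore_put: Callable[[str, str], None],
    remote_get: Callable[[str, str], None],
) -> str:
    """Return a usable local copy: local file, else object storage, else the remote."""
    if os.path.exists(local_path):
        return local_path
    if ostore_exists(ostore_key):
        ostore_get(ostore_key, local_path)      # prefer the object-store copy
    else:
        remote_get(remote_url, local_path)      # fall back to the data provider
        ostore_put(local_path, ostore_key)      # persist for the next run
    return local_path
```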

Operations to support this work have been wrapped up into a Python module. To simplify the development of future work, the thinking is that all data pipelines that require this flow would use the same code.

This ticket would see the extraction of the object storage sync into its own PyPI installable module.
Could either create a new repo, or potentially add to the existing NR object storage lib
(if making it part of the NR object storage lib: https://github.com/bcgov/nr-objectstore-util).

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Awaiting Schedule

These updates are awaiting their schedule. Click on a checkbox to get an update now.

  • chore(deps): lock file maintenance

Pending Status Checks

These updates await pending status checks. To force their creation now, click the checkbox below.

  • chore(deps): update dependency pyarrow to v17
  • chore(deps): update github actions all dependencies (major) (actions/checkout, actions/setup-python, docker/build-push-action, docker/login-action, dorny/paths-filter, shrink/actions-docker-registry-tag, ubuntu)

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

dockerfile
r_data_prep.Dockerfile
  • rhub/r-minimal 4.5.0
  • rhub/r-minimal 4.5.0
r_py_data_prep.Dockerfile
  • rhub/r-minimal 4.5.0
  • rhub/r-minimal 4.5.0
win_docker/Dockerfile
  • mcr.microsoft.com/windows 1903
zxs_data_pull.Dockerfile
  • python 3.12.4-alpine
  • python 3.12.4-alpine
github-actions
.github/workflows/main_datamart_dl.yml
  • actions/checkout v3
  • actions/setup-python v4
  • ubuntu 20.04
.github/workflows/pr-close.yaml
  • redhat-actions/oc-login v1
  • actions/checkout v3
  • redhat-actions/oc-login v1
  • shrink/actions-docker-registry-tag v3
  • shrink/actions-docker-registry-tag v3
  • ubuntu 22.04
  • ubuntu 22.04
.github/workflows/pr-open-r.yaml
  • actions/checkout v3
  • dorny/paths-filter v2
  • dorny/paths-filter v2
  • docker/login-action v2
  • docker/build-push-action v4
  • shrink/actions-docker-registry-tag v3
  • docker/build-push-action v4
  • shrink/actions-docker-registry-tag v3
  • actions/checkout v3
  • redhat-actions/oc-login v1
  • ubuntu 22.04
  • ubuntu 22.04
  • ubuntu 22.04
.github/workflows/run_asp.yaml
  • actions/checkout v3
  • actions/setup-python v4
  • ubuntu 20.04
.github/workflows/run_climate_obs.yaml
  • actions/checkout v3
  • actions/setup-python v4
.github/workflows/run_fwx.yaml
  • actions/checkout v3
  • actions/setup-python v4
  • ubuntu 20.04
.github/workflows/run_mpoml.yaml
  • actions/checkout v3
  • actions/setup-python v4
  • ubuntu 20.04
helm-values
cicd/climateobs/values.yaml
pep621
pyproject.toml
pip_requirements
scripts/python/requirements-asp.txt
  • pandas ==2.2.2
  • bs4 ==0.0.2
  • nr_objstore_util ==0.10.0
  • requests ==2.32.3
scripts/python/requirements-datamartdl.txt
  • requests ==2.32.3
  • pandas ==2.2.2
  • numpy ==1.26.4
  • nr-objstore-util ==0.10.0
  • pyarrow ==14.0.2
scripts/python/requirements-dev.txt
  • black ==24.4.2
  • ruff ==0.5.2
scripts/python/requirements.txt
  • nr_objstore_util ==0.10.0
  • requests ==2.32.3
  • beautifulsoup4 ==4.12.3
  • python-dotenv ==1.0.1
poetry
pyproject.toml
  • python >=3.11,<3.13
  • nr-objstore-util ^0.10.0
  • requests ^2.31.0
  • beautifulsoup4 ^4.12.2
  • python-dotenv ^1.0.0
  • pandas ^2.1.1
  • pywin32 ^306
  • black 24.4.2
  • ruff ^0.5.0
  • mypy ^1.5.1

  • Check this box to trigger a request for Renovate to run again on this repository

Create GHA Job to ingest the Automated Snow Pillow Data

Pull the data, and then push it to object storage. Use a storage location similar to the one used on prem when locating the files in object storage.

Existing scripts to be replaced:
ASP_Climate.bat which calls ASP_daily_climate.R

Rework this logic in Python (see the sketch below).
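A hedged sketch of the Python replacement, assuming the source page is a plain directory listing of CSV files; the object storage upload step is omitted here and would use the existing sync code.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

ASP_URL = "https://www.env.gov.bc.ca/wsd/data_searches/snow/asws/data/"


def download_asp_files(dest_dir: str = ".") -> list[str]:
    """Download every CSV linked from the ASP data page (listing layout assumed)."""
    page = requests.get(ASP_URL, timeout=60)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    saved = []
    for link in soup.find_all("a"):
        href = link.get("href", "")
        if href.lower().endswith(".csv"):
            data = requests.get(urljoin(ASP_URL, href), timeout=60)
            data.raise_for_status()
            path = f"{dest_dir}/{href.rsplit('/', 1)[-1]}"
            with open(path, "wb") as fh:
                fh.write(data.content)
            saved.append(path)
    return saved
```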

Document Climate Observations Pipeline

Currently the repo: https://github.com/bcgov/nr-rfc-grib-copy collects and processes the climate forecast data.

The next input that is being tackled is the climate observations data pipeline. This pipeline ingests data from:

  • Automated Snow Pillows - Aquarius
  • Temperature by elevation data - federal gov data mart
  • BC Wildfire climate stations
  • other

Also try to document the data cleaning steps that take place when the data is ingested by Excel.

This task will identify all the existing schedules, and related scripts that are used to collect this information.

The first step for this work will be to create a repository where the work for a data pipeline can be documented.

Add project lifecycle badge

No Project Lifecycle Badge found in your readme!

Hello! I scanned your readme and could not find a project lifecycle badge. A project lifecycle badge will provide contributors to your project as well as other stakeholders (platform services, executive) insight into the lifecycle of your repository.

What is a Project Lifecycle Badge?

It is a simple image that neatly describes your project's stage in its lifecycle. More information can be found in the project lifecycle badges documentation.

What do I need to do?

I suggest you make a PR into your README.md and add a project lifecycle badge near the top where it is easy for your users to pick it up :). Once it is merged feel free to close this issue. I will not open up a new one :)
