bcgov / nr-rfc-climate-obs

Transition of the existing climate observations data pipeline to enable running off prem

License: Apache License 2.0

Python 41.02% Dockerfile 1.28% R 3.52% Smarty 0.51% Shell 0.02% Jupyter Notebook 52.78% Batchfile 0.81% PowerShell 0.07%
Topics: lwrs, mof, nrids, rfc

nr-rfc-climate-obs's Introduction

Lifecycle: Experimental

nr-rfc-climate-obs

Work related to re-building the climate observations data pipeline so that it can run in a variety of different environments.

The objective is to reuse as much of the existing R code as possible, with minimal changes.

The general architectural pivot is moving away from using Shared File and Print (SFP) to using object storage for persistence of artifacts created by the different data acquisition / processing scripts.


nr-rfc-climate-obs's People

Contributors

chunchaokuo, derekroberts, frantarkenton, kysiemens, renovate[bot], repo-mountie[bot]


nr-rfc-climate-obs's Issues

Modify permissions of jenkins service account

In order to sync data from the River Forecast Centre SFP drive using Jenkins, the service account that the Jenkins jobs run under needs to be modified so that it can read from that drive.

I have sent an email to Scott Sharp requesting this change.

Definition of Done

  • Jenkins jobs can run and sync the climate obs spreadsheet from the file share to object storage.

Let's use common phrasing

TL;DR 🏎️

Teams are encouraged to favour modern inclusive phrasing both in their communication as well as in any source checked into their repositories. You'll find a table at the end of this text with preferred phrasing to socialize with your team.

Words Matter

We're aligning our development community to favour inclusive phrasing for common technical expressions. There is a table below that outlines the phrases that are being retired along with the preferred alternatives.

During your team scrum, technical meetings, documentation, the code you write, etc. use the inclusive phrasing from the table below. That's it - it really is that easy.

For the curious mind, the Public Service Agency (PSA) has published a guide describing how Words Matter in our daily communication. It's an insightful read and a good reminder to be curious and open minded.

What about the master branch?

The word "master" is not inherently bad or non-inclusive. For example, people get a master's degree, become a master of their craft, or master a skill. It's generally when the word "master" is used alongside the word "slave" that it becomes non-inclusive.

Some teams choose to use the word main for the default branch of a repo as opposed to the more commonly used master branch. While it's not required or recommended, your team is empowered to do what works for them. If you do rename the master branch consider using main so that we have consistency among the repos within our organization.

Preferred Phrasing

Non-Inclusive => Inclusive
Whitelist => Allowlist
Blacklist => Denylist
Master / Slave => Leader / Follower; Primary / Standby; etc
Grandfathered => Legacy status
Sanity check => Quick check; Confidence check; etc
Dummy value => Placeholder value; Sample value; etc

Pro Tip 🤓

This list is not comprehensive. If you're aware of other outdated nomenclature please create an issue (PR preferred) with your suggestion.

Deploy Fire Weather Stations Pipeline to OpenShift

We recently became aware that the bcgov residency in GitHub only has access to 20 GitHub Actions runners. As the load increases this will have impacts on the timing of jobs. For this reason we are looking to pivot to running the pipeline in OpenShift.

This ticket will see the Python script that does the data acquisition and uploading to object storage, as well as the R code that transforms the XML files to CSV, run in OpenShift.

DoD:

  • helm chart that deploys the kubernetes cron jobs
  • Investigate whether the helm chart can be configured so that the two jobs have dependencies on one another, i.e. the download process runs and then calls the R script job.

To be completed, but not as part of this ticket:

  • Kibana dashboard / query that makes it easy to identify whether a job succeeded or not

relates to: #21

weatherTimestamp not getting calculated correctly by main_fwx

The fire weather station timestamps are being recorded as 2023112721, 2023112722, 2023112799. The one that is 2023112799 should be 2023112723.

Figure out what's going on with this and correct the script, then retroactively generate new data.
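A minimal sketch of deriving the timestamp so the final hour of a day renders as 23 rather than a sentinel like 99; the function name and formatting are illustrative, not the current main_fwx logic.

```python
from datetime import datetime


def weather_timestamp(obs_time: datetime) -> str:
    """Format an observation time as YYYYMMDDHH with hours 00-23.

    Sketch only: however main_fwx indexes hours internally, the final
    hour of the day should render as 23, never as a sentinel like 99.
    """
    return obs_time.strftime("%Y%m%d%H")


# Example: the last hourly observation on 2023-11-27
print(weather_timestamp(datetime(2023, 11, 27, 23)))  # -> 2023112723
```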

Create GHA to ingest MPOML Data

This will be a simple pipeline that pulls the data from the federal government datamart to a location in object storage that emulates the location used in the on-prem pipeline.

The timing for now will be the same as the schedule used on prem.

Fix the ZXS script

The ingestion spreadsheet is going to look for a specific date for the ZXS data. Currently the ZXS data acquisition code is only creating a single file and then overwriting that file.

What ZXS script should do:

  • create a folder in object storage with dated files, e.g. 'ZXS//ObsTephi_12_CZXS-2014-01-21.csv' (see the sketch below)
  • ditto for local storage of the data.
  • Don't think it's necessary to maintain the current file with just ObsTephi_12_CZXS.csv in the ZXS directory.
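A small sketch of building dated object keys, assuming the naming pattern from the example path above; the helper name and prefix handling are illustrative.

```python
from datetime import date


def zxs_object_key(obs_date: date, prefix: str = "ZXS") -> str:
    """Build a dated key so each day's ZXS pull is preserved instead of
    overwriting a single ObsTephi_12_CZXS.csv file (naming pattern assumed
    from the example path in this ticket)."""
    return f"{prefix}/ObsTephi_12_CZXS-{obs_date.isoformat()}.csv"


print(zxs_object_key(date(2014, 1, 21)))  # -> ZXS/ObsTephi_12_CZXS-2014-01-21.csv
```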

Write Climate Obs directly to the XL SS

Overview

The idea with this ticket is to move the logic that is currently contained in the XL spreadsheet import macro and the R climate_obs app to Python, so that we can download the data and write it into the spreadsheet all in one go.

Steps

For all the various input data sets:

  1. ASP - Automated Snow Pillows
  2. ECCC - Environment Canada weather stations
  3. ZXS - Vertical temperature profiles
  4. F_WX - BC Wildfire Weather Stations

For each dataset:

  1. iterate over the various stations
  2. Look up the station's name against the column header from the METADATA tab in the XL spreadsheet.
  3. Copy the data for the current date / time into the ALL_DATA sheet (sketched below).
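A rough openpyxl sketch of steps 2 and 3, assuming the METADATA tab maps station names to ALL_DATA column headers in its first two columns and that ALL_DATA has its headers in row 1; the real tab layouts in the climate_obs spreadsheet may differ.

```python
import openpyxl


def write_observations(xlsx_path, observations):
    """observations: dict of station name -> value for the current date/time."""
    wb = openpyxl.load_workbook(xlsx_path)
    meta = wb["METADATA"]
    all_data = wb["ALL_DATA"]

    # station name (col A) -> ALL_DATA column header (col B): assumed layout
    station_to_header = {
        row[0].value: row[1].value
        for row in meta.iter_rows(min_row=2, max_col=2)
        if row[0].value
    }

    # header text -> column index in ALL_DATA (headers assumed in row 1)
    header_to_col = {cell.value: cell.column for cell in all_data[1] if cell.value}

    target_row = all_data.max_row + 1  # append a new row for this date/time
    for station, value in observations.items():
        col = header_to_col.get(station_to_header.get(station))
        if col:
            all_data.cell(row=target_row, column=col, value=value)

    wb.save(xlsx_path)
```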

Missing Data
How this is going to work hasn't been finalized at this time

  • One idea is to reuse the logic from the R data ingestion script, which looks at stations that are close by, calculates the relative temperature differences between the two stations, and then uses that information to interpolate the missing values.
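A minimal sketch of that nearby-station idea, assuming the two hourly series share a timestamp index; the real R logic may select and weight reference stations differently.

```python
import pandas as pd


def fill_from_neighbour(target: pd.Series, neighbour: pd.Series) -> pd.Series:
    """Fill gaps in `target` using `neighbour` plus their mean difference
    over the hours where both stations reported."""
    both = pd.concat([target, neighbour], axis=1, keys=["target", "neighbour"])
    overlap = both.dropna()
    offset = (overlap["target"] - overlap["neighbour"]).mean()
    return target.fillna(neighbour + offset)
```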

Add missing topics

TL;DR

Topics greatly improve the discoverability of repos; please add the short code from the table below to the topics of your repo so that ministries can use GitHub's search to find out what repos belong to them and other visitors can find useful content (and reuse it!).

Why Topic

In short order we'll add our 800th repo. This large number clearly demonstrates the success of using GitHub and our Open Source initiative. This huge success means it's critical that we work to make our content as discoverable as possible. Through discoverability, we promote code reuse across a large decentralized organization like the Government of British Columbia as well as allow ministries to find the repos they own.

What to do

Below is a table of abbreviations, a.k.a. short codes, for each ministry; they're the ones used in all @gov.bc.ca email addresses. Please add the short codes of the ministry or organization that "owns" this repo as a topic.

(screenshot: add a topic)

That's it, you're done!!!

How to use

Once topics are added, you can use them in GitHub's search. For example, enter something like org:bcgov topic:citz to find all the repos that belong to Citizens' Services. You can refine this search by adding key words specific to a subject you're interested in. To learn more about searching through repos check out GitHub's doc on searching.

Pro Tip 🤓

  • If your org is not in the list below, or the table contains errors, please create an issue here.

  • While you're doing this, add additional topics that would help someone searching for "something". These can be the language used javascript or R; something like opendata or data for data only repos; or any other key words that are useful.

  • Add a meaningful description to your repo. This is hugely valuable to people looking through our repositories.

  • If your application is live, add the production URL.

Ministry Short Codes

Short Code Organization Name
AEST Advanced Education, Skills & Training
AGRI Agriculture
ALC Agriculture Land Commission
AG Attorney General
MCF Children & Family Development
CITZ Citizens' Services
DBC Destination BC
EMBC Emergency Management BC
EAO Environmental Assessment Office
EDUC Education
EMPR Energy, Mines & Petroleum Resources
ENV Environment & Climate Change Strategy
FIN Finance
FLNR Forests, Lands, Natural Resource Operations & Rural Development
HLTH Health
IRR Indigenous Relations & Reconciliation
JEDC Jobs, Economic Development & Competitiveness
LBR Labour Policy & Legislation
LDB BC Liquor Distribution Branch
MMHA Mental Health & Addictions
MAH Municipal Affairs & Housing
BCPC Pension Corporation
PSA Public Service Agency
PSSG Public Safety and Solicitor General
SDPR Social Development & Poverty Reduction
TCA Tourism, Arts & Culture
TRAN Transportation & Infrastructure
WLRS Water, Land and Resource Stewardship

NOTE See an error or omission? Please create an issue here to get it remedied.

Determine how jobs will be linked

There will be 4 separate jobs that need to be run before the R based data ingestion script can be run.

Need to determine how to deal with convergent jobs.

Some options:

  • Implement individual jobs as DAGs in an Airflow pipeline, with the R based data ingestion getting triggered when the 4 data prep jobs are complete
  • Evaluate if a trigger can be setup using GHA that will only progress if all 4 jobs are complete.
  • Staggered cron jobs (possibly as an initial option, but really want to avoid this option)
  • Other???

Integrate Running of COFFEE Model

This model could likely be defined inside of a GHA

Need to determine how the triggering will take place.

Would pull the dependent data down to the local file system, then figure out a way to trigger the Excel macro that initiates the COFFEE model (a pywin32 sketch follows below).

When complete, push the data back up to object storage.
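A hedged sketch of driving the Excel macro with pywin32 (already listed in pyproject.toml); the workbook path and macro name are placeholders, not the actual COFFEE model names.

```python
import win32com.client


def run_coffee_macro(workbook_path: str, macro_name: str = "RunCoffeeModel") -> None:
    """Open the workbook and run the macro that initiates the model.

    `macro_name` is an assumption; substitute the real macro defined in
    the COFFEE spreadsheet.
    """
    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = False
    try:
        wb = excel.Workbooks.Open(workbook_path)
        excel.Application.Run(macro_name)
        wb.Save()
        wb.Close()
    finally:
        excel.Quit()
```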

Document the Jenkins object storage sync process

A Jenkins process has been set up to sync data between object storage and our on prem SFP. The Jenkins pipeline config is located here:
https://github.com/bcgov/nr-rfc-climate-obs/blob/main/scripts/bat/ostore_push.jenkins

Jenkins job config is here: https://apps.nrs.gov.bc.ca/int/cron/job/RIVER_FORECAST_CENTER/job/climate_obs_rsync

Should document somewhere (maybe the RFC wiki) how this works. Could be part of a larger ticket where we document where the various GitHub based flows originate.

Acquire ECCC Hourly Data

Create a script that will pull the following information on an hourly basis.

Source of data:
https://hpfx.collab.science.gc.ca/20231101/WXO-DD/observations/swob-ml/20231101/

Data Acquisition

  • need to figure out which weather stations we want to keep and which we do not
    • do a bounding box check (in future, create a buffered BC polygon that we can query)
    • For each weather station grab the lat/long and determine if it is in our area of interest (see the sketch after this list).
    • if so then proceed to processing
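A minimal bounding-box check; the coordinates below are an illustrative box roughly covering BC, not an agreed-on boundary.

```python
# Illustrative bounding box roughly around BC (assumption, not a spec value)
BC_BBOX = {"min_lat": 48.0, "max_lat": 60.0, "min_lon": -139.5, "max_lon": -114.0}


def in_area_of_interest(lat: float, lon: float, bbox: dict = BC_BBOX) -> bool:
    """Return True if the station's coordinates fall inside the box."""
    return (
        bbox["min_lat"] <= lat <= bbox["max_lat"]
        and bbox["min_lon"] <= lon <= bbox["max_lon"]
    )
```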

Processing

  • Use the stations listed in the climate_obs spreadsheet to get the station list

  • pull down the station data for the current hour (note hours in the file names use UTC)

  • Extract the following properties from the individual XML files (parsing sketch after this list):

    • pcpn_amt_pst1hr
    • avg_air_temp_pst1hr
  • If a new day is detected then create a new file, otherwise pull the existing file from object store, update it, and re-push (make sure we are not creating new versions)

  • create 2 different input files one for temperature and another for precip.

    • PC.csv
    • TA.csv
  • format of the files / columns:

    • date
    • climate stations (listed along the x axis like the PC.csv ASP data)
    • actual data (either precip. or temperature depending on which file is being created)
  • Script would run hourly when the data is available

  • Would pull the data down and update it, and then repost (make sure we are not creating a new version in object storage when the file is updated)

  • Need to set up a sync process that will ensure the data that exists in object store also exists on the on prem server.

    • on prem file path: Z:\MPOML\HOURLY (sewer)
    • object store path: RFC_DATA/ECC_HOURLY/
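A hedged sketch of pulling the two properties out of a SWOB-ML file; it matches on the name/value attributes rather than the exact namespace layout, which should be verified against real files from the datamart.

```python
import xml.etree.ElementTree as ET

WANTED = {"pcpn_amt_pst1hr", "avg_air_temp_pst1hr"}


def extract_obs(xml_path: str) -> dict:
    """Return {property name: value} for the wanted properties.

    Walks every element and checks the "name"/"value" attributes so the
    sketch does not depend on namespace prefixes (assumed structure).
    """
    values = {}
    root = ET.parse(xml_path).getroot()
    for elem in root.iter():
        name = elem.attrib.get("name")
        if name in WANTED:
            values[name] = elem.attrib.get("value")
    return values
```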

Secondary:

  • listen to the message queue for the specific data we want and trigger the github action

Convert ZXS Job from Kubernetes job to GHA

Background

When this work started we were running into issues related to a shortage of GitHub Actions runners. At that time only 20 runners were available to GitHub Actions running in the bcgov org. Recently that number has been increased to 500. Deploying data acquisition jobs as GHAs is significantly easier than as Kubernetes jobs. For that reason we are pivoting most jobs back to run as GHAs.

Task

Modify the existing ZXS job to run as a GHA.

Transition from cron based schedule to Event based Triggering

Transition the schedule based triggers for the 4 input datasets to the climate obs pipeline to an event based system using an AMQP listener. Re-use the work that was created in the CMC Grib download repo.

Input datasets / pipelines:

  • Automated Snow Pillow Data (ASP)
  • MPOML data for today/yesterday
  • BC Wildfire weather stations
  • ZXS temperature data

Address ECCC version data in Object Storage

Background

The ECCC script pulls hourly data from the federal government's datamart, does some reformatting, and ultimately creates the files in the object storage bucket under the following directory: RFC_DATA/ECCC/hourly/csv

The script is currently running every hour. Each time it runs it creates a new version in object storage.

Task

Modify the ECCC code and update it so that there can only ever be two versions. If there are more than two versions, the oldest ones are auto-deleted.

The best place to implement this is the upstream nr-objectstore-util lib. Configure it so that there is an argument for the put operations that defines the maximum number of versions you want to maintain. If not populated it doesn't do anything and just creates a new version; however, if you specify an argument of version=2 then it will delete any versions that are older than the 2 newest ones.
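This is not the nr-objectstore-util API, just a sketch of the proposed behaviour against any S3-compatible store using boto3 directly; the `versions` argument name mirrors the proposal above.

```python
import boto3


def put_with_version_limit(client, bucket: str, key: str, body: bytes,
                           versions: int | None = None) -> None:
    """Upload the object, then prune versions beyond the newest `versions`."""
    client.put_object(Bucket=bucket, Key=key, Body=body)
    if not versions:
        return  # default: behave as today and simply add a new version

    resp = client.list_object_versions(Bucket=bucket, Prefix=key)
    matching = [v for v in resp.get("Versions", []) if v["Key"] == key]
    matching.sort(key=lambda v: v["LastModified"], reverse=True)
    for old in matching[versions:]:
        client.delete_object(Bucket=bucket, Key=key, VersionId=old["VersionId"])


# Usage sketch: keep only the two newest versions of the hourly CSV
# client = boto3.client("s3", endpoint_url="https://<ostore-endpoint>")
# put_with_version_limit(client, "rfc-bucket", "RFC_DATA/ECCC/hourly/csv/TA.csv",
#                        b"...", versions=2)
```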

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Location: renovate.json
Error type: Invalid JSON (parsing failed)
Message: Syntax error: expecting end of expression or separator near ]
"igno

Configure automatic run of XL based climate obs data prep

An R based Shiny app has been created to review / fix the climate observations data. Need to:

  • identify the code used to prep the XL spreadsheet
  • Modify the paths
  • Containerize
  • Integrate the running of this code as a GHA.

There is a significant amount of work to accomplish this task. Another task will also be defined to port the existing R based Shiny app to run on OpenShift.

Climate Obs - re-create the Quality Control Process

Background

The R script that populates the climate_obs spreadsheet has logic in it to do automated quality control / assurance.
@KYSIEMENS is familiar with this process.

The long term plan is to include this logic in the Python code that populates the climate obs spreadsheet.

Extract the Object Store Sync Code

Code has been created to support this work that syncs data to object storage. This is a common operation for many of the RFC jobs. As such it makes sense to extract this code into its own lib.

This ticket describes the work involved with extracting the code from this repo and putting it into https://github.com/bcgov/nr-objectstore-util.

Definition of Done:

  • Code that does the object store sync is moved to the nr-objectstore-util repo
  • All jobs defined in the climate obs repo continue to work after the code has been moved
  • Documentation is created in nr-objectstore-util that explains how to set up the three way sync between remote / local / ostore

Fix issue with the fwx download / reprocess

Spent a bunch of time getting the original R script to run as a container. Unfortunately this has not resulted in a working script. It is outputting some data, but not all the data.

Need to resolve why the data output from the reformat process does not include all the fields.

Configuration of the api fields being imported

Currently the field configuration is part of the API class. We want to create a configuration object that can be used to describe how the data is going to be transformed when it is downloaded from the BC Wildfire API.
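One hypothetical shape for such a configuration object; the field names and the per-field transform idea are illustrative, not the current API class.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FieldConfig:
    source_name: str                                 # field name returned by the API
    output_name: str                                 # column name used downstream
    transform: Callable[[Any], Any] = lambda v: v    # optional per-field conversion


@dataclass
class ApiImportConfig:
    fields: list[FieldConfig] = field(default_factory=list)


# Illustrative usage with made-up field names
config = ApiImportConfig(fields=[
    FieldConfig("temperature", "TA", float),
    FieldConfig("precipitation", "PC", float),
])
```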

Feat: Add Data Q/A check

The data that was downloaded on 2023-11-25 does not contain all the records for all the different timestamps.

Either the data wasn't available when the script ran, or the script was manually triggered due to a failure.

Current Logic:

  • script determines what day it is
  • Based on the current date calculates the "end_date" as the current date, with the time amended to be 9am
  • Calculates the local file path and the object store file path
  • Checks to see if the file that we are about to create exists in object storage; if it does not then it gets created, and if it already exists nothing happens.

Enhancement:

  • add logic to the download process to verify that there are 24 records for each weather station (see the sketch below)
  • if < 24 records then raise an error indicating that the data available for the current date is incomplete
  • Modify the github action to detect job success/failure
  • Add notifications for failed jobs.
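A sketch of the proposed completeness check, assuming the downloaded data lands in a pandas DataFrame with a station identifier column; the column name is illustrative.

```python
import pandas as pd


def check_hourly_completeness(df: pd.DataFrame, station_col: str = "station_id") -> None:
    """Raise if any station has fewer than 24 hourly records for the day."""
    counts = df.groupby(station_col).size()
    incomplete = counts[counts < 24]
    if not incomplete.empty:
        raise ValueError(
            "Incomplete data for the current date; stations with < 24 records: "
            + ", ".join(str(s) for s in incomplete.index)
        )
```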

Create GHA to ingest the Wildfire weather stations

This work will integrate the downloading and preparation of the wildfire data. Attempt to containerize and use the existing R scripts to accomplish this.

Ideally modify them so that they use environment variables to define file locations used.

Add a process at the end that uploads the data to object storage.

When the process is triggered it should check to identify whether the data already exists in object storage and, if so, halt processing.

Add Notifications for failures

We are starting to have a lot of jobs running as GitHub Actions. We require some kind of notification system to notify us when the jobs fail.

This should be relatively easy to do. Ideally it involves going through either the repository dispatch jobs or the cron triggered jobs and adding steps that issue notifications.

There are a couple options for these notifications:

  • email based (easy, but manual config, manual subscribe)
  • integrate with a teams channel that anyone could subscribe to.

Should also go through all the various other repos that have critical jobs associated with them and add notifications there as well.

Create Dashboard for success of Jobs

As we move more and more jobs into OpenSearch, we need to be able to create an OpenSearch query that will report on the last 5 to 10 runs and identify whether they were successful or not.

Create an OpenSearch query that reports on the status of the firedata_pipe and the zxs_pipe.

If time allows, create a Kibana dashboard.

climate_obs automated snow pillows (ASP) source data

We are currently pulling the ASP data from:
https://www.env.gov.bc.ca/wsd/data_searches/snow/asws/data/

Data is pulled daily from that location. It looks like the data is replaced on that site every hour?? Not 100% sure on that.

Data is also replicated to the datamart to:
https://hpfx.collab.science.gc.ca/20230830/WXO-DD/observations/swob-ml/partners/bc-env-snow/20230830/1a01p/

The origin of the data is the Aquarius database.
LOTS of redundancy here; thinking long term we should try to pull the data directly from the Aquarius API. Can work with water staff to configure that. Access to the API currently does not exist, but based on conversations it sounds like it's something that could be set up.

Extract the 3 way sync into a PyPI installable lib

A common operation that takes place with data used by hydrological models is what I refer to as a 3 way sync.

  • Data exists remotely with a data provider (most often the federal government datamart)
  • Local copy is created for rapid access / processing
  • The data gets persisted in object storage.

If a process should crash, it looks to object storage and will preferentially pull the data from object storage vs the original remote (sketched below).
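A minimal sketch of that retrieval preference; the callables stand in for whatever object-store and download functions the module actually provides.

```python
import os
from typing import Callable


def fetch(
    local_path: str,
    ostore_key: str,
    remote_url: str,
    ostore_exists: Callable[[str], bool],
    ostore_get: Callable[[str, str], None],
    ostore_put: Callable[[str, str], None],
    remote_get: Callable[[str, str], None],
) -> str:
    """Return a usable local copy: local file, else object storage, else the remote."""
    if os.path.exists(local_path):
        return local_path
    if ostore_exists(ostore_key):
        ostore_get(ostore_key, local_path)      # prefer the object-store copy
    else:
        remote_get(remote_url, local_path)      # fall back to the data provider
        ostore_put(local_path, ostore_key)      # persist for the next run
    return local_path
```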

Operations to support this work have been wrapped up into a Python module. To simplify the development of future work, the thinking is that all data pipelines that require this flow would use the same code.

This ticket would see the extraction of the object storage sync into its own PyPI installable module.
Could either create a new repo, or potentially add to the existing NR object storage lib
(if making it part of the NR object storage lib: https://github.com/bcgov/nr-objectstore-util).

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Awaiting Schedule

These updates are awaiting their schedule. Click on a checkbox to get an update now.

  • chore(deps): lock file maintenance

Pending Status Checks

These updates await pending status checks. To force their creation now, click the checkbox below.

  • chore(deps): update dependency pyarrow to v17
  • chore(deps): update github actions all dependencies (major) (actions/checkout, actions/setup-python, docker/build-push-action, docker/login-action, dorny/paths-filter, shrink/actions-docker-registry-tag, ubuntu)

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

dockerfile
r_data_prep.Dockerfile
  • rhub/r-minimal 4.5.0
  • rhub/r-minimal 4.5.0
r_py_data_prep.Dockerfile
  • rhub/r-minimal 4.5.0
  • rhub/r-minimal 4.5.0
win_docker/Dockerfile
  • mcr.microsoft.com/windows 1903
zxs_data_pull.Dockerfile
  • python 3.12.4-alpine
  • python 3.12.4-alpine
github-actions
.github/workflows/main_datamart_dl.yml
  • actions/checkout v3
  • actions/setup-python v4
  • ubuntu 20.04
.github/workflows/pr-close.yaml
  • redhat-actions/oc-login v1
  • actions/checkout v3
  • redhat-actions/oc-login v1
  • shrink/actions-docker-registry-tag v3
  • shrink/actions-docker-registry-tag v3
  • ubuntu 22.04
  • ubuntu 22.04
.github/workflows/pr-open-r.yaml
  • actions/checkout v3
  • dorny/paths-filter v2
  • dorny/paths-filter v2
  • docker/login-action v2
  • docker/build-push-action v4
  • shrink/actions-docker-registry-tag v3
  • docker/build-push-action v4
  • shrink/actions-docker-registry-tag v3
  • actions/checkout v3
  • redhat-actions/oc-login v1
  • ubuntu 22.04
  • ubuntu 22.04
  • ubuntu 22.04
.github/workflows/run_asp.yaml
  • actions/checkout v3
  • actions/setup-python v4
  • ubuntu 20.04
.github/workflows/run_climate_obs.yaml
  • actions/checkout v3
  • actions/setup-python v4
.github/workflows/run_fwx.yaml
  • actions/checkout v3
  • actions/setup-python v4
  • ubuntu 20.04
.github/workflows/run_mpoml.yaml
  • actions/checkout v3
  • actions/setup-python v4
  • ubuntu 20.04
helm-values
cicd/climateobs/values.yaml
pep621
pyproject.toml
pip_requirements
scripts/python/requirements-asp.txt
  • pandas ==2.2.2
  • bs4 ==0.0.2
  • nr_objstore_util ==0.10.0
  • requests ==2.32.3
scripts/python/requirements-datamartdl.txt
  • requests ==2.32.3
  • pandas ==2.2.2
  • numpy ==1.26.4
  • nr-objstore-util ==0.10.0
  • pyarrow ==14.0.2
scripts/python/requirements-dev.txt
  • black ==24.4.2
  • ruff ==0.5.2
scripts/python/requirements.txt
  • nr_objstore_util ==0.10.0
  • requests ==2.32.3
  • beautifulsoup4 ==4.12.3
  • python-dotenv ==1.0.1
poetry
pyproject.toml
  • python >=3.11,<3.13
  • nr-objstore-util ^0.10.0
  • requests ^2.31.0
  • beautifulsoup4 ^4.12.2
  • python-dotenv ^1.0.0
  • pandas ^2.1.1
  • pywin32 ^306
  • black 24.4.2
  • ruff ^0.5.0
  • mypy ^1.5.1

  • Check this box to trigger a request for Renovate to run again on this repository

Create GHA Job to ingest the Automated Snow Pillow Data

Pull the data, and then push it to object storage. Use a storage location similar to the one used on prem when locating the files in object storage.

Existing scripts to be replaced:
ASP_Climate.bat which calls ASP_daily_climate.R

Rework this logic in Python (see the sketch below).
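A hedged sketch of the Python replacement, assuming the source page is a plain directory listing of CSV files; the object storage upload step is omitted here and would use the existing sync code.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

ASP_URL = "https://www.env.gov.bc.ca/wsd/data_searches/snow/asws/data/"


def download_asp_files(dest_dir: str = ".") -> list[str]:
    """Download every CSV linked from the ASP data page (listing layout assumed)."""
    page = requests.get(ASP_URL, timeout=60)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    saved = []
    for link in soup.find_all("a"):
        href = link.get("href", "")
        if href.lower().endswith(".csv"):
            data = requests.get(urljoin(ASP_URL, href), timeout=60)
            data.raise_for_status()
            path = f"{dest_dir}/{href.rsplit('/', 1)[-1]}"
            with open(path, "wb") as fh:
                fh.write(data.content)
            saved.append(path)
    return saved
```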

Document Climate Observations Pipeline

Currently the repo: https://github.com/bcgov/nr-rfc-grib-copy collects and processes the climate forecast data.

The next input that is being tackled is the climate observations data pipeline. This pipeline ingests data from:

  • Automated Snow Pillows - Aquarius
  • Temperature by elevation data - federal gov data mart
  • BC Wildfire climate stations
  • other

Also try to document the data cleaning steps that take place when the data is ingested by Excel.

This task will identify all the existing schedules, and related scripts that are used to collect this information.

The first step for this work will be to create a repository where the work for a data pipeline can be documented.

Add project lifecycle badge

No Project Lifecycle Badge found in your readme!

Hello! I scanned your readme and could not find a project lifecycle badge. A project lifecycle badge will provide contributors to your project as well as other stakeholders (platform services, executive) insight into the lifecycle of your repository.

What is a Project Lifecycle Badge?

It is a simple image that neatly describes your project's stage in its lifecycle. More information can be found in the project lifecycle badges documentation.

What do I need to do?

I suggest you make a PR into your README.md and add a project lifecycle badge near the top where it is easy for your users to pick it up :). Once it is merged feel free to close this issue. I will not open up a new one :)
