
drb-estuary-salinity-ml's People

Contributors

amsnyder, galengorski, jds485, jpadilla-usgs, ted80810

drb-estuary-salinity-ml's Issues

Timezones for aggregating sub-daily data

When we are aggregating sub-daily data to a daily time step, I think we should make sure that we aggregate in local time. I think a lot of our data queries happen in UTC/GMT, and when we aggregate that data then the daily tidal signature (or meteorological signal) will get shifted. Am I thinking about this right?
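A minimal pandas sketch of the shift in question, assuming a US Eastern station (the specific time zone is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sub-daily series indexed by UTC timestamps
idx = pd.date_range("2019-01-01", periods=96, freq="h", tz="UTC")
ts = pd.Series(range(96), index=idx)

# Aggregating in UTC: day boundaries fall at 00:00 UTC,
# i.e. 19:00/20:00 local time on the US East Coast
daily_utc = ts.resample("D").mean()

# Convert to local time first so each "day" matches the local calendar day
daily_local = ts.tz_convert("America/New_York").resample("D").mean()

print(daily_utc.index[0], daily_local.index[0])
```

The two results have different day boundaries, which is exactly the shift that would distort a daily tidal or meteorological signature.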

Error when running Snakefile_fetch_munge (#95) from scratch

I have deleted all of the files in 01_fetch/out/ and 02_munge/out/ and am trying to verify that Snakefile_fetch_munge can recreate them. The most up-to-date version of Snakefile_fetch_munge is in #95.

When I run snakemake -s Snakefile_fetch_munge --cores all, I get the following error:

[screenshot: snakemake error during the NWIS fetch rule]

However, 01_fetch/out/usgs_nwis_01474500.txt is created, and it appears to contain the correct data.

usgs data getting dropped between fetch and munge step

It looks like some data get dropped during the munge step of the pipeline. Looking specifically at USGS site 01463500 (Trenton) in 01_fetch/out, I see specific conductance data with the qualifier flag "A", which indicates that it is approved. However, in 02_munge/out/usgs_nwis_01463500.csv there is no specific conductance data. I wonder if this is an issue with how the flags are handled. I haven't systematically checked the other sites yet, but it may be an issue elsewhere too.
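A hypothetical sketch of keeping approved records by qualifier flag during munging; the column names and paired-flag layout here are assumptions, not the repo's actual schema:

```python
import pandas as pd

# Hypothetical raw fetch output: each parameter column has a paired
# qualifier-code column ("A" = approved, "P" = provisional)
df = pd.DataFrame({
    "spec_cond": [410.0, 415.0, 390.0],
    "spec_cond_cd": ["A", "A", "P"],
})

# Keep records whose flag starts with "A" (approved, approved-edited, etc.)
# rather than dropping the whole column when any flag is unexpected
approved = df.loc[df["spec_cond_cd"].str.startswith("A"), "spec_cond"]
print(approved.tolist())
```

If the munge step filters on exact flag strings, a slightly different flag value could silently drop an entire approved series, which would match the symptom described above.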

Add NOAA NOS munge step

  • should process data to daily average
  • process all variable files for one site into a single file with one variable per column
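A hedged sketch of what this munge step might look like in pandas; the long-format input layout and column names are assumptions:

```python
import pandas as pd

# Hypothetical long-format NOAA NOS records: one row per (timestamp, variable)
raw = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2019-01-01 00:00", "2019-01-01 06:00",
         "2019-01-01 00:00", "2019-01-01 06:00"]),
    "variable": ["water_level", "water_level", "water_temp", "water_temp"],
    "value": [1.2, 1.6, 8.0, 8.4],
})

# One column per variable for the site, then aggregate to a daily average
wide = raw.pivot(index="datetime", columns="variable", values="value")
daily = wide.resample("D").mean()
print(daily)
```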

update Snakefile

The Snakefile needs to be updated to include the three steps in 03_it_analysis:

  1. it_analysis_data_prep.py
  2. make_heatmap_matrix.py
  3. plot_heat_map_config.py

Extract daily data from sub-daily tidal data

This was converted from a discussion into an issue; the replies are copied below:

Based on the 2/15/22 meeting (notes here), extracting data from sub-daily tidal data so that it is usable at the daily time step seems like a more tractable approach.

One suggestion for capturing a sub-daily temporal signal in daily water level data from @aappling-usgs:

  • split the hourly water level data into 24 different "daily" datasets, so we would have water level at 1:00 am, 2:00 am, etc., each at a daily time step. This seems worth a shot for modeling, but it will be less interpretable from an "analysis of the drivers" point of view
  • @ted80810 is looking into some daily summary statistics of tidal fluctuations that we might be able to use

Originally posted by @galengorski in #50 (comment)
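The first suggestion above can be sketched in pandas roughly like this, using synthetic hourly data rather than the project's actual series:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly water-level series (M2-like ~12.42 h period)
idx = pd.date_range("2019-01-01", periods=72, freq="h")
wl = pd.Series(np.sin(2 * np.pi * idx.hour / 12.42), index=idx)

# One "daily" column per hour of day: rows are dates, columns are hours 0..23,
# giving 24 parallel daily datasets
by_hour = wl.groupby([wl.index.date, wl.index.hour]).mean().unstack()
print(by_hour.shape)
```

Each column is then a daily time series of water level at a fixed clock hour, which a daily-time-step model can ingest directly.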

butterworth filter not working on all sites

Some sites return no data when put through the Butterworth filter in the munge step (e.g., site 8551762). This appears to be an issue at sites that have missing data in the time series.
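One plausible mechanism, sketched with SciPy on synthetic data: `filtfilt` silently propagates NaNs through the entire output, so a single gap can wipe out a whole site's filtered series. Interpolating across gaps first avoids this (the gap handling shown is a suggestion, not the repo's current behavior):

```python
import numpy as np
from scipy.signal import butter, filtfilt

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20 * np.pi, 500)) + 0.1 * rng.standard_normal(500)
x[100:110] = np.nan  # simulate a data gap

b, a = butter(3, 0.1)

# filtfilt runs forward then backward, so NaNs spread through the whole
# output; fill gaps before filtering
good = ~np.isnan(x)
filled = np.interp(np.arange(x.size), np.flatnonzero(good), x[good])
y = filtfilt(b, a, filled)
print(np.isnan(y).sum())
```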

Interpreting functional performance of estuary salinity model

I have been trying to produce some functional performance metrics for the estuary salinity ML model and to compare them to COAWST (a hydrodynamic model). I am trying to interpret them and would love to get thoughts and discussion. We have COAWST model runs for calendar years 2016, 2018, and 2019. We have run the ML model from 2001-2020, with 2001-2015 as the training period and 2016-2020 as the testing period. The questions we are trying to address with functional performance are:

  1. How well do the ML and COAWST models reproduce relationships between input variables and output variables?
  2. Are there critical time scales that either model does or doesn't reproduce well?
  3. Can we use IT metrics to help identify processes that the models are/aren't representing well?

The following are a couple of plots with explanation and questions for discussion. I would love to get your opinions or thoughts when you have a second @jds485 @salme146 @jdiaz4302 as I know you all have different expertise on this.

Discussion: What is the best time scale to use for our models and analysis?

@galengorski

These thoughts are based on some conversations with @amsnyder, @ted80810, and @salme146. I'd like this discussion to be a place where we can talk about model time scales and what it might take to move from a daily to a sub-daily time scale (and whether it would be worthwhile).

Up to this point we have been working at the daily time step. We're working at this time step for a few reasons:

  1. The daily (or even weekly) time step is the scale that stakeholders consider and on which management decisions are made
  2. It makes multi-year analysis and modeling more computationally manageable
  3. Daily time step is the resolution that inland salinity and other DRB modeling campaigns are using, which might make eventual coupling easier
  4. Meteorological data from GridMET is at the daily time step
  5. Daily discharge from Trenton and Schuylkill has gaps, which can easily be filled by PRMS predictions at the daily time scale

However, the daily time step has some downsides too:

  1. Tidal forcings at the mouth of the estuary are really important for driving water in and out of the estuary, which can have a huge influence on the salt front location. Aggregating tidal signals to a daily average doesn't make sense because tides have a dominant period of roughly 12 hours. Talking to @salme146 and John, there really isn't a great way to represent tidal information at the daily time step.
  2. Information theory calculations are pretty data hungry: they require ~200-300 data points to robustly estimate the pdf of a variable, depending on its distribution. With daily data we would only be able to make calculations about once per year; finer-resolution data would let us calculate information transfer (timescales, redundancy, synergies, etc.) seasonally and for specific storm events, which would be really interesting.

My take is that working with the daily time scale is fine for now and might make the development of methods easier, but it might be a good idea to see what it would take to move to a sub-daily timescale.

changes to environment.yaml file

We need to incorporate the following changes to the environment.yaml file:

  • install the yaml module using conda install -c anaconda yaml so that it is accessible within the conda env
  • add the drb_estuary_salinity main repo directory to the PYTHONPATH so that the utils module can be loaded. I did this locally by creating a .pth file and adding it to the site-packages directory, following this blog post.
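A sketch of the .pth approach in Python; the repo path and .pth filename are hypothetical, and the actual write is left commented out:

```python
import sysconfig
from pathlib import Path

# Hypothetical local repo path -- adjust to where the repo is cloned
repo_dir = Path.home() / "drb-estuary-salinity-ml"

# The interpreter reads every .pth file in site-packages at startup and
# appends each line to sys.path, which makes the repo's utils module
# importable from anywhere in the environment
site_packages = Path(sysconfig.get_paths()["purelib"])
pth_file = site_packages / "drb_estuary_salinity.pth"
print(f"would write {repo_dir} into {pth_file}")
# pth_file.write_text(str(repo_dir) + "\n")  # uncomment to apply
```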

Discussion: Git workflow for editing pipeline more cleanly

@galengorski

This is a generalized workflow for editing different parts of the pipeline more cleanly, so that we can avoid multiple people working on the same part of the project at once. It's also a place where I can put these git commands so I don't forget them.

  • Suppose you are working on 03b_model/src/run_model.py on a branch called model_edits and find that a change to 01_fetch/src/fetch_usgs_nwis.py is required
  • on model_edits, commit your changes to run_model.py
  • switch back to the main branch on local
  • create a new branch (from main) for the small changes to fetch_usgs_nwis.py using git checkout -b fetch_changes
  • make the small changes to fetch, commit, and push fetch_changes upstream, creating the new branch in the remote repo
  • merge the small fetch changes into the remote repo's main branch
  • back on local, switch to the main branch: git checkout main
  • pull the changes from the remote repo into local main using git pull upstream main
  • switch back to the model_edits branch
  • merge in the new changes: git merge main -m "merge small change to fetch nwis"

Testing information theory functions

I'd like to come up with a series of tests to benchmark and get a feel for the information theory functions. In general the goals would be:

  1. Compare results to other methods/code libraries to make sure we are producing similar answers and the functions are technically sound
  2. Create examples that help develop intuition for what these functions tell us and how to interpret their results
  3. Conduct sensitivity testing to understand how decisions like bin numbers and calculating critical values affect the final results
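As an example of goal 1, a plug-in histogram estimator of mutual information can be benchmarked against the analytic value for a bivariate Gaussian, MI = -(1/2) ln(1 - ρ²) nats. This estimator is a generic sketch for benchmarking, not the project's implementation:

```python
import numpy as np

def mutual_info_hist(x, y, bins=16):
    """Plug-in mutual information estimate (nats) from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty bins (log(0) terms)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# Correlated bivariate Gaussian with a known analytic MI
rng = np.random.default_rng(42)
rho, n = 0.8, 100_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

est = mutual_info_hist(x, y)
exact = -0.5 * np.log(1 - rho**2)  # ~0.511 nats
print(f"estimate: {est:.3f}  analytic: {exact:.3f}")
```

The same setup makes goal 3 concrete: sweeping `bins` and `n` shows how binning decisions bias the estimate relative to the known answer.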

Pipeline Improvements

Compiling a list of pipeline improvements that could be made in the future, if time allows:

  • Update Snakefile to run git submodule add to get river-dl code into 03b_model/src directory
  • Use dask to distribute the work in fetch_coawst_model.py
  • Set up a Docker container and push that image to dockerhub to run our pipeline in
  • Get Docker container running on Tallgrass
  • Get Snakemake/S3 connection working so inputs/outputs can come directly from S3
  • Use Snakemake inputs/outputs/params in the Python scripts instead of hardcoding them in the scripts
  • Review Snakefiles for other improvements, based on information gained in Snakemake tutorial writing process

Issues building the pipeline

Documenting issues for PR #101:
I'm using conda 4.12.0

  1. Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.
     • I think you can solve this by adding pip to the dependencies list
  2. Solving environment: failed
    ResolvePackageNotFound:
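A hedged sketch of the fix the first warning asks for; the package names below are placeholders, not the repo's actual dependencies:

```yaml
# environment.yaml fragment -- the warning goes away once pip is listed
# explicitly among the conda dependencies
dependencies:
  - python=3.9
  - pip
  - pip:
      - some-pip-only-package
```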

Discussion: Ideas on how to use information theory to assess models

@galengorski

  1. Using mutual information as a measure of predictive performance and transfer entropy as a measure of functional performance across a range of model decisions (e.g., the addition of process guidance or the calibration of a parameter)

  2. Using temporal information partitioning (TIPNets) to investigate the amount of information flow from dynamic features to modeled or observed output that is unique, redundant and/or synergistic with another feature. This is another way to assess feature -> target relationships and how a model represents them.

  3. Characterizing critical time scales of influence from feature to target by assessing transfer entropy across a range of time lags, then comparing time scales across different sites to see how site characteristics are associated with process coupling.

Below are some more specific comments and examples for each method.

Site 01467200 has anomalous column headers

It looks like site 01467200, Delaware River at Penn's Landing, has multiple observations for several of the parameters we're interested in (temperature, specific conductivity, pH, etc.). This appears to be because some new measurements are being taken at this site in collaboration with the Independence Seaport Museum, which is why some of the column headers have ISM in the name. Unfortunately these have the same parameter values as the variables we actually want, so we'll have to screen them out, perhaps by column name. I went into the munge/out csv file and deleted the columns by hand, but this could be done in the munge step.
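Screening by column name could look something like this in the munge step; the column names below are hypothetical stand-ins for the site's actual headers:

```python
import pandas as pd

# Hypothetical fetched columns for site 01467200: the ISM collaboration
# duplicates parameters under headers containing "ISM"
df = pd.DataFrame(columns=[
    "datetime", "Temperature", "Temperature_ISM",
    "SpecCond", "SpecCond_ISM"])

# Drop the duplicate ISM measurements by column name
df = df.loc[:, ~df.columns.str.contains("ISM")]
print(list(df.columns))
```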

calculation of salt front

I have been trying to recreate the salt front time series supplied by Salme et al. That dataset gives both a daily and a 7-day-averaged estimate of the salt front location. The DRB has a dashboard where they report the salt front location, and they give the methods for its calculation:
[screenshot: salt front calculation methods from the DRB dashboard]

add atmospheric pressure into noaa fetch

We want to add atmospheric pressure as a variable fetched from NOAA. This raises the question: why not fetch temperature and wind from the NOAA stations as well? The NOAA stations don't have precipitation, but that likely isn't an important driver, and we can get it from gridMET. Lewes and Cape May only have barometric pressure, air temperature, and wind starting 08-12-2002, which means we would miss out on 2001, which was a drought year.

NERRS and gridMET windspeed data do not agree

In looking at data from the National Estuarine Research Reserve System (NERRS) and comparing it to gridMET data for the same location, I have found that temperature, precipitation, and wind direction match decently well. However, the windspeed does not:

[plot: NERRS vs. gridMET windspeed comparison]

Here is a zoomed-in version:
[plot: NERRS vs. gridMET windspeed, zoomed in]

The NERRS data are only missing <0.5% of days (35/7305), so we will just linearly interpolate to fill the missing days, but the discrepancy is a little weird. One possible explanation is that the NERRS site is slightly inland from the estuary, while the gridMET resolution is 4 km and the cell includes part of the estuary. The average wind speed in that 4 km grid cell could simply be higher if much of it were over open water.
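A sketch of the gap filling with pandas on synthetic data; the `limit` guard against bridging long outages is a suggestion, not existing pipeline behavior:

```python
import numpy as np
import pandas as pd

# Hypothetical daily NERRS windspeed with a few missing days (<0.5% overall)
idx = pd.date_range("2019-01-01", periods=10, freq="D")
ws = pd.Series([3.1, 3.4, np.nan, np.nan, 2.9, 3.0, np.nan, 3.3, 3.2, 3.5],
               index=idx)

# Linear interpolation across short gaps; limit caps how many consecutive
# missing days get bridged with a straight line
filled = ws.interpolate(method="time", limit=3)
print(filled.isna().sum())
```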

historical data for preprocessing

A common preprocessing step for information theory is removing the persistent seasonal signal from the data. Usually a day-of-year average value is subtracted from each data point, but to do this we would need several years of data prior to the year we are processing. Right now the plan is to analyze 2019. I'm not sure how much prior data we'd need to do this effectively.
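A sketch of the day-of-year deseasonalizing step in pandas, using synthetic multi-year data; the 2014-2018 training window is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: seasonal cycle plus noise over several years
idx = pd.date_range("2014-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(1)
x = pd.Series(10 + 5 * np.sin(2 * np.pi * idx.dayofyear / 365.25)
              + rng.normal(0, 1, len(idx)), index=idx)

# Day-of-year climatology from the prior years, applied to the analysis year
train = x.loc[:"2018"]
clim = train.groupby(train.index.dayofyear).mean()
year = x.loc["2019"]
anom = year - year.index.dayofyear.map(clim).values
print(float(anom.mean()))
```

With only one or two prior years the climatology itself is noisy, which is the crux of the "how much data" question above.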

Discussion: River pathway along which we want to calculate salt front location

@amsnyder:

The code that @ted80810 wrote to fetch the COAWST model output pulls in the pathway along which we should calculate salt front location from a csv file of coordinates that was provided to us by @salme146. This pathway is close to the shore, rather than in the middle of the channel or somewhere else. I am creating this discussion to flag this and possibly have more discussion about if this is the best place to calculate salt front location.

Discussion: Thinking about driver time lags in understanding the salt front dynamics

@galengorski

We're interested in the general question, "What are the key drivers of the salt front location and (how) do those vary as a function of the salt front location?"

One approach is to bin the salt front time series into "river mile bins" based on changes in river geometry or bathymetry and look at the relationship between drivers and salt front location within each one of those bins.

[plot: salt front time series with river mile intervals]

This is a time series of the salt front location with the red lines indicating the locations of the bins. The bottom panels show the discharge and specific conductivity at Trenton and Schuylkill. These are the drivers we'll start with.
