
drb-estuary-salinity-ml's People

Contributors

amsnyder, galengorski, jds485, jpadilla-usgs, ted80810

drb-estuary-salinity-ml's Issues

Timezones for aggregating sub-daily data

When we are aggregating sub-daily data to a daily time step, I think we should make sure that we aggregate in local time. I think a lot of our data queries happen in UTC/GMT, and when we aggregate that data then the daily tidal signature (or meteorological signal) will get shifted. Am I thinking about this right?
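A minimal pandas sketch of the shift in question, assuming a US Eastern station (the specific time zone is an assumption for illustration):

```python
import pandas as pd

# Hypothetical sub-daily series indexed by UTC timestamps
idx = pd.date_range("2019-01-01", periods=96, freq="h", tz="UTC")
ts = pd.Series(range(96), index=idx)

# Aggregating in UTC: day boundaries fall at 00:00 UTC,
# i.e. 19:00/20:00 local time on the US East Coast
daily_utc = ts.resample("D").mean()

# Convert to local time first so each "day" matches the local calendar day
daily_local = ts.tz_convert("America/New_York").resample("D").mean()

print(daily_utc.index[0], daily_local.index[0])
```

The two results have different day boundaries, which is exactly the shift that would distort a daily tidal or meteorological signature.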

Error when running Snakefile_fetch_munge (#95) from scratch

I have deleted all of the files in 01_fetch/out/ and 02_munge/out/ and am trying to verify that Snakefile_fetch_munge can recreate them. The most up-to-date version of Snakefile_fetch_munge is in #95.

When I run snakemake -s Snakefile_fetch_munge --cores all, I get the following error:

[screenshot: snakemake error during the NWIS fetch rule]

However, 01_fetch/out/usgs_nwis_01474500.txt is created, and it appears to contain the correct data.

usgs data getting dropped between fetch and munge step

It looks like some data get dropped during the munge step of the pipeline. Looking specifically at USGS site 01463500 (Trenton) in 01_fetch/out, I see specific conductance data with the qualifier flag "A", which indicates that it is approved. However, in 02_munge/out/usgs_nwis_01463500.csv there is no specific conductance data. I wonder if this is an issue with how the flags are handled. I haven't systematically checked the other sites yet, but it may be an issue elsewhere too.
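A hypothetical sketch of keeping approved records by qualifier flag during munging; the column names and paired-flag layout here are assumptions, not the repo's actual schema:

```python
import pandas as pd

# Hypothetical raw fetch output: each parameter column has a paired
# qualifier-code column ("A" = approved, "P" = provisional)
df = pd.DataFrame({
    "spec_cond": [410.0, 415.0, 390.0],
    "spec_cond_cd": ["A", "A", "P"],
})

# Keep records whose flag starts with "A" (approved, approved-edited, etc.)
# rather than dropping the whole column when any flag is unexpected
approved = df.loc[df["spec_cond_cd"].str.startswith("A"), "spec_cond"]
print(approved.tolist())
```

If the munge step filters on exact flag strings, a slightly different flag value could silently drop an entire approved series, which would match the symptom described above.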

Add NOAA NOS munge step

  • should process data to daily average
  • process all variable files for one site into a single file with one variable per column
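A hedged sketch of what this munge step might look like in pandas; the long-format input layout and column names are assumptions:

```python
import pandas as pd

# Hypothetical long-format NOAA NOS records: one row per (timestamp, variable)
raw = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2019-01-01 00:00", "2019-01-01 06:00",
         "2019-01-01 00:00", "2019-01-01 06:00"]),
    "variable": ["water_level", "water_level", "water_temp", "water_temp"],
    "value": [1.2, 1.6, 8.0, 8.4],
})

# One column per variable for the site, then aggregate to a daily average
wide = raw.pivot(index="datetime", columns="variable", values="value")
daily = wide.resample("D").mean()
print(daily)
```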

update Snakefile

The Snakefile needs to be updated to include the three steps in 03_it_analysis:

  1. it_analysis_data_prep.py
  2. make_heatmap_matrix.py
  3. plot_heat_map_config.py

Extract daily data from sub-daily tidal data

This was converted from a discussion into an issue; the replies are copied below:

Based on the 2/15/22 meeting (notes here), extracting data from sub-daily tidal data so that it is usable at the daily time step seems like a more tractable approach.

One suggestion for capturing a sub-daily temporal signal in daily water level data from @aappling-usgs:

  • split the hourly water level data into 24 different "daily" datasets, so we would have water level at 1:00 am, 2:00 am, etc., each at a daily time step. This seems worth a shot for modeling, but it will be less interpretable from an "analysis of the drivers" point of view
  • @ted80810 is looking into some daily summary statistics of tidal fluctuations that we might be able to use

Originally posted by @galengorski in #50 (comment)
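The first suggestion above can be sketched in pandas roughly like this, using synthetic hourly data rather than the project's actual series:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly water-level series (M2-like ~12.42 h period)
idx = pd.date_range("2019-01-01", periods=72, freq="h")
wl = pd.Series(np.sin(2 * np.pi * idx.hour / 12.42), index=idx)

# One "daily" column per hour of day: rows are dates, columns are hours 0..23,
# giving 24 parallel daily datasets
by_hour = wl.groupby([wl.index.date, wl.index.hour]).mean().unstack()
print(by_hour.shape)
```

Each column is then a daily time series of water level at a fixed clock hour, which a daily-time-step model can ingest directly.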

butterworth filter not working on all sites

Some sites return no data when put through the Butterworth filter in the munge step (e.g., site 8551762). This appears to be an issue at sites that have missing data in the time series.
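One plausible mechanism, sketched with SciPy on synthetic data: `filtfilt` silently propagates NaNs through the entire output, so a single gap can wipe out a whole site's filtered series. Interpolating across gaps first avoids this (the gap handling shown is a suggestion, not the repo's current behavior):

```python
import numpy as np
from scipy.signal import butter, filtfilt

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 20 * np.pi, 500)) + 0.1 * rng.standard_normal(500)
x[100:110] = np.nan  # simulate a data gap

b, a = butter(3, 0.1)

# filtfilt runs forward then backward, so NaNs spread through the whole
# output; fill gaps before filtering
good = ~np.isnan(x)
filled = np.interp(np.arange(x.size), np.flatnonzero(good), x[good])
y = filtfilt(b, a, filled)
print(np.isnan(y).sum())
```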

Interpreting functional performance of estuary salinity model

I have been trying to produce some functional performance metrics for the estuary salinity ML model and to compare them to COAWST (a hydrodynamic model). I am trying to interpret them and would love to get thoughts and discussion. We have COAWST model runs for calendar years 2016, 2018, and 2019. We have run the ML model from 2001-2020, with 2001-2015 as the training period and 2016-2020 as the testing period. The questions we are trying to address with functional performance are:

  1. How well do the ML and COAWST models reproduce relationships between input variables and output variables?
  2. Are there critical time scales that either model does or doesn't reproduce well?
  3. Can we use IT metrics to help identify processes that the models are/aren't representing well?

The following are a couple of plots with explanation and questions for discussion. I would love to get your opinions or thoughts when you have a second @jds485 @salme146 @jdiaz4302 as I know you all have different expertise on this.

Discussion: What is the best time scale to use for our models and analysis?

@galengorski

These thoughts are based on some conversations with @amsnyder, @ted80810, and @salme146. I'd like this discussion to be a place where we can talk about model time scales and what it might take to move from a daily to a sub-daily time scale (and whether it would be worthwhile).

Up to this point we have been working at the daily time step. We're working at this time step for a few reasons:

  1. The daily (or even weekly) time step is the scale that stakeholders consider and on which management decisions are made
  2. It makes multi-year analysis and modeling more computationally manageable
  3. Daily time step is the resolution that inland salinity and other DRB modeling campaigns are using, which might make eventual coupling easier
  4. Meteorological data from GridMET is at the daily time step
  5. Daily discharge from Trenton and Schuylkill has gaps, which can easily be filled by PRMS predictions at the daily time scale

However, the daily time step has some downsides too:

  1. Tidal forcings at the mouth of the estuary are really important for driving water in and out of the estuary, which can have a huge influence on the salt front location. Aggregating tidal signals to a daily average doesn't make sense because tides have a dominant period of roughly 12 hours. Talking to @salme146 and John, there really isn't a great way to represent tidal information at the daily time step.
  2. Information theory calculations are pretty data hungry: they require ~200-300 data points to robustly estimate the pdf of a variable, depending on its distribution. With daily data we would only be able to make calculations about once per year; finer-resolution data would let us calculate information transfer (timescales, redundancy, synergies, etc.) seasonally and for specific storm events, which would be really interesting.

My take is that working with the daily time scale is fine for now and might make the development of methods easier, but it might be a good idea to see what it would take to move to a sub-daily timescale.

changes to environment.yaml file

We need to incorporate the following changes to the environment.yaml file:

  • install the yaml module using conda install -c anaconda yaml so that it is accessible within the conda env
  • add the drb_estuary_salinity main repo directory to the PYTHONPATH so that the utils module can be loaded. I did this locally by creating a .pth file and adding it to the site-packages directory, following this blog post.
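A sketch of the .pth approach in Python; the repo path and .pth filename are hypothetical, and the actual write is left commented out:

```python
import sysconfig
from pathlib import Path

# Hypothetical local repo path -- adjust to where the repo is cloned
repo_dir = Path.home() / "drb-estuary-salinity-ml"

# The interpreter reads every .pth file in site-packages at startup and
# appends each line to sys.path, which makes the repo's utils module
# importable from anywhere in the environment
site_packages = Path(sysconfig.get_paths()["purelib"])
pth_file = site_packages / "drb_estuary_salinity.pth"
print(f"would write {repo_dir} into {pth_file}")
# pth_file.write_text(str(repo_dir) + "\n")  # uncomment to apply
```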

Discussion: Git workflow for editing pipeline more cleanly

@galengorski

This is a generalized workflow for editing different parts of the pipeline more cleanly, so that we can avoid multiple people working on the same part of the project at once. It's also a place where I can put these git commands so I don't forget them.

  • Suppose you are working on 03b_model/src/run_model.py on a branch called model_edits and find that a change to 01_fetch/src/fetch_usgs_nwis.py is required
  • on model_edits, commit your changes to run_model.py
  • switch back to the main branch on local
  • create a new branch (from main) for the small changes to fetch_usgs_nwis.py using git checkout -b fetch_changes
  • make the small changes to fetch, commit, and push fetch_changes upstream, creating the new branch in the remote repo
  • merge the small fetch changes into the remote repo's main branch
  • back on local, switch to the main branch: git checkout main
  • pull the changes from the remote repo into local main using git pull upstream main
  • switch back to the model_edits branch
  • merge in the new changes: git merge main -m "merge small change to fetch nwis"

Testing information theory functions

I'd like to come up with a series of tests to benchmark and get a feel for the information theory functions. In general the goals would be:

  1. Compare results to other methods/code libraries to make sure we are producing similar answers and the functions are technically sound
  2. Create examples that help develop intuition for what these functions tell us and how to interpret their results
  3. Conduct sensitivity testing to understand how decisions like bin numbers and calculating critical values affect the final results
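As an example of goal 1, a plug-in histogram estimator of mutual information can be benchmarked against the analytic value for a bivariate Gaussian, MI = -(1/2) ln(1 - ρ²) nats. This estimator is a generic sketch for benchmarking, not the project's implementation:

```python
import numpy as np

def mutual_info_hist(x, y, bins=16):
    """Plug-in mutual information estimate (nats) from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty bins (log(0) terms)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# Correlated bivariate Gaussian with a known analytic MI
rng = np.random.default_rng(42)
rho, n = 0.8, 100_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

est = mutual_info_hist(x, y)
exact = -0.5 * np.log(1 - rho**2)  # ~0.511 nats
print(f"estimate: {est:.3f}  analytic: {exact:.3f}")
```

The same setup makes goal 3 concrete: sweeping `bins` and `n` shows how binning decisions bias the estimate relative to the known answer.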

Pipeline Improvements

Compiling a list of pipeline improvements that could be made in the future, if time allows:

  • Update Snakefile to run git submodule add to get river-dl code into 03b_model/src directory
  • Use dask to distribute the work in fetch_coawst_model.py
  • Set up a Docker container and push that image to dockerhub to run our pipeline in
  • Get Docker container running on Tallgrass
  • Get Snakemake/S3 connection working so inputs/outputs can come directly from S3
  • Use Snakemake inputs/outputs/params in the Python scripts instead of hardcoding them in the scripts
  • Review Snakefiles for other improvements, based on information gained in Snakemake tutorial writing process

Issues building the pipeline

Documenting issues for PR #101:
I'm using conda 4.12.0

  1. Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies. Conda may not use the correct pip to install your packages, and they may end up in the wrong place. Please add an explicit pip dependency. I'm adding one for you, but still nagging you.
     • I think you can solve this by adding pip to the dependencies list
  2. Solving environment: failed
    ResolvePackageNotFound:
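A hedged sketch of the fix the first warning asks for; the package names below are placeholders, not the repo's actual dependencies:

```yaml
# environment.yaml fragment -- the warning goes away once pip is listed
# explicitly among the conda dependencies
dependencies:
  - python=3.9
  - pip
  - pip:
      - some-pip-only-package
```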

Discussion: Ideas on how to use information theory to assess models

@galengorski

  1. Using mutual information as a measure of predictive performance and transfer entropy as a measure of functional performance across a range of model decisions (e.g., the addition of process guidance or the calibration of a parameter)

  2. Using temporal information partitioning (TIPNets) to investigate the amount of information flow from dynamic features to modeled or observed output that is unique, redundant and/or synergistic with another feature. This is another way to assess feature -> target relationships and how a model represents them.

  3. Characterizing critical time scales of influence from feature to target by assessing transfer entropy across a range of time lags, then comparing time scales across different sites to see how site characteristics are associated with process coupling.

Below are some more specific comments and examples for each method.

Site 01467200 has anomalous column headers

It looks like site 01467200, Delaware River at Penn's Landing, has multiple observations for several of the parameters we're interested in (temperature, specific conductivity, pH, etc.). This appears to be because some new measurements are being taken at this site in collaboration with the Independence Seaport Museum, which is why some of the column headers have ISM in the name. Unfortunately these have the same parameter values as the variables we actually want, so we'll have to screen them out, perhaps by column name. I went into the munge/out csv file and deleted the columns by hand, but this could be done in the munge step.
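Screening by column name could look something like this in the munge step; the column names below are hypothetical stand-ins for the site's actual headers:

```python
import pandas as pd

# Hypothetical fetched columns for site 01467200: the ISM collaboration
# duplicates parameters under headers containing "ISM"
df = pd.DataFrame(columns=[
    "datetime", "Temperature", "Temperature_ISM",
    "SpecCond", "SpecCond_ISM"])

# Drop the duplicate ISM measurements by column name
df = df.loc[:, ~df.columns.str.contains("ISM")]
print(list(df.columns))
```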

calculation of salt front

I have been trying to recreate the salt front time series supplied by Salme et al. That dataset gives both a daily and a 7-day-averaged estimate of the salt front location. The DRB has a dashboard where they report the salt front location, and they give the methods for its calculation:
[screenshot: salt front calculation methods from the DRB dashboard]

add atmospheric pressure into noaa fetch

We want to add atmospheric pressure as a variable fetched from NOAA. This raises the question: why not fetch temperature and wind from the NOAA stations as well? The NOAA stations don't have precipitation, but that likely isn't an important driver, and we can get it from gridMET. Lewes and Cape May only have barometric pressure, air temperature, and wind starting 08-12-2002, which means we would miss out on 2001, which was a drought year.

NERRS and gridMET windspeed data do not agree

In looking at data from the National Estuarine Research Reserve System (NERRS) and comparing it to gridMET data for the same location, I have found that temperature, precipitation, and wind direction match decently well. However, the windspeed does not:

[plot: NERRS vs. gridMET windspeed comparison]

Here is a zoomed-in version:
[plot: NERRS vs. gridMET windspeed, zoomed in]

The NERRS data are only missing <0.5% of days (35/7305), so we will just linearly interpolate to fill the missing days, but the discrepancy is a little weird. One possible explanation is that the NERRS site is slightly inland from the estuary, while the gridMET resolution is 4 km and the cell includes part of the estuary. The average wind speed in that 4 km grid cell could simply be higher if much of it were over open water.
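A sketch of the gap filling with pandas on synthetic data; the `limit` guard against bridging long outages is a suggestion, not existing pipeline behavior:

```python
import numpy as np
import pandas as pd

# Hypothetical daily NERRS windspeed with a few missing days (<0.5% overall)
idx = pd.date_range("2019-01-01", periods=10, freq="D")
ws = pd.Series([3.1, 3.4, np.nan, np.nan, 2.9, 3.0, np.nan, 3.3, 3.2, 3.5],
               index=idx)

# Linear interpolation across short gaps; limit caps how many consecutive
# missing days get bridged with a straight line
filled = ws.interpolate(method="time", limit=3)
print(filled.isna().sum())
```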

historical data for preprocessing

A common preprocessing step for information theory is removing the persistent seasonal signal from the data. Usually a day-of-year average value is subtracted from each data point, but to do this we would need several years of data prior to the year we are processing. Right now the plan is to analyze 2019. I'm not sure how much prior data we'd need to do this effectively.
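A sketch of the day-of-year deseasonalizing step in pandas, using synthetic multi-year data; the 2014-2018 training window is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: seasonal cycle plus noise over several years
idx = pd.date_range("2014-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(1)
x = pd.Series(10 + 5 * np.sin(2 * np.pi * idx.dayofyear / 365.25)
              + rng.normal(0, 1, len(idx)), index=idx)

# Day-of-year climatology from the prior years, applied to the analysis year
train = x.loc[:"2018"]
clim = train.groupby(train.index.dayofyear).mean()
year = x.loc["2019"]
anom = year - year.index.dayofyear.map(clim).values
print(float(anom.mean()))
```

With only one or two prior years the climatology itself is noisy, which is the crux of the "how much data" question above.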

Discussion: River pathway along which we want to calculate salt front location

@amsnyder:

The code that @ted80810 wrote to fetch the COAWST model output pulls in the pathway along which we should calculate salt front location from a csv file of coordinates that was provided to us by @salme146. This pathway is close to the shore, rather than in the middle of the channel or somewhere else. I am creating this discussion to flag this and possibly have more discussion about if this is the best place to calculate salt front location.

Discussion: Thinking about driver time lags in understanding the salt front dynamics

@galengorski

We're interested in the general question, "What are the key drivers of the salt front location and (how) do those vary as a function of the salt front location?"

One approach is to bin the salt front time series into "river mile bins" based on changes in river geometry or bathymetry and look at the relationship between drivers and salt front location within each one of those bins.

[plot: salt front time series with river mile intervals]

This is a time series of the salt front location with the red lines indicating the locations of the bins. The bottom panels show the discharge and specific conductivity at Trenton and Schuylkill. These are the drivers we'll start with.
