usgs-r / drb-do-ml Goto Github PK

View Code? Open in Web Editor NEW

4.0 4.0 4.0 61.88 MB

Code repo for Delaware River Basin machine learning models that predict dissolved oxygen

License: Creative Commons Zero v1.0 Universal

R 69.29% Python 28.30% Dockerfile 0.92% Shell 0.13% Stan 1.36%

drb-do-ml's People

Contributors

Stargazers

Watchers

Forkers

lekoenig jsadler2 galengorski amcarter

drb-do-ml's Issues

examine hidden states patterns with fewer states

In our discussion on 2/16, the question came up of what would the hidden states look like if we did fewer states? That is, the analysis in #45 had 10 hidden states. What would this look like if we only had 5?

Check correlation between temperature and DO predictions

We are interested to see the correlation between model DO predictions and input air temperature in the Baseline model.

Add discharge as a predictor?

In one of our meetings, @amcarter suggested that discharge could be an important predictor of DO dynamics. Thus far this has not been included as an input variable.

Move to NHD network?

I would like for us to think seriously about moving from the PRMS to the NHD network.

Pros:

much finer spatial resolution
won't have to do a cross-walk from PRMS to NHD for the static attributes (#51); the work for this has already been done by @lekoenig, but it does add a bit of complexity
maybe more easily portable to another region
more widely used/available

Cons:

We'd need to gather our own met drivers on the NHD network
diverges from other projects like Inland Salinity

unusual data from site `01482537`

I'm getting unusual data from readNWISuv for site 01482537 (https://waterdata.usgs.gov/nj/nwis/inventory/?site_no=01482537)

head(dataRetrieval::readNWISuv(siteNumbers = "01482537",parameterCd="00300",startDate = "",endDate = "",tz = "UTC") %>% 
    dataRetrieval::renameNWISColumns(p00300="Value",p00095="Value"))
  agency_cd  site_no            dateTime at.0.5.ft.depth_Value_Inst
1      USGS 01482537 2008-08-06 17:45:00                       13.5
2      USGS 01482537 2008-08-06 18:00:00                       13.3
3      USGS 01482537 2008-08-06 18:15:00                       14.7
4      USGS 01482537 2008-08-06 18:30:00                       14.7
5      USGS 01482537 2008-08-06 18:45:00                       14.8
6      USGS 01482537 2008-08-06 19:00:00                       14.5
  at.0.5.ft.depth_Value_Inst_cd at.2.0.ft.depth_Value_Inst
1                             A                         NA
2                             A                        9.6
3                             A                       10.5
4                             A                       10.3
5                             A                       10.6
6                             A                       11.6
  at.2.0.ft.depth_Value_Inst_cd tz_cd
1                          <NA>   UTC
2                             A   UTC
3                             A   UTC
4                             A   UTC
5                             A   UTC
6                             A   UTC

I think it has something to do with the at.0.5.ft.depth_Value_Inst column names and that throws off some downstream processes.

For reference. When I do a different site (01472104) I get this:

head(dataRetrieval::readNWISuv(siteNumbers = "01472104",parameterCd="00300",startDate = "",endDate = "",tz = "UTC") %>%
     dataRetrieval::renameNWISColumns(p00300="Value",p00095="Value"))
  agency_cd  site_no            dateTime Value_Inst Value_Inst_cd tz_cd
1      USGS 01472104 2007-10-01 05:00:00        6.9             A   UTC
2      USGS 01472104 2007-10-01 06:00:00        6.7             A   UTC
3      USGS 01472104 2007-10-01 07:00:00        6.6             A   UTC
4      USGS 01472104 2007-10-01 08:00:00        6.5             A   UTC
5      USGS 01472104 2007-10-01 09:00:00        6.5             A   UTC
6      USGS 01472104 2007-10-01 10:00:00        6.5             A   UTC

Then the columns are Value_Inst.

Need a site list target with lat/lon

For the match-segments task (issue #16) we will need a site list target that contains the following information in tabular form:

- site name/identifier
- lat/lon
- datum specified
- data source I expect this variable to contain the following values: "WQP" (discrete samples); "daily" (nwis "dv" sites where instantaneous data is not available); and "instantaneous" (nwis "uv" sites)
- n_obs_days the number of observation-days; I suggest going with observation-days so that this metric is a better apples-to-apples comparison across sites. Some instantaneous sites report data every 5 minutes whereas others contain hourly data, so the absolute number of observations would be very different for multi-year deployments. I expect n_obs_days to be approximately equivalent to the number of total observations for discrete samples since multiple samples are usually not collected within the same day.

Anything else we would want here?

evaluate initial model

we will need to evaluate initial model once we have our predictions.

download stream network adjacency matrix

this data is on the temperature sb item: https://www.sciencebase.gov/catalog/item/5f6a289982ce38aaa2449135

Explicitly include GPP, ER, Gas exchange in model outputs

One possible approach to adding process guidance to the vanilla deep learning model is to explicitly represent the DO mass balance by outputting GPP, ER, and Gas exchange:

Baseline LSTM (#40):

Proposed:

set earliest date for data retrieval

similar to the inland salinity project, I would like to have an earliest_date argument so we aren't pulling data from before 1979-10-01

multiple DO values per timestep at one site

I was trying to visualize the DO data and I ran across this quirk. There are 8 time steps for site 01467200 that have two values each:

Create data summary plots in R

I'm actually kind of thinking we maybe should take out the Python plots too 😄. They kind of take a while to build and then without them we can get rid of those csv targets in 1_fetch.R. They were nice to have to get a feel for the data, but I'm not sure we need to keep them around. I could be pursuaded otherwise tho about both the python targets and the plots from r.

Maybe we do a happy medium and produce one summary plot with ggplot (to avoid the python rigmarole)

Originally posted by @jsadler2 in #32 (comment)

Evaluate the utility of adding more sites

In our baseline LSTM model, we filter for sites with at least 300 DO observation-days, resulting in 13 sites in the Lower DRB. We could think about whether it's worth adjusting this criteria so that we have more sites to work with. There's probably a tradeoff between having more information at a site versus having more sites but with potentially less information at any given site - but that also seems like a potentially interesting modeling question.

Some gut-check steps might include:

How many sites do we gain by changing the criteria to 200 days? 100 days?
Our static and dynamic features are aggregated at the scale of PRMS segments, so how many new (unique) PRMS segments do we gain by changing the criteria to 200 days? 100 days?

add functionality to make it easy to change target output locations

For the use case where we are working on Tallgrass/Caldera, it would be helpful to make the location where the targets are built easily changeable. That way we can have a central data store on a shared folder in Caldera. In addition to it being shared (which will be nice so that targets don't have to be rebuilt unnecessarily), it's better to have the actual data in Caldera as opposed to our home directories.

How does DO saturation work?

Exposing my ignorance here. How does DO saturation work?

In my googling, I found this nice monogram (https://dep.wv.gov/WWE/getinvolved/sos/Documents/SOSKit/DOSaturation.pdf):

That makes me think that for every temperature, there is one and only one saturation DO concentration. For example, I draw a line between 10 deg C and 100% and I get something like 11 mg/l. That is what I would answer is the DO saturation concentration for 10 deg C.

But ... when I look at the DO saturation values from here (https://www.sciencebase.gov/catalog/item/59eb9c0ae4b0026a55ffe389) and compare them to the water temperature values (also from that SB item), I get a range of "DO.sat" for each "water.temp":

Why is that? Is it because of differences in air pressure?

Initial meteorological driver inventory

Need to inventory the initial driver data available in the DRB

Comparison of hidden states to LAI

At our last meeting (2/15) we discussed comparing hidden state 3 to the LAI of the different sites.

Here, four focal sites for which LAI was available, we observed that H3 was similar for the first three, but quite different for the coastal plain site 01466500 (lower right)

The LAI for these four sites does not explain this difference.

The coastal plain site (upper left in this figure) does have the highest summer LAI of the four, but it is highly seasonal, not explaining the constant H3 state above

baseline LSTM model

need to implement and test the baseline LSTM model.

train/val/test splits

We will need to come up with splits for train/validation/test splits.

add dockerfile

for portability (to HPC and across team members), we should have a dockerfile that we can use to make our environment.

Create map of study domain

We'll need a map of the lower DRB for the paper, presentations

why is 01481500 "tidal"?

@jsadler2, do you remember how you determined 01481500 is tidal?

Originally posted by @lekoenig in #40 (comment)

visualize model results with rmarkdown dashboard

We need some code to visualize the model results. I think it would be pretty nice to have a little rmarkdown dashboard template that we could just give the model name to and it would render the dashboard with plots and summary information.

Gather sntemp/prms segment/catchment attributes

For a first crack, I think we should try the model with just the sntemp/prms segment/catchment data. Then we can add NHD/streamcat data later on. So we will need to "gather" this data.

I'm not sure exactly how best to do this. It's part of the delaware_model_prep pipeline, but

it's not fully integrated into that pipeline yet
i don't know the best way to leverage that. Maybe we just say: "this is where we got it" and leave it at that, trusting the other pipeline.

Why is 01466500 so different?

Site 01466500 is different. Not bad. Just different.

https://waterdata.usgs.gov/monitoring-location/01466500

This came to our attention when we looked at the hidden states (#52 (comment)).

Initial Lower DRB DO data inventory

We need to know how much DO data there is available in the lower DRB

Look into warnings associated with new version of dplyr

I'm getting some warnings after updating some packages within tidyverse (now running dplyr_1.0.8 and tibble_3.1.6):

p2_sites_w_segs: Using across in filter is deprecated, use if_any or if_all
p1_daily_data_700019ef: The .data argument of add_column must be a data frame as of tibble 2.1.1.90mThis warning is displayed once every 8 hours.

Look into these and decide whether any changes to the code are needed.

gather stream temperature data

The model proposed in #56 includes stream temperature so that we can calculate the DO saturation level. So we will need to gather water temp data too.

match point observations to stream segments

we will need to match to the point observations to the stream segments. Once again, the temperature project has blazed the trail here, so we can look to and hopefully be able to reuse their code.

reproduce trained model predictions using targets pipeline

we need some code in the targets pipeline to reproduce model predictions.

this code will need to do a few things:

generate the prepped.npz file using the prep_all_data function from river_dl
loop through the directories and find all of the trained_weights subdirectories
use the predict functions from river_dl to make predictions from the model

Gather additional static input variables for baseline LSTM v2

We're interested in adding input variables to improve from baseline. From discussion #44, the running list of new input variables includes:

Dynamic

~~daily water temperature from PRMS (seg_tave_water in p1_prms_met_data)~~ Holding off on water temperature for now
daily min/max air temperature in addition to the mean (get from gridmet primary climate variables)
daily vapor pressure deficit (get from gridmet derived variables); alternatively, min/max relative humidity if VPD not included in output from drb_gridmet_tools
daily solar radiation (either use seginc_swrad from PRMS outputs, or sw-rad from gridmet primary climate variables)
~~daily potential evapotranspiration, PET (from seg_potet in PRMS outputs)~~ Holding off on this for now since we're now working at the NHD-scale

Static

soil characteristics from STATSGO, including information about texture (percent clay, sand, etc.) and conductivity (e.g. permeability, rainfall-runoff factor, soil thickness, bulk density (~porosity), depth to seasonally high water table, and average water capacity). Get from Wieczorek dataset referenced to NHDPlusv2.
mean daily or annual precipitation (get from StreamCat?) pulling watershed mean annual precip (mm) based on 800m PRISM data (1971-2000) from Wieczorek dataset.
Percent impervious surface within local catchment ~~/HRU~~, NLCD (get from Wieczorek dataset referenced to NHDPlusv2)
Percent forest within local catchment ~~/HRU~~ , NLCD (get from Wieczorek dataset referenced to NHDplusv2)

We need to:

gather each of the above input variables ~~aggregated at the scale of the PRMS segment~~

implement DO model with explicit metabolism representation

Implement the changes proposed here #47 (comment) to explicitly represent metabolism within the DO model.

adjust train/val/test times to accomodate metabolism data

In the baseline model, our partitions have been as follows:

partition	start	end
train	1980-01-01	2017-01-01
validation	2017-01-01	2019-01-01
test	2019-01-01	2022-01-01

These will have to change because the metabolism data only goes from Fall 2007 to Fall 2016 (#57)

Filter daily metabolism estimates based on model diagnostics

The LSTM-metab model (see #47) will incorporate daily values of GPP, ER, and K600 when evaluating model performance. All daily values aren't created equal, however, so we should assess whether we want to filter those data that are used in our model.

The Appling et al. data release contains a separate item called "diagnostics" (https://www.sciencebase.gov/catalog/item/59eb9bafe4b0026a55ffe382) that includes a variable, model_confidence, that provides an automated model assessment:

Metabolism was estimated using the streamMetabolizer package (Appling et al. 2017, 2018), then key diagnostics and assessments were derived from the model outputs. Raw outputs and additional diagnostics are available in the "fit" output files included in this data release.
A final automated model assessment of High (H), Medium (M), or Low (L) overall confidence in each model was computed from these key diagnostics using a decision tree: Models were assigned Low confidence if K600_daily_sigma_Rhat > 1.2, err_proc_iid_sigma_Rhat > 1.2, K_range > 50, neg_GPP > 50, and/or pos_ER > 50. Models were assigned Medium confidence if they were not assigned Low confidence, and K_range > 15, neg_GPP, and/or pos_ER > 25. Models were assigned High confidence if they were not assigned Low or Medium confidence.

We should join this diagnostics to our p1_metab target and use the diagnostic value to filter the metabolism data.

zarr datatype error

I'm getting this datatype error when trying to write to zarr

ValueError: unable to infer dtype on variable None; xarray cannot serialize arbitrary Python objects

I have these package versions:
xarray 2022.3.0
zarr 2.11.1

commit model weights

to ensure reprodicibility, we will start committing model weights to the repo. these are quite small (~14kb).

gather sntemp/prms meteorological data

this should be pretty simple. just need to pull the file from the Temperature SB item: https://www.sciencebase.gov/catalog/item/5f6a289982ce38aaa2449135 [wrong link earlier]

subset sntemp/prms network to just "Lower Delaware"

We will need to subset the full DRB sntemp/prms network to just the "Lower Delaware."

Error when building p1_inst_data

I encountered an error while building the target p1_inst_data that stops the pipeline. The error ( object 'Value_Inst_cd' not found) is due to the fact that NWIS site 01467042 is within the list of sites in p1_nwis_sites_inst but no available data overlap the download period. Looking at the NWIS page, it does appear that the instantaneous DO data are only available starting March 2022. Maybe they removed some data from NWIS?

We'll need to add an argument to the filter steps in p1_nwis_sites_inst so that a site is only retained if the inst dates begin before dummy_date.

gather metabolism data

To explicitly represent metabolism in the model (i.e., to do #56) we will need to gather metabolism data: GPP, ER, and K. Metabolism estimates for 6 sites in our study area (7 if we add back 01481500 (see #50)) are included in the Appling et. al data release. Included in those are our 2 validation sites.

Matching to segments question

From @lekoenig

Is 0.1 the default search_radius in nhdplusTools? Effectively, this nearest neighbor function will search within that radius (~10 km, I think) and if any segments are returned, we just want the closest one (e.g. maybe that line is 5 m or 5 km away). One thing I hadn't considered before is that with this approach, we may be incorrectly matching samples taken in small, headwater streams to larger river segments, which are the only ones represented by the PRMS lines. Do you think that's true?

It looks like 50% of sites have offsets at or below 0.005 (~0.5 km, I think), but some can be higher, up to our max search_radius

Error when building p3_well_observed_site_data_json

When I try to run tar_make to build the pipeline I'm getting an error associated with p3_well_observed_site_data_json:

* start target p3_well_observed_site_data_json
GeoJSON' layer not supported by driver `
Deleting layer `well_observed_trn_val_test' failed
Writing layer `well_observed_trn_val_test' to data source 
  `3_visualize/out/well_observed_trn_val_test.geojson' using driver `GeoJSON'
Updating existing layer well_observed_trn_val_test
Writing 14 features with 12 fields and geometry type Point.
Unknown field name `bird_dist_to_subseg_m': updating a layer with improper field name(s)?
x error target p3_well_observed_site_data_json

Initial catchment attribute data

We need to catalog which catchment attributes we have available in the DRB

Create function to estimate daily (normalized) light

For the process guidance equations for DO-max outlined in #47, we want some estimate of daily light to include in the model inputs. Two options we've discussed:

write a wrapper function that uses calc_light() in streamMetabolizer to estimate a single value of Lt/L-sum for each day. This value represents the daily maximum light relative to the daily cumulative light.
either estimate or grab the estimates of day.length from the Appling data release and use the ratio of day length / 24 hours as a proxy for the ratio of max light to the daily sum.

aggregate inst. data to daily summaries

need to aggregate inst. data to daily summaries

Generic function for downloading sb data

It would be good to have a more generic function to download data from science base. I'm thinking that the arguments for this function will be sb_id, filename, out_dir.

Create target for "medium_observed" sites

Following up on #49, we'll need a target that reflects "medium-observed" sites (i.e., sites with at least 100 observation-days) to assess the impact of lowering our threshold for including a site in the LSTM model.

Inspect hidden states for process understanding

An idea that has come up is to inspect the hidden states to see if they are behaving as we would expect some state or flux in the process would behave.

The two examples that have come up are:

a biomass: some representation of a biomass that accumulates in the summer, is lower in the winter, and maybe has a sharp decrease after a scouring event
discharge: we aren't giving the baseline model discharge as an input. Does the model have a discharge-like hidden state?

We can look into answering these questions with the baseline LSTM model (#40).

make `omit_nwis_sites` a target

Currently the omit_nwis_sites is a character vector that is defined in _targets.R. The problem with this is that if/when we update this vector, targets doesn't recognize that a "dependency" has changed. That is because it's not a targets::tar_target. So I think it would be best to define it as it's own target. This is how Jake & Co. approached similar situations in the temp forecasting pipelines.

cc @jds485 - we'll need to think about this for the salinity pipeline too

create a base snakefile and import as a module

to avoid copy/pasting snakefile contents for each experiment, we should use the module capability of snakemake