usgs-r / 2wp-temp-observations
A pipeline to assemble water temperature data for the 2WP water temperature modeling
Recently, we had to change our national WQP inventory to iterate through states. We weren't totally satisfied with this approach (see Issue #22). Jim Kreft mentioned we can now query by year, so I'd like to try that approach.
I think this is a relatively small code change. The state iteration happens here, so it would need to be replaced with a "year" iteration. The heart of the iteration through states happens here.
I think the steps are:
- Decide which years to query. Ultimately that should cover the full record that feeds the 5_data_munge/out/wqp_daily_nodepths.rds file, but for experimentation, this could be a few recent years.
- Experiment with dataRetrieval::whatWQPdata (see the sketch below). How long does it take to query for sites by a year? Can we query by multiple years? Does this query return sites without state names, that is, does it solve the problem in #22?

Input of stationId, lat, long required; output would then include a reachId.
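To experiment with the year-based inventory, here is a minimal sketch (this assumes whatWQPdata passes startDateLo/startDateHi through to the WQP query the same way readWQPdata does; it is not pipeline code):

library(dataRetrieval)

# time a one-year inventory call and inspect what comes back
system.time(
  sites_2019 <- whatWQPdata(characteristicName = c("Temperature, water", "Temperature"),
                            startDateLo = "2019-01-01",
                            startDateHi = "2019-12-31")
)
# does sites_2019 include sites with missing state information, i.e., does it solve #22?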
We currently only pull through HUC 21, which misses some territories in HUC 22 (see Julie's description and solution here).
Seems this variable is missing. Use download_location <- tempfile(pattern = ".zip") instead?
Because the temperature records are from many organizations, methods, etc., we expect that there will be quality issues in a minority of the records. These issues might include but are not limited to:
This is challenging to do at a national scale because it is hard to generalize what the temperature should be at any given point and time. Sites that are further south will be warmer, but things like reservoirs and ground water inputs can make the temperature look different than you would expect (e.g., cooler temperatures in summer). The goal is to flag the egregious values.
I noticed that we are pulling far fewer temperature sites than are in the inventory for WQP. Why is this? Some known reasons:
We should create a comparison in the pipeline for review, and should follow up by investigating some sites (particularly if there are sites where the inventory says there is a lot of data, but we don't get any back). This isn't new to the recent pulls (see this PR).
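A minimal sketch of what that comparison could look like (the inventory file path and object names here are hypothetical, and it assumes the inventory keeps whatWQPdata's resultCount column):

library(dplyr)

inventory <- readRDS('1_wqp_pull/out/wqp_inventory.rds')  # hypothetical inventory file
pulled <- readRDS('1_wqp_pull/out/wqp_data.rds')

comparison <- inventory %>%
  select(MonitoringLocationIdentifier, resultCount) %>%
  left_join(count(pulled, MonitoringLocationIdentifier, name = 'n_pulled'),
            by = 'MonitoringLocationIdentifier') %>%
  mutate(n_pulled = coalesce(n_pulled, 0L))

# sites where the inventory says there is a lot of data, but we got nothing back
filter(comparison, resultCount > 100, n_pulled == 0)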
Was just looking at the code to steal a solution I knew was in here and noticed that temperature depths are filtered to just those > 0 here.
Maybe I am missing part of the picture, but it seems this would rule out exact surface temperatures (depth = 0), of which there are many.
Some outliers in the Eco-SHEDS data appear to be in Fahrenheit, not Celsius, as the metadata suggests. Should we convert to Celsius or just drop these values, because we'll never be sure?
# read in data and look at egregious values (>= 45 deg C)
library(dplyr)
library(ggplot2)

dat <- readRDS('4_other_sources/out/ecosheds_data.rds')
high <- filter(dat, mean >= 45)
high_sites <- filter(dat, location_id %in% unique(high$location_id))
# note each series (which is series_id and color in the plot) is considered
# a single continuous time series of observed temperatures (i.e. a deployment)
ggplot(high_sites, aes(x = date, y = mean)) +
geom_point(size = 0.5, alpha = 0.5, aes(color = factor(series_id))) +
facet_wrap(~location_id, scales = 'free_x')
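If we decide to convert rather than drop, one crude heuristic (a sketch only, not anything implemented): treat any series that contains values >= 45 deg C as Fahrenheit and convert the whole series, since we can't confirm units from the metadata.

# convert whole suspect series (series containing any value >= 45 deg C) from F to C
f_series <- unique(high$series_id)
dat <- mutate(dat, mean = ifelse(series_id %in% f_series, (mean - 32) * 5 / 9, mean))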
Sam and I discussed the different approaches to QAQC daily temperature data. We inspected the sites with the most outliers using the first QAQC attempt (grouping by latitude bins and month). Below is a list of sites that appear to be outliers relative to other sites. We investigated these sites, and the data appear to be real (not bad data). These particular sites are usually warmer (especially in winter months) since these sites seem to be impacted by thermal vents/springs.
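For reference, a hedged sketch of that first QAQC attempt (latitude bins + month); the input data frame and column names (daily_temps, latitude, date, temp_c) and the 3-standard-deviation cutoff are placeholders, not the pipeline's actual code:

library(dplyr)
library(lubridate)

flagged <- daily_temps %>%
  mutate(lat_bin = cut(latitude, breaks = seq(15, 75, by = 5)),
         mon = month(date)) %>%
  group_by(lat_bin, mon) %>%
  mutate(grp_mean = mean(temp_c, na.rm = TRUE),
         grp_sd = sd(temp_c, na.rm = TRUE),
         outlier = abs(temp_c - grp_mean) > 3 * grp_sd) %>%
  ungroup()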
Some ecosheds sites have multiple sensors, which I believe is being logged in the field "series_id". We should retain this as "sub_location" in the final data.
Certain characters/spaces in either the organization or siteid cause calls to WQP to fail. This resulted in many (50ish) targets not building. If we can identify patterns in these issues, we can create a filter on the WQP inventory step that will help limit these issues.
Some sites/patterns that I've been able to narrow in on that fail:
COE/ISU (several sites) -- the forward slash in the organization ID may be the problem
ALABAMACOUSHATTATRIBE.TX_WQX (several sites) -- the period in the organization ID may be the problem
RCE WRP and SAN PASQUAL -- the space in the organization ID may be the problem

Pull timing, for reference:
1_wqp_pull/out/wqp_data.rds.ind - most pulls (of 395 pulls) take 1.5-4 minutes. Full pull time ~19.5 hours + 15 minutes to bind and write data.
1_nwis_pull/out/nwis_dv_data.rds.ind - most pulls (of 36 pulls) take ~10 minutes; total pull time was ~5 hours + minimal bind and write time (~1 minute).
1_nwis_pull/out/nwis_uv_data.rds.ind - most pulls (of 230 pulls) take 1-2 minutes; total pull time was ~5 hours + minimal bind and write time (~2 minutes).

The master branch of this repository will soon be renamed from master to main, as part of a coordinated change across the USGS-R and USGS-VIZLAB organizations. This is part of a broader effort across the git community to use more inclusive language. For instance, git, GitHub, and GitLab have all changed or are in the process of changing their default branch name.
We will make this change early in the week of February 6, 2022. The purpose of this issue is to give notification of the change and provide information on how to make it go smoothly.
When this change is made, forks and clones of the repository will need to be updated as well. Here are instructions for making the necessary changes.
Rename the default branch of your fork from master to main (<your repository> -> Settings -> Branches).

First, update your local git configuration so that new repositories created locally will have the correct default branch: git config --global init.defaultBranch main

Now, for any forks, once the default branch has been renamed from master to main, run:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a
Like in #5, the latest pull is revealing new site IDs that have bad MonitoringLocationIdentifiers that make the calls to WQP fail. In this case, it's TOHONOO'ODHAMNATION, with the tick mark (apostrophe) in the name causing the failure.
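If we can spot these patterns up front, one option (a sketch only, not the implemented fix) is to screen the inventory before building partitions; inventory here is a hypothetical data frame of whatWQPdata results, and the character set would need review so we don't drop valid sites:

library(dplyr)

bad_id_pattern <- "[ /'.]"  # spaces, slashes, apostrophes, and periods have all broken calls
problem_sites <- filter(inventory, grepl(bad_id_pattern, MonitoringLocationIdentifier))
clean_sites <- filter(inventory, !grepl(bad_id_pattern, MonitoringLocationIdentifier))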
Instead of specifying site types, should we filter out the ones we know we aren't going to use? We can always do downstream filtering to remove data that doesn't meet our needs.
e.g.
library(dataRetrieval)
library(dplyr)

sites <- whatWQPdata(characteristicName = c("Temperature, sample",
                                            "Temperature, water",
                                            "Temperature",
                                            "Temperature, water, deg F")) %>%
filter(!ResolvedMonitoringLocationTypeName %in% c("Aggregate groundwater use",
"Well",
"Estuary",
"Ocean",
"Subsurface",
"Aggregate groundwater use",
"Atmosphere",
"Aggregate groundwater use "))
(but parameterized like you have in the config)
I'm trying to create a concise crosswalk between sites and what reach they were matched to. In your plotting examples, you use the column seg_id_reassigned
to plot the matched segments, so I assume this is the "matched" column. While doing this, I realized there are many duplicated site ids in the file 6_network/out/site_flowlines.rds.ind
, which I think means a site was matched to multiple reaches. I think this has to do with the part of the algorithm that looks to see if the site is closer to the endpoint of the upstream reach, but I'm not sure how to resolve the data to a single matched reach per site id.
library(scipiper)
library(dplyr)
library(ggplot2)
# id is a numeric ID that was created and is 1:nrow(sites)
matched <- readRDS(sc_retrieve('6_network/out/site_flowlines.rds.ind', 'getters.yml')) %>%
  group_by(id) %>%
  mutate(n = n())
# how many sites have more than one match?
length(unique(matched$id[matched$n > 1]))
# [1] 6935
# look at a site that was matched four times (site with the most matches)
top <- filter(matched, n == 4)
# plot the four reaches that were matched to this site, plus the original reach match
# original match
original <- filter(matched, seg_id_reassign %in% '31347') %>% distinct(Shape)
ggplot() +
geom_sf(data = top$Shape, aes(color = factor(top$seg_id_reassign))) +
geom_sf(data = top$Shape_site, color = 'red') +
geom_sf(data = original$Shape, color = 'black')
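One possible way to collapse to a single matched reach per site id, assuming the match table carries a distance-to-reach column (offset_dist is a hypothetical column name; the real one may differ), would be to keep the closest reach:

resolved <- matched %>%
  group_by(id) %>%
  slice_min(offset_dist, n = 1, with_ties = FALSE) %>%
  ungroup()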
Some uv sites have so much data that the call to NWIS takes a very long time, but it does not fail. For example, in the latest run, partition 240 ran overnight (tried two times) and still was not successful. Wonder if there's a way to time a call out after XX minutes or something? The problem is that the call to whatNWISdata does not give an accurate representation of what data are available for a given site.
Here is partition 240, and site 405356112205601 has high-frequency data at multiple depths. Most sites that cause problems are high-frequency lake sites with multiple depths. This site is Great Salt Lake. I dropped that site number from the partition and completed the partition 240 download outside of loop_tasks, but a better solution should be implemented. The easiest solution would be to filter this site.
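One option for capping a long call (a sketch, not something tested here): wrap the NWIS call in R.utils::withTimeout(), which interrupts the expression after a set time (though it can't interrupt code stuck in some non-interruptible C routines). The 30-minute limit and the partition_sites/start_date/end_date objects below are placeholders:

library(R.utils)

uv_dat <- tryCatch(
  withTimeout(
    dataRetrieval::readNWISuv(siteNumbers = partition_sites, parameterCd = '00010',
                              startDate = start_date, endDate = end_date),
    timeout = 30 * 60, onTimeout = "error"),
  TimeoutException = function(e) NULL)  # give up on this partition and move on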
The daily temperature data has 468,887 NA dates. These are the dates associated with the EcoSHEDS source.
While developing the national-flow-observations pipeline, Lindsay made some changes to the way the date is used in creating the pull task ID, which functions are used to pull the data from NWIS, and which dates are used in the pulls from NWIS. I think these changes are safer in terms of avoiding unnecessary rebuilds, and are more fool-proof, since I myself have forgotten to change the end date when I wanted to pull the most recent data.
In the flow pipeline, these are the major changes I want to implement:
to convey the density of data available by reach
Recently, we had to make a change to how we are inventorying WQP data to use in our partitions. Instead of a national call to whatWQPdata (which used to work, but now throws a 504), we are looping through states and asking whatWQPdata in each state. Sometimes we have to split the state in cases where the state has too much data (e.g., Florida).
This solution works great for sites that have a state label. Some do not. In some instances, these sites are out of the country and should not be included in our dataset. In other instances, the sites are missing the state label in error and should be included in our dataset. This amounts to thousands of sites, with more than a million records.
The real problem is that we wouldn't otherwise know we are missing these sites; I only know they're missing because I've archived old pulls and made comparisons.
Possible solutions I have thought of:
We pull this characteristic, but I think it is the temperature of a water grab sample. Note really cold temperatures throughout the year:
library(dplyr)
library(ggplot2)

sample_data <- readRDS('1_wqp_pull/out/wqp_data.rds') %>%
  filter(CharacteristicName %in% 'Temperature, sample') %>%
  mutate(doy = lubridate::yday(ActivityStartDate))

ggplot(sample_data, aes(x = doy, y = ResultMeasureValue)) +
  geom_point()
For now, I will remove these after the WQP pull (in munge_wqp_files.R), but we should remove this characteristic from the WQP pull params.
@limnoliver had a few questions after viewing approach II (issue #26) results:
Currently we choose the column (or sensor) with the most data, but some hints from the DRB pipeline suggest that we should check the output of this exercise. See the code below from the DRB NGWOS pull:
library(dplyr)

# retrieve remaining sites from NWISuv
# (missing_sites is a vector of site numbers defined earlier in the DRB NGWOS pull)
new_ngwos_uv <- dataRetrieval::readNWISuv(siteNumbers = missing_sites, parameterCd = '00010')
uv_long <- select(new_ngwos_uv, site_no, dateTime, ends_with('00010_00000')) %>%
tidyr::gather(key = 'temp_column', value = 'temp_c', - site_no, -dateTime)
uv_site_col <- filter(uv_long, !is.na(temp_c)) %>%
group_by(site_no, temp_column) %>%
summarize(n_vals = n(),
n_dates = length(unique(as.Date(dateTime)))) %>%
filter(!grepl('piezometer', temp_column, ignore.case = TRUE))
# always choose the standard temp column. In cases where that is missing, choose the one on that day
# with the most data
# first take day-temp type means
uv_long_dailies <- filter(uv_long, !is.na(temp_c)) %>%
filter(!grepl('piezometer', temp_column, ignore.case = TRUE)) %>%
group_by(site_no, date = as.Date(dateTime), temp_column) %>%
summarize(temp_c = mean(temp_c),
n_obs = n()) %>%
left_join(select(uv_site_col, site_no, temp_column, n_dates))
# find the temperature for each site-day
# first choose standard temp column, then choose one with most data when available
uv_dat <- uv_long_dailies %>%
group_by(site_no, date) %>%
summarize(temp_c = ifelse(grepl('X_00010_00000', paste0(temp_column, collapse = ', ')),
temp_c[which(temp_column %in% 'X_00010_00000')], temp_c[which.max(n_dates)]),
temp_column = ifelse(grepl('X_00010_00000', paste0(temp_column, collapse = ', ')),
'X_00010_00000', temp_column[which.max(n_dates)]),
n_obs = ifelse(grepl('X_00010_00000', paste0(temp_column, collapse = ', ')),
n_obs[which(temp_column %in% 'X_00010_00000')], n_obs[which.max(n_dates)])) %>%
mutate(source = 'nwis_uv') %>%
select(site_id = site_no, date, temp_c, n_obs, source)
Right now these data are not included in the final combined dataset. I don't totally recall why, but I think it was related to some of the records potentially being lake sites, or not knowing how we should handle multiple depths. One thing we could consider is to treat a depth as a "sub_location", similar to how sensors at different depths are handled in NWIS.
Similar to the solution implemented in delaware-model-prep, move all targets built with get statements to getters.yml. I started this process when adding new data in 4_other_sources, but did not do it for the whole pipeline.
To align with the version of scipiper we are using in the temperature-prep repo.
682466f - similar to what we're doing in the NWIS pull, use .qs files for temporary files in WQP pull to speed up read/write.
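For reference, the .qs swap amounts to something like this (partition_data and tmp_file are placeholder names; qsave/qread are from the qs package):

qs::qsave(partition_data, tmp_file)    # instead of saveRDS(partition_data, tmp_file)
partition_data <- qs::qread(tmp_file)  # instead of readRDS(tmp_file)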
See #15. Many sites (mostly/entirely from WQP?) have really strange lat-lon coordinates.
Currently, the workflow creates a loop to partition the calls to whatWQPdata to limit the size of each call; the partition size is set to 1000 sites. Each call takes ~1 minute, so with ~316k sites with temperature data, the whole inventory takes ~5 hours.
If the process fails at any point, all progress is lost.
So, it seems like we need to write intermediate inventory files. I think I can just write these locally to a tmp folder and git ignore both the files and indicator files.
@aappling-usgs or @jread-usgs - off the top of your head, do you know of other places we've solved this (before I dive in)?
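A rough sketch of the tmp-folder idea (the partitions object and the get_partition_inventory() helper are hypothetical stand-ins for however the partitioned whatWQPdata calls are currently made):

dir.create('tmp/inventory', recursive = TRUE, showWarnings = FALSE)

inventory_list <- lapply(seq_along(partitions), function(i) {
  out_file <- sprintf('tmp/inventory/partition_%03d.rds', i)
  if (!file.exists(out_file)) {
    # only query WQP for partitions we haven't already inventoried
    part_inv <- get_partition_inventory(partitions[[i]])
    saveRDS(part_inv, out_file)
  }
  readRDS(out_file)
})
inventory <- dplyr::bind_rows(inventory_list)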
From #23:
- UV data munge was causing memory failures in R. My solution was to reduce down to daily mean values in the combine step, so that the raw data is not preserved in the shared cache. I think this approach is ok, since we have a reproducible pipeline + are not using the raw data.
I agree that this is OK, but I also want to document David's suggestion from standup that this pull could be done on one of the USGS clusters. This could potentially solve two problems: (1) the UV data munge probably(?) won't cause memory failures if the available memory is larger, and (2) in theory, though we've not yet tried this, having the data pull on a cluster would allow multiple people access to the raw data pull without going through the shared cache. Given the unknowns with each of these objectives, I'm not pushing hard for this switch, but let's at least keep it on the table.
If we did this, I think we'd do the pulling on a data transfer node (to be good cluster citizens) and then switch over to a login node -> SLURM-allocated job to get the larger memory needed to process the data.
I am reviewing the workflow of this repo for another project and I noticed that there seems to be a bug in wqp_partition_config.yml: the note says the max WQP pull size is 25,000 but the code says 250,000. Not sure if this is intentional, but it seemed worth drawing attention to in case it's a typo.
In WQP munge, I use a case_when statement. Alison pointed out that this could be greatly simplified.