usgs-r / 2wp-temp-observations
A pipeline to assemble water temperature data for the 2WP water temperature modeling
Recently, we had to change our national WQP inventory to iterate through states. We weren't totally satisfied with this approach (see Issue #22). Jim Kreft mentioned we can now query by year, so I'd like to try that approach.
I think this is a relatively small code change. The state iteration happens here, so it would need to be replaced with a "year" iteration. The heart of the iteration through states happens here.
I think the steps are:
- Decide which years to query. Ultimately that should cover the full record that feeds the 5_data_munge/out/wqp_daily_nodepths.rds file, but for experimentation, this could be a few recent years.
- Experiment with dataRetrieval::whatWQPdata (see the sketch below). How long does it take to query for sites by a year? Can we query by multiple years? Does this query return sites without state names, that is, does it solve the problem in #22?

Input of stationId, lat, long required; output would then include a reachId.
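To experiment with the year-based inventory, here is a minimal sketch (this assumes whatWQPdata passes startDateLo/startDateHi through to the WQP query the same way readWQPdata does; it is not pipeline code):

library(dataRetrieval)

# time a one-year inventory call and inspect what comes back
system.time(
  sites_2019 <- whatWQPdata(characteristicName = c("Temperature, water", "Temperature"),
                            startDateLo = "2019-01-01",
                            startDateHi = "2019-12-31")
)
# does sites_2019 include sites with missing state information, i.e., does it solve #22?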
We currently only pull through HUC 21, which misses some territories in HUC 22 (see Julie's description and solution here).
Seems this variable is missing. Use download_location <- tempfile(pattern = ".zip") instead?
Because the temperature records are from many organizations, methods, etc., we expect that there will be quality issues in a minority of the records. These issues might include but are not limited to:
This is challenging to do at a national scale because it is hard to generalize what the temperature should be at any given point and time. Sites that are further south will be warmer, but things like reservoirs and ground water inputs can make the temperature look different than you would expect (e.g., cooler temperatures in summer). The goal is to flag the egregious values.
I noticed that we are pulling far fewer temperature sites than are in the inventory for WQP. Why is this? Some known reasons:
We should create a comparison in the pipeline for review, and should follow up by investigating some sites (particularly if there are sites where the inventory says there is a lot of data, but we don't get any back). This isn't new to the recent pulls (see this PR).
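A minimal sketch of what that comparison could look like (the inventory file path and object names here are hypothetical, and it assumes the inventory keeps whatWQPdata's resultCount column):

library(dplyr)

inventory <- readRDS('1_wqp_pull/out/wqp_inventory.rds')  # hypothetical inventory file
pulled <- readRDS('1_wqp_pull/out/wqp_data.rds')

comparison <- inventory %>%
  select(MonitoringLocationIdentifier, resultCount) %>%
  left_join(count(pulled, MonitoringLocationIdentifier, name = 'n_pulled'),
            by = 'MonitoringLocationIdentifier') %>%
  mutate(n_pulled = coalesce(n_pulled, 0L))

# sites where the inventory says there is a lot of data, but we got nothing back
filter(comparison, resultCount > 100, n_pulled == 0)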
Was just looking at the code to steal a solution I knew was in here and noticed that temperature depths are filtered to just those > 0 here.
Maybe I am missing part of the picture, but it seems this would rule out exact surface temperatures (depth = 0), of which there are many.
Some outliers in the Eco-SHEDS data appear to be in Fahrenheit, not Celsius, as the metadata suggests. Should we convert to Celsius or just drop these values, because we'll never be sure?
# read in data and look at egregious values (>= 45 deg C)
library(dplyr)
library(ggplot2)

dat <- readRDS('4_other_sources/out/ecosheds_data.rds')
high <- filter(dat, mean >= 45)
high_sites <- filter(dat, location_id %in% unique(high$location_id))
# note each series (which is series_id and color in the plot) is considered
# a single continuous time series of observed temperatures (i.e. a deployment)
ggplot(high_sites, aes(x = date, y = mean)) +
geom_point(size = 0.5, alpha = 0.5, aes(color = factor(series_id))) +
facet_wrap(~location_id, scales = 'free_x')
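If we decide to convert rather than drop, one crude heuristic (a sketch only, not anything implemented): treat any series that contains values >= 45 deg C as Fahrenheit and convert the whole series, since we can't confirm units from the metadata.

# convert whole suspect series (series containing any value >= 45 deg C) from F to C
f_series <- unique(high$series_id)
dat <- mutate(dat, mean = ifelse(series_id %in% f_series, (mean - 32) * 5 / 9, mean))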
Sam and I discussed the different approaches to QAQC daily temperature data. We inspected the sites with the most outliers using the first QAQC attempt (grouping by latitude bins and month). Below is a list of sites that appear to be outliers relative to other sites. We investigated these sites, and the data appear to be real (not bad data). These particular sites are usually warmer (especially in winter months) since these sites seem to be impacted by thermal vents/springs.
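For reference, a hedged sketch of that first QAQC attempt (latitude bins + month); the input data frame and column names (daily_temps, latitude, date, temp_c) and the 3-standard-deviation cutoff are placeholders, not the pipeline's actual code:

library(dplyr)
library(lubridate)

flagged <- daily_temps %>%
  mutate(lat_bin = cut(latitude, breaks = seq(15, 75, by = 5)),
         mon = month(date)) %>%
  group_by(lat_bin, mon) %>%
  mutate(grp_mean = mean(temp_c, na.rm = TRUE),
         grp_sd = sd(temp_c, na.rm = TRUE),
         outlier = abs(temp_c - grp_mean) > 3 * grp_sd) %>%
  ungroup()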
Some ecosheds sites have multiple sensors, which I believe is being logged in the field "series_id". We should retain this as "sub_location" in the final data.
Certain characters/spaces in either the organization or siteid cause calls to WQP to fail. This resulted in many (50ish) targets not building. If we can identify patterns in these issues, we can create a filter on the WQP inventory step that will help limit these issues.
Some sites/patterns that I've been able to narrow in on that fail:
COE/ISU (several sites) -- the forward slash in the organization ID may be the problem
ALABAMACOUSHATTATRIBE.TX_WQX (several sites) -- the period in the organization ID may be the problem
RCE WRP and SAN PASQUAL -- the space in the organization ID may be the problem

Pull timing, for reference:
1_wqp_pull/out/wqp_data.rds.ind - most pulls (of 395 pulls) take 1.5-4 minutes. Full pull time ~19.5 hours + 15 minutes to bind and write data.
1_nwis_pull/out/nwis_dv_data.rds.ind - most pulls (of 36 pulls) take ~10 minutes; total pull time was ~5 hours + minimal bind and write time (~1 minute).
1_nwis_pull/out/nwis_uv_data.rds.ind - most pulls (of 230 pulls) take 1-2 minutes; total pull time was ~5 hours + minimal bind and write time (~2 minutes).

The master branch of this repository will soon be renamed from master to main, as part of a coordinated change across the USGS-R and USGS-VIZLAB organizations. This is part of a broader effort across the git community to use more inclusive language. For instance, git, GitHub, and GitLab have all changed or are in the process of changing their default branch name.
We will make this change early in the week of February 6, 2022. The purpose of this issue is to give notification of the change and provide information on how to make it go smoothly.
When this change is made, forks and clones of the repository will need to be updated as well. Here are instructions for making the necessary changes.
Rename the default branch of your fork from master to main (<your repository> -> Settings -> Branches).

First, update your local git configuration so that new repositories created locally will have the correct default branch: git config --global init.defaultBranch main

Now, for any forks, once the default branch has been renamed from master to main, run:

git branch -m master main
git fetch origin
git branch -u origin/main main
git remote set-head origin -a
Like in #5, the latest pull is revealing new site IDs that have bad MonitoringLocationIdentifiers that make the calls to WQP fail. In this case, it's TOHONOO'ODHAMNATION, with the tick mark (apostrophe) in the name causing the failure.
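If we can spot these patterns up front, one option (a sketch only, not the implemented fix) is to screen the inventory before building partitions; inventory here is a hypothetical data frame of whatWQPdata results, and the character set would need review so we don't drop valid sites:

library(dplyr)

bad_id_pattern <- "[ /'.]"  # spaces, slashes, apostrophes, and periods have all broken calls
problem_sites <- filter(inventory, grepl(bad_id_pattern, MonitoringLocationIdentifier))
clean_sites <- filter(inventory, !grepl(bad_id_pattern, MonitoringLocationIdentifier))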
Instead of specifying site types, should we filter out the ones we know we aren't going to use? We can always do downstream filtering to remove data that doesn't meet our needs.
e.g.
library(dataRetrieval)
library(dplyr)

sites <- whatWQPdata(characteristicName = c("Temperature, sample",
                                            "Temperature, water",
                                            "Temperature",
                                            "Temperature, water, deg F")) %>%
filter(!ResolvedMonitoringLocationTypeName %in% c("Aggregate groundwater use",
"Well",
"Estuary",
"Ocean",
"Subsurface",
"Aggregate groundwater use",
"Atmosphere",
"Aggregate groundwater use "))
(but parameterized like you have in the config)
I'm trying to create a concise crosswalk between sites and what reach they were matched to. In your plotting examples, you use the column seg_id_reassigned
to plot the matched segments, so I assume this is the "matched" column. While doing this, I realized there are many duplicated site ids in the file 6_network/out/site_flowlines.rds.ind
, which I think means a site was matched to multiple reaches. I think this has to do with the part of the algorithm that looks to see if the site is closer to the endpoint of the upstream reach, but I'm not sure how to resolve the data to a single matched reach per site id.
library(scipiper)
library(dplyr)
library(ggplot2)
# id is a numeric ID that was created and is 1:nrow(sites)
matched <- readRDS(sc_retrieve('6_network/out/site_flowlines.rds.ind', 'getters.yml')) %>%
  group_by(id) %>%
  mutate(n = n())
# how many sites have more than one match?
length(unique(matched$id[matched$n > 1]))
# [1] 6935
# look at a site that was matched four times (site with the most matches)
top <- filter(matched, n == 4)
# plot the four reaches that were matched to this site, plus the original reach match
# original match
original <- filter(matched, seg_id_reassign %in% '31347') %>% distinct(Shape)
ggplot() +
geom_sf(data = top$Shape, aes(color = factor(top$seg_id_reassign))) +
geom_sf(data = top$Shape_site, color = 'red') +
geom_sf(data = original$Shape, color = 'black')
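One possible way to collapse to a single matched reach per site id, assuming the match table carries a distance-to-reach column (offset_dist is a hypothetical column name; the real one may differ), would be to keep the closest reach:

resolved <- matched %>%
  group_by(id) %>%
  slice_min(offset_dist, n = 1, with_ties = FALSE) %>%
  ungroup()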
Some uv sites have so much data that the call to NWIS takes a very long time, but it does not fail. For example, in the latest run, partition 240 ran overnight (tried two times) and still was not successful. Wonder if there's a way to time a call out after XX minutes or something? The problem is that the call to whatNWISdata does not give an accurate representation of what data are available for a given site.
Here is partition 240, and site 405356112205601 has high-frequency data at multiple depths. Most sites that cause problems are high-frequency lake sites with multiple depths. This site is Great Salt Lake. I dropped that site number from the partition and completed the partition 240 download outside of loop_tasks, but a better solution should be implemented. The easiest solution would be to filter this site.
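One option for capping a long call (a sketch, not something tested here): wrap the NWIS call in R.utils::withTimeout(), which interrupts the expression after a set time (though it can't interrupt code stuck in some non-interruptible C routines). The 30-minute limit and the partition_sites/start_date/end_date objects below are placeholders:

library(R.utils)

uv_dat <- tryCatch(
  withTimeout(
    dataRetrieval::readNWISuv(siteNumbers = partition_sites, parameterCd = '00010',
                              startDate = start_date, endDate = end_date),
    timeout = 30 * 60, onTimeout = "error"),
  TimeoutException = function(e) NULL)  # give up on this partition and move on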
The daily temperature data has 468,887 NA dates. These are the dates associated with the EcoSHEDS source.
While developing the national-flow-observations pipeline, Lindsay made some changes to the way the date is used in creating the pull task ID, which functions are used to pull the data from NWIS, and which dates are used in the pulls from NWIS. I think these changes are safer in terms of avoiding unnecessary rebuilds, and are more fool-proof, since I myself have forgotten to change the end date when I wanted to pull the most recent data.
In the flow pipeline, these are the major changes I want to implement:
to convey the density of data available by reach
Recently, we had to make a change to how we are inventorying WQP data to use in our partitions. Instead of a national call to whatWQPdata (which used to work, but now throws a 504), we are looping through states and asking whatWQPdata in each state. Sometimes we have to split the state in cases where the state has too much data (e.g., Florida).
This solution works great for sites that have a state label. Some do not. In some instances, these sites are out of the country and should not be included in our dataset. In other instances, the sites are missing the state label in error and should be included in our dataset. This amounts to thousands of sites, with more than a million records.
The real problem is that we wouldn't otherwise know we are missing these sites; I only know they're missing because I've archived old pulls and made comparisons.
Possible solutions I have thought of:
We pull this characteristic, but I think it is the temperature of a water grab sample. Note really cold temperatures throughout the year:
library(dplyr)
library(ggplot2)

sample_data <- readRDS('1_wqp_pull/out/wqp_data.rds') %>%
  filter(CharacteristicName %in% 'Temperature, sample') %>%
  mutate(doy = lubridate::yday(ActivityStartDate))

ggplot(sample_data, aes(x = doy, y = ResultMeasureValue)) +
  geom_point()
For now, I will remove these after the WQP pull (in munge_wqp_files.R), but we should remove this characteristic from the WQP pull params.
@limnoliver had a few questions after viewing approach II (issue #26) results:
Currently we choose the column (or sensor) with the most data, but some hints from the DRB pipeline suggest that we should check the output of this exercise. See the code below from the DRB NGWOS pull:
library(dplyr)

# retrieve remaining sites from NWISuv
# (missing_sites is a vector of site numbers defined earlier in the DRB NGWOS pull)
new_ngwos_uv <- dataRetrieval::readNWISuv(siteNumbers = missing_sites, parameterCd = '00010')
uv_long <- select(new_ngwos_uv, site_no, dateTime, ends_with('00010_00000')) %>%
tidyr::gather(key = 'temp_column', value = 'temp_c', - site_no, -dateTime)
uv_site_col <- filter(uv_long, !is.na(temp_c)) %>%
group_by(site_no, temp_column) %>%
summarize(n_vals = n(),
n_dates = length(unique(as.Date(dateTime)))) %>%
filter(!grepl('piezometer', temp_column, ignore.case = TRUE))
# always choose the standard temp column. In cases where that is missing, choose the one on that day
# with the most data
# first take day-temp type means
uv_long_dailies <- filter(uv_long, !is.na(temp_c)) %>%
filter(!grepl('piezometer', temp_column, ignore.case = TRUE)) %>%
group_by(site_no, date = as.Date(dateTime), temp_column) %>%
summarize(temp_c = mean(temp_c),
n_obs = n()) %>%
left_join(select(uv_site_col, site_no, temp_column, n_dates))
# find the temperature for each site-day
# first choose standard temp column, then choose one with most data when available
uv_dat <- uv_long_dailies %>%
group_by(site_no, date) %>%
summarize(temp_c = ifelse(grepl('X_00010_00000', paste0(temp_column, collapse = ', ')),
temp_c[which(temp_column %in% 'X_00010_00000')], temp_c[which.max(n_dates)]),
temp_column = ifelse(grepl('X_00010_00000', paste0(temp_column, collapse = ', ')),
'X_00010_00000', temp_column[which.max(n_dates)]),
n_obs = ifelse(grepl('X_00010_00000', paste0(temp_column, collapse = ', ')),
n_obs[which(temp_column %in% 'X_00010_00000')], n_obs[which.max(n_dates)])) %>%
mutate(source = 'nwis_uv') %>%
select(site_id = site_no, date, temp_c, n_obs, source)
Right now these data are not included in the final combined dataset. I don't totally recall why, but I think it was related to some of the records potentially being lake sites, or not knowing how we should handle multiple depths. One thing we could consider is to treat a depth as a "sub_location", similar to how sensors at different depths are handled in NWIS.
Similar to the solution implemented in delaware-model-prep, move all targets built with get statements to getters.yml. I started this process when adding new data in 4_other_sources, but did not do it for the whole pipeline.
To align with the version of scipiper we are using in the temperature-prep repo.
682466f - similar to what we're doing in the NWIS pull, use .qs files for temporary files in WQP pull to speed up read/write.
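For reference, the .qs swap amounts to something like this (partition_data and tmp_file are placeholder names; qsave/qread are from the qs package):

qs::qsave(partition_data, tmp_file)    # instead of saveRDS(partition_data, tmp_file)
partition_data <- qs::qread(tmp_file)  # instead of readRDS(tmp_file)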
See #15. Many sites (mostly/entirely from WQP?) have really strange lat-lon coordinates.
Currently, the workflow creates a loop to partition the calls to whatWQPdata to limit the size of each call; the partition size is set to 1000 sites. Each call takes ~1 minute, so with ~316k sites with temperature data, the whole inventory takes ~5 hours.
If the process fails at any point, all progress is lost.
So, it seems like we need to write intermediate inventory files. I think I can just write these locally to a tmp folder and git ignore both the files and indicator files.
@aappling-usgs or @jread-usgs - off the top of your head, do you know of other places we've solved this (before I dive in)?
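A rough sketch of the tmp-folder idea (the partitions object and the get_partition_inventory() helper are hypothetical stand-ins for however the partitioned whatWQPdata calls are currently made):

dir.create('tmp/inventory', recursive = TRUE, showWarnings = FALSE)

inventory_list <- lapply(seq_along(partitions), function(i) {
  out_file <- sprintf('tmp/inventory/partition_%03d.rds', i)
  if (!file.exists(out_file)) {
    # only query WQP for partitions we haven't already inventoried
    part_inv <- get_partition_inventory(partitions[[i]])
    saveRDS(part_inv, out_file)
  }
  readRDS(out_file)
})
inventory <- dplyr::bind_rows(inventory_list)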
From #23:
- UV data munge was causing memory failures in R. My solution was to reduce down to daily mean values in the combine step, so that the raw data is not preserved in the shared cache. I think this approach is ok, since we have a reproducible pipeline + are not using the raw data.
I agree that this is OK, but I also want to document David's suggestion from standup that this pull could be done on one of the USGS clusters. This could potentially solve two problems: (1) the UV data munge probably(?) won't cause memory failures if the available memory is larger, and (2) in theory, though we've not yet tried this, having the data pull on a cluster would allow multiple people access to the raw data pull without going through the shared cache. Given the unknowns with each of these objectives, I'm not pushing hard for this switch, but let's at least keep it on the table.
If we did this, I think we'd do the pulling on a data transfer node (to be good cluster citizens) and then switch over to a login node -> SLURM-allocated job to get the larger memory needed to process the data.
I am reviewing the workflow of this repo for another project and I noticed that there seems to be a bug in wqp_partition_config.yml: the note says the max WQP pull size is 25,000 but the code says 250,000. Not sure if this is intentional, but it seemed worth drawing attention to in case it's a typo.
In WQP munge, I use a case_when statement. Alison pointed out that this could be greatly simplified.