kratzert / caravan Goto Github PK

A global community dataset for large-sample hydrology

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 87.05% Python 12.95%

dataset

caravan's Introduction

Hi there 👋

I'm Frederik Kratzert, Researcher @ Google working in the Flood Forecasting Team. This is my private GitHub account, where I maintain my open source projects and publish code related to research articles.

🤓 Research 🤔

Most of my research is dedicated towards solving applications in environmental sciences (mainly hydrology) with machine learning.


caravan's Issues

[DATA CONTRIBUTION] GRDC-Caravan extension

Basin prefix

grdc

Zenodo DOI

https://zenodo.org/records/10074416

Number of catchments

5357

Location of catchments

Globally (25 countries)

For which periods are streamflow records available in your dataset?

1950-2023, varying length for each basin.

Please list any sources of the data you contributed.

GRDC/BfG.
National Hydrological and Hydro-Meteorological Services of WMO Member States of the 25 countries.

License

CC-BY-4.0

Additional context

Authors: Färber, Claudia; Plessow, Henning; Kratzert, Frederik; Addor, Nans; Shalev, Guy; Looser, Ulrich

The extension includes a subset of those hydrological discharge data and station-based watersheds from the Global Runoff Data Centre (GRDC), which are covered by an open data policy (Attribution 4.0 International; CC BY 4.0). In total, the dataset covers stations from 5357 catchments and 25 countries worldwide with a time series record from 1950 – 2023.

GRDC is an international data centre operating under the auspices of the World Meteorological Organization (WMO) at the German Federal Institute of Hydrology (BfG). Established in 1988, it holds the most substantive collection of quality assured river discharge data worldwide. Primary providers of river discharge data and associated metadata are the National Hydrological and Hydro-Meteorological Services of WMO Member States.

Because of the size of this extension, we provide an archive with all timeseries data as csv files and one archive with all timeseries data as netcdf files. Both are available from the Zenodo link.

Note: This extension contains basins of all sizes, ignoring the 2000km2 threshold. In order to be able to process really large basins on EarthEngine, we slightly adapted the script that computes the attributes, which will be pushed to this repository soon.

Checklist

I have uploaded my dataset on Zenodo, where it is accessible under the DOI provided above.
I used a basin prefix that is not yet used by any other Caravan sub-dataset (you can check this via the Data Contributions discussion thread, where all accepted Caravan contributions are listed).
Permissive License: My data is available under a license that is compatible with the Caravan CC-BY-4.0 license (the easiest way to be sure about this is if your data uses CC-BY-4.0, too).

Missing attributes in many netCDF files

In the latest release of caravan around 6000 of the netCDF files are missing dataset-level attributes (and don't have variable-level attributes either).

The files missing attributes are in the list attached;
missing_attrs.txt

[DATA CONTRIBUTION] GAGES II

Basin prefix

gages-ii

Zenodo DOI

TBD

Number of catchments

~9,000

Location of catchments

United States

For which periods are streamflow records available in your dataset?

1900-present

Please list any sources of the data you contributed.

The GAGES II dataset consists of gages which have had either 20+ complete years (not necessarily continuous) of discharge record since 1950, or are currently active, as of water year 2009, and whose watersheds lie within the United States, including Alaska, Hawaii, and Puerto Rico. Reference gages were identified based on indicators that they were the least-disturbed watersheds within the framework of broad regions, based on 12 major ecoregions across the United States. Of the 9,322 total sites, 2,057 are classified as reference, and 7,265 as non-reference. Of the 2,057 reference sites, 1,633 have (through 2009) 20+ years of record since 1950. Some sites have very long flow records: a number of gages have been in continuous service since 1900 (at least), and have 110 years of complete record (1900-2009) to date.

License

CC0

Additional context

Is there interest in adding USGS's GAGES II dataset of ~9000 sites? We could easily pull the polygons from the National Hydrography Dataset (NHD). Data summary below. Of note, these are long-term streamgages but they may have data gaps, so you may prefer a subset: only reference, only complete, etc.

Checklist

I have uploaded my dataset on Zenodo, where it is accessible under the DOI provided above.
I used a basin prefix that is not yet used by any other Caravan sub-dataset (you can check this via the Data Contributions discussion thread, where all accepted Caravan contributions are listed).
Permissive License: My data is available under a license that is compatible with the Caravan CC-BY-4.0 license (the easiest way to be sure about this is if your data uses CC-BY-4.0, too).

Caravan data hosted on OpenDAP server

Thanks for working on this and putting the data online! For our eWaterCycle project we wanted easier access to the separate basins contained in the Caravan dataset. A data hosting service we have access to has an OpenDAP server, so we wanted to put it there.

I reorganized the data: added the attributes (units, basin properties) to the netCDF files, merged them per collection (i.e. one file for each sub-dataset, such as CAMELS), and compressed the netCDF files.

The data is available on:
https://doi.org/10.4121/ca13056c-c347-4a27-b320-930c2a4dd207

And can be accessed like this in xarray:

import xarray as xr

# open CAMELS US (read lazily over OpenDAP):
ds = xr.open_dataset("https://opendap.4tu.nl/thredds/dodsC/data2/djht/ca13056c-c347-4a27-b320-930c2a4dd207/1/camels.nc")

# select the basin of interest (basin IDs are stored as bytes):
ds.sel(basin_id=b"camels_01022500")

# plot the air temperature:
ds.sel(basin_id=b"camels_01022500")["temperature_2m_mean"].plot()

[DATA CONTRIBUTION] Caravan extension Denmark

Basin prefix

camelsdk

Zenodo DOI

https://zenodo.org/record/7396466

Number of catchments

308

Location of catchments

Denmark

For which periods are streamflow records available in your dataset?

01-01-1989 until 31-12-2019

Please list any sources of the data you contributed.

Koch, J., & Schneider, R. (2022). Long short-term memory networks enhance rainfall-runoff modelling at the national scale of Denmark. GEUS Bulletin, 49. https://doi.org/10.34194/geusb.v49.8292

License

Creative Commons Attribution 4.0 International

Additional context

No response

Checklist

I have uploaded my dataset on Zenodo, where it is accessible under the DOI provided above.
I used a basin prefix that is not yet used by any other Caravan sub-dataset (you can check this via the Data Contributions discussion thread (TODO link), where all accepted Caravan contributions are listed).
Permissive License: My data is available under a license that is compatible with the Caravan CC-BY-4.0 license (the easiest way to be sure about this is if your data uses CC-BY-4.0, too).

[DATA CONTRIBUTION] CAMELS-ES

Basin prefix

Zenodo DOI

10.5281/zenodo.8373021

Number of catchments

269

Location of catchments

Spain

For which periods are streamflow records available in your dataset?

1st October 1991 to 30th September 2020

Please list any sources of the data you contributed.

The discharge records are taken from the "Anuario de Aforos" (https://ceh.cedex.es/anuarioaforos/default.asp),
a public repository of hydrological records in Spain curated by CEDEX (Centro de Estudios y Experimentación
de Obras Públicas) and supported by the gauging networks of the several water agencies in the country.

The European Flood Awareness System (EFAS) version 5.0 is part of the Copernicus Emergency Management Services (https://emergency.copernicus.eu/)

EFASv5 simulated discharge is publicly available in the Copernicus Climate Data Store (https://cds.climate.copernicus.eu)
EFAS static maps are publicly available in the Joint Research Centre data catalog (https://data.jrc.ec.europa.eu/dataset/f572c443-7466-4adf-87aa-c0847a169f23)
The EMO1 dataset is publicly available in the Joint Research Centre data catalog (http://data.europa.eu/89h/0bd84be4-cec8-4180-97a6-8b3adaac4d26 )

HydroATLAS is licensed under a Creative Commons Attribution (CC-BY) 4.0 International Licence (https://www.hydrosheds.org)

ERA5-Land hourly data is publicly available in the Copernicus Climate Data Store (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=overview)

The processing of HydroATLAS and ERA5 data was done using the Caravan GitHub repository (https://github.com/kratzert/Caravan).

License

CC BY 4.0

Additional context

The official network of gauging stations contains 1074 points. A selection has been made to remove stations with fewer than 8 years of data in the period 1991-2020, and those whose records show a clear alteration caused by the presence of reservoirs upstream. Unfortunately, after filtering, the distribution of catchments around the country is uneven, representing mostly central and northern Spain. In the future I will explore the data sets from the water agencies, both to cover a larger part of the country and to see whether it is possible to increase the temporal resolution of the time series.

Distribution of the selected and discarded gauging stations

In addition to the usual attributes and time series that the Caravan notebooks generate, I have added new attributes and the simulated discharge from the recently released EFAS version 5 (European Flood Awareness System).

Checklist

I have uploaded my dataset on Zenodo, where it is accessible under the DOI provided above.
I used a basin prefix that is not yet used by any other Caravan sub-dataset (you can check this via the Data Contributions discussion thread, where all accepted Caravan contributions are listed).
Permissive License: My data is available under a license that is compatible with the Caravan CC-BY-4.0 license (the easiest way to be sure about this is if your data uses CC-BY-4.0, too).

[DATA CONTRIBUTION] Caravan extension Israel

Basin prefix

Zenodo DOI

https://doi.org/10.5281/zenodo.7758516

Number of catchments

Location of catchments

Israel

For which periods are streamflow records available in your dataset?

01/01/1970-01/01/2021

Please list any sources of the data you contributed.

https://data.gov.il/dataset/level_discharge
https://www.gov.il/he/departments/topics/water_authority_maps/govil-landing-page

License

CC-BY-4.0

Additional context

No response

Checklist

I have uploaded my dataset on Zenodo, where it is accessible under the DOI provided above.
I used a basin prefix that is not yet used by any other Caravan sub-dataset (you can check this via the Data Contributions discussion thread (TODO link), where all accepted Caravan contributions are listed).
Permissive License: My data is available under a license that is compatible with the Caravan CC-BY-4.0 license (the easiest way to be sure about this is if your data uses CC-BY-4.0, too).

Bug affecting a handful of attributes

Hi all,

while working on the code for an upcoming change, I figured out that there is a bug that affects a handful of attributes that are derived from the most downstream HydroATLAS polygon that significantly intersects with the basin polygon. This is related to #15.

There is a part in the code where we traverse the HydroATLAS polygons along the NEXT_DOWN attribute until we reach a polygon that does not significantly overlap anymore with the basin polygon. This way, we can identify the most downstream HydroATLAS polygon and use this polygon to derive a couple of attributes that are defined for the pour point rather than for the polygon. This is e.g. the total upstream reservoir volume or the river area.

The problem is that, in a different part of the code, we remove intersecting polygons if the intersecting area is not larger than a predefined threshold (5 sqkm), without considering the size of the HydroATLAS polygon at all. There are, however, polygons in HydroATLAS level 12 that are themselves smaller than 5 sqkm, so even with a 100% intersection these polygons would be removed. If these small polygons are removed, our algorithm for detecting the most downstream polygon might fail and identify the wrong polygon.
Instead, we should also consider the relative overlap with the basin polygon (as also suggested by @jonschwenk in #15) when filtering out intersecting polygons. If we only look at the percentage of overlap, though, we might run into different problems for small basins that e.g. only intersect with less than half of a single HydroATLAS polygon. Therefore, we will apply both filters together.
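A minimal sketch of how the two filters might be combined, under the assumption that a polygon is kept when it passes either the absolute-area test or the relative-overlap test. The function name, the argument names and the 0.5 fraction are hypothetical and are not taken from the Caravan code; only the 5 sqkm threshold comes from the description above.

```python
def keep_polygon(intersect_km2, polygon_km2, min_area_km2=5.0, min_fraction=0.5):
    """Decide whether a HydroATLAS polygon counts as intersecting the basin.

    intersect_km2: area shared by the HydroATLAS polygon and the basin polygon.
    polygon_km2:   total area of the HydroATLAS polygon.
    min_fraction:  hypothetical relative-overlap threshold (an assumption).
    """
    if polygon_km2 <= 0:
        return False
    fraction = intersect_km2 / polygon_km2
    # Keep the polygon if either criterion is met, so that tiny polygons with
    # full overlap and large polygons with a sizeable overlap both survive.
    return intersect_km2 > min_area_km2 or fraction >= min_fraction


# A 3 sqkm polygon fully inside the basin: fails the area test, passes the fraction test.
print(keep_polygon(3.0, 3.0))
# A small basin covering a third of one large polygon: passes the area test only.
print(keep_polygon(50.0, 150.0))
# A sliver of overlap with a large polygon: fails both tests.
print(keep_polygon(2.0, 100.0))
```

Combining the tests with a logical "or" is one reading of "apply both filters together"; it addresses the two failure cases named above, but the actual implementation may differ.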

I am currently working on adapting the code accordingly, then I will update the dataset with new attribute files and also reach out to the authors of the 3 extensions.

The affected attributes are

pour_point_properties = ['dis_m3_pmn', # natural discharge annual min
                         'dis_m3_pmx', # natural discharge annual max
                         'dis_m3_pyr', # natural discharge annual mean
                         'lkv_mc_usu', # Lake volume
                         'rev_mc_usu', # Reservoir volume
                         'ria_ha_usu', # River area
                         'riv_tc_usu', # River volume
                         'pop_ct_usu', # Population count in upstream area
                         'dor_pc_pva', # Degree of regulation in upstream area
                        ]

[DATA CONTRIBUTION] Caravan extension Iceland

Basin prefix

lamahice

Zenodo DOI

https://www.hydroshare.org/resource/86117a5f36cc4b7c90a5d54e18161c91/

Number of catchments

Location of catchments

Iceland

For which periods are streamflow records available in your dataset?

Meteorological time series are available from 1950-01-01 to 2021-09-23. The streamflow records are of variable length. The earliest start date is 1950-01-01, with the most common end date being 2021-09-29.

Please list any sources of the data you contributed.

The streamflow records were obtained from the Icelandic Meteorological Office and the National Power Company of Iceland.

License

The Creative Commons Attribution NonCommercial 4.0 International License CC-BY-NC-4.0 applies to the streamflow measurements. The Creative Commons Attribution 4.0 International License CC-BY-4.0 applies to all other data (meteorological timeseries from ERA5-Land, shapefiles and attribute values).

Additional context

This data was originally published as the LamaH-Ice dataset by Helgason and Nijssen (2023). Compared to the original LamaH-Ice, the Caravan extension does not include basins with a strong anthropogenic or natural influence on streamflow (20 basins). The LamaH-Ice data description paper (in review as of October 9 2023) is available at https://essd.copernicus.org/preprints/essd-2023-349/

Checklist

I have uploaded my dataset on Zenodo, where it is accessible under the DOI provided above.
I used a basin prefix that is not yet used by any other Caravan sub-dataset (you can check this via the Data Contributions discussion thread, where all accepted Caravan contributions are listed).
Permissive License: My data is available under a license that is compatible with the Caravan CC-BY-4.0 license (the easiest way to be sure about this is if your data uses CC-BY-4.0, too).

HydroAtlas attributes aggregation issues

I was comparing the output of the Caravan Part 1 code (HydroATLAS attribute aggregation) against my own version where I process everything locally without GEE. I have found some discrepancies and will explain them here.

Non-accumulated discrepancies

pop_ct_ssu: According to the BasinAtlas Catalog, this is a summed variable per-basin, so its aggregation strategy should be a sum as well, but Caravan is taking the mean.
gdp_ud_ssu: Same as pop_ct_ssu but there is also a gdp_ud_sav so you could eliminate this one.
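To illustrate why the aggregation strategy matters for a summed variable, here is a small sketch comparing the two strategies on made-up numbers. It assumes that the mean is area-weighted and that a count is scaled by the fraction of each polygon inside the basin (i.e. population is spread uniformly within a polygon); both are assumptions for illustration, and the function and argument names are hypothetical.

```python
def aggregate(values, overlap_fractions, polygon_areas_km2, how):
    """Aggregate per-polygon HydroATLAS values to a single basin value.

    values:            attribute value of each intersecting polygon.
    overlap_fractions: fraction of each polygon that lies inside the basin.
    polygon_areas_km2: total area of each polygon.
    how:               "sum" for count-like variables, "mean" for intensive ones.
    """
    if how == "sum":
        # Scale each count by the share of the polygon inside the basin.
        return sum(v * f for v, f in zip(values, overlap_fractions))
    if how == "mean":
        # Weight each value by the intersecting area.
        weights = [f * a for f, a in zip(overlap_fractions, polygon_areas_km2)]
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)
    raise ValueError(f"unknown aggregation: {how}")


# Two polygons with populations 1000 and 3000; the second is half inside the basin.
population = [1000.0, 3000.0]
fractions = [1.0, 0.5]
areas = [100.0, 100.0]

print(aggregate(population, fractions, areas, "sum"))   # basin population
print(aggregate(population, fractions, areas, "mean"))  # average per-polygon population
```

With these numbers the sum gives a basin population of 2500, whereas the mean gives roughly 1667, which is a per-polygon average rather than a basin total.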

Accumulated discrepancies

The largest discrepancies by far were for accumulated/pour point attributes. These are tricky to get right since many of the accumulated variables have large step-changes as we move downstream, so tributaries/junctions become problematic. If I understand correctly, the Caravan approach takes the HydroBasin with the largest up_area that meets some overlap threshold area as the "downstream basin." It's a reasonable approach but I think it's important that users know the possible limitations. Here is an illustration for the reservoir volume attribute (rev_mc_usu):

The small polygons are HydroATLAS basins; the red polygon is a watershed boundary delineated from MERIT-Hydro. HydroATLAS polygons are colored by their rev_mc_usu value, which is 0 for most polygons and large for a handful. The target watershed should have a rev_mc_usu value of 0, but the Caravan code (with the default MIN_OVERLAP_THRESHOLD=5) returns 99068. The figure above shows why that is--the MERIT basin delineation overlaps one of the HydroATLAS basins that isn't part of the intended watershed just enough (i.e. more than 5km^2) that it is sampled as the downstream basin.

The problem is that the MIN_OVERLAP_THRESHOLD thus needs to be specified for each basin to avoid this mistake for accumulated attributes. That's not really feasible, since a user would have to compare each of their basins with the HydroATLAS basins to determine if there are unintended overlaps and then set the threshold. I don't have a good solution to propose for the Caravan/GEE approach, but if it's helpful, I have a simple solution that works pretty well for my local (non-GEE) processing:

1) Compute the fraction of each HydroATLAS polygon that is within the watershed polygon.
2) Select the HydroATLAS polygon with the largest value from 1) as the first possible_basin. This ensures that we are starting with a basin that is definitely within the watershed polygon.
3) "Walk" downstream using the next_down attribute of the possible_basin.
4) For each "step" (i.e. each new candidate basin) that is taken, check that its value from 1) is above some threshold (I suggest 0.75 based on my testing). This ensures that the possible_basin is indeed a part of the target watershed polygon.
5) Quit when the condition for 4) is not met. The last possible_basin is then the downstream-most basin from which to sample the accumulated attribute.
(Some handling of cases where no basins meet 2) is required.)

Here's my code that does this:

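    import numpy as np

    # df: one row per intersecting HydroATLAS polygon, with the columns
    # 'hybas_id', 'next_down' and 'frac_basin_cover' (the fraction from step 1).
    # Start from the polygon with the largest fraction inside the watershed.
    this_idx = np.argmax(df['frac_basin_cover'].values)
    possible_basin = df['hybas_id'].values[this_idx]
    while True:
        # Look up the polygon directly downstream of the current one.
        this_basin = df['next_down'].values[this_idx]
        matches = np.where(df['hybas_id'].values == this_basin)[0]
        if len(matches) == 0:
            # The downstream polygon does not intersect the watershed: stop.
            break
        this_idx = matches[0]

        # Stop once the downstream polygon lies mostly outside the watershed.
        if df['frac_basin_cover'].values[this_idx] < 0.75:
            break
        possible_basin = this_basin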
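For readers who want to try the walk without a DataFrame, the same idea can be sketched as a self-contained function on plain dictionaries. This is an illustrative variant, not the author's code: the function name, the dictionary layout, the guard against revisiting a polygon, and the toy IDs are all assumptions.

```python
def find_downstream_basin(frac_cover, next_down, min_frac=0.75):
    """Return the ID of the most downstream polygon that belongs to the watershed.

    frac_cover: {hybas_id: fraction of that polygon inside the watershed}.
    next_down:  {hybas_id: hybas_id of the polygon directly downstream}.
    Returns None if no polygon intersects the watershed.
    """
    if not frac_cover:
        return None
    # Start from the polygon that is most certainly inside the watershed.
    current = max(frac_cover, key=frac_cover.get)
    visited = {current}
    while True:
        candidate = next_down.get(current)
        # Stop if the downstream polygon does not intersect the watershed
        # (or if the topology unexpectedly loops back on itself).
        if candidate not in frac_cover or candidate in visited:
            break
        # Stop if the downstream polygon is mostly outside the watershed.
        if frac_cover[candidate] < min_frac:
            break
        visited.add(candidate)
        current = candidate
    return current


# Toy network 1 -> 2 -> 3 -> 4, plus a tributary polygon 5 -> 4 that the
# watershed boundary only clips slightly.
frac_cover = {1: 1.0, 2: 0.9, 3: 0.8, 4: 0.1, 5: 0.06}
next_down = {1: 2, 2: 3, 3: 4, 4: 0, 5: 4}
print(find_downstream_basin(frac_cover, next_down))
```

In this toy network the walk stops at polygon 3, because polygon 4 is only 10% covered, and the slightly clipped tributary polygon 5 is never visited.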

I think something like this could be implemented in GEE. I have tested this for ~1000 watersheds across the Arctic against the Caravan implementation. Here's a snapshot of the largest discrepancies:

v_mean is my method (VotE), c_mean is Caravan. For these mean values, I have averaged the variable's value across all ~1000 watersheds. The errors are percentages, (v_mean-c_mean)/v_mean. While I can't say that all the attributes' discrepancies are due to the above issue, the four that I have looked at in detail are.

Using percent errors as I have isn't a great method, since it scales with the range of the particular variable (e.g. in the above example, the range is from 0 to 99068 so the percent error is huge), but it highlights the variables that disagree between the two methods. To check this, here is a histogram of discrepancies for the rev_mc_usu attribute between VotE and Caravan:

The overwhelming majority of watersheds agree perfectly, but a non-negligible fraction have very large discrepancies. I manually checked five of these large discrepancies, and each one was due to the above issue.

Anyway, I was hoping to contribute to the Caravan collection with Arctic watersheds, but I would rather use my local aggregation methods since they seem to find the downstream HydroATLAS basin more reliably. I think users of Caravan datasets should be aware of this possible issue in accumulated HydroATLAS attributes: while the issue is infrequent, it can make an enormous difference in the returned attribute value!

Caravan_extension_CH

Basin prefix

camelsch

Zenodo DOI

https://doi.org/10.5281/zenodo.7928595

Number of catchments

296

Location of catchments

Switzerland (and bordering regions of neighboring countries)

For which periods are streamflow records available in your dataset?

1981-01-01 to 2020-12-31

Please list any sources of the data you contributed.

Streamflow data was provided by the Swiss Federal Office for the Environment (FOEN/BAFU). Further data providers are referenced and acknowledged in the corresponding publication of CAMELS-CH, available at the Zenodo repository.

License

Creative Commons Attribution 4.0 International License [CC-BY-4.0]

Additional context

The Caravan_extension_CH is published together with a CAMELS-CH dataset that aside of daily hydro-meteorological time series and static attributes contains additional information, e.g. on glacier data annual time series, Swiss lakes and further attributes.

Checklist

I have uploaded my dataset on Zenodo, where it is accessible under the DOI provided above.
I used a basin prefix that is not yet used by any other Caravan sub-dataset (you can check this via the Data Contributions discussion thread, where all accepted Caravan contributions are listed).
Permissive License: My data is available under a license that is compatible with the Caravan CC-BY-4.0 license (the easiest way to be sure about this is if your data uses CC-BY-4.0, too).

Missing values in Caravan Attributes

Hi,

Maybe this has already been detected; if not, it is probably an easy fix.
I found 158 basins for which the Caravan attributes are missing, mostly in the GB collection but some in GR and AUS.
All of them have a small area (less than the suggested 100 km2).

Note: the DK collection is all OK.

Thank you and keep going!
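A check like the one described can be sketched with the standard library alone. The snippet below is only an illustration: it assumes the attributes are stored as CSV with a gauge_id column, and the sample rows, basin IDs and attribute names are made up rather than taken from the dataset.

```python
import csv
import io


def basins_with_missing_values(csv_text, id_column="gauge_id"):
    """Map each basin ID to the list of attribute columns that are empty or NaN."""
    missing = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        empty = [
            key
            for key, value in row.items()
            if key not in (None, id_column)
            and (value is None or value.strip().lower() in ("", "nan"))
        ]
        if empty:
            missing[row[id_column]] = empty
    return missing


# Made-up sample in the spirit of an attributes file (not real Caravan values).
sample = """gauge_id,area,p_mean
camelsgb_1,250.0,3.1
camelsgb_2,45.0,
camelsaus_3,80.0,nan
"""

print(basins_with_missing_values(sample))
```

Running the same function over each collection's attribute files would give a per-collection list of affected basins, which could then be compared against basin area.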

Are the correct ERA5-Land variables/bands being used?

The ERA5-Land hourly dataset on GEE provides (somewhat confusingly) two types of time-aggregated bands for some variables. For example, there is total_precipitation and total_precipitation_hourly. The definitions of these according to the Data Catalog are as follows:

total_precipitation : [skipping stuff here related to defining precipitation] This variable is accumulated from the beginning of the forecast time to the end of the forecast step. The units of precipitation are depth in meters. It is the depth the water would have if it were spread evenly over the grid box.

total_precipitation_hourly : Same as 'total_precipitation' except not accumulated and only for the given forecast step.

Currently, Caravan code is using the (daily) sum of total_precipitation. My reading of the above descriptions makes me think this might be an error. It's not completely clear to me what "end of forecast step" means (what is the time of a forecast step?), but if the forecast step were an hour, I don't see why the _hourly bands would need to exist. This stackexchange post touches on this but doesn't really clarify what "end of forecast step" means.

To check this, I put together a basic GEE example to compute the (spatially) average daily precip in mm from daymet, ERA5L-Hourly total_precipitation, and ERA5L-Hourly total_precipitation_hourly for a single day. You can see that total_precipitation vastly overestimates the daily value, whereas the total_precipitation_hourly variable is the same order of magnitude as the Daymet.

If this is indeed an error, it seems like there are two options: use the x_hourly band instead of just x for the affected variables, or use the max() operator. You can see from my GEE example that using the max() gets you much closer to the benchmark value, but still doesn't agree with the x_hourly version.

In short, it seems that bands with corresponding _hourly bands are cumulative (relative to what, I'm not sure, maybe it resets at 00:00 each day), so taking their sum isn't what we want to do. I might be wrong here as I assume you'd probably notice this when you looked at Caravan precip time series and see that they're order-of-magnitude(s) larger than they should be?

If this is an error, it looks like it would apply to total_precipitation and potential_evaporation the most, but other bands such as surface_net_solar_radiation also have hourly versions that should probably be used. The error wouldn't be as bad for those since we're taking the mean for them.
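The arithmetic behind this concern can be shown with synthetic numbers. The sketch below assumes a band that accumulates from the start of the window and is never reset within it; when the real band resets is exactly the open question above, so this is an illustration of the effect and not a statement about ERA5-Land itself.

```python
from itertools import accumulate


def deaccumulate(accumulated):
    """Convert a running total into per-step increments."""
    if not accumulated:
        return []
    increments = [accumulated[0]]
    for previous, current in zip(accumulated, accumulated[1:]):
        increments.append(current - previous)
    return increments


# Synthetic per-step precipitation in mm (not real data).
hourly_mm = [0, 0, 2, 3, 0, 1]
# What an accumulated band would report at each step.
accumulated_mm = list(accumulate(hourly_mm))

true_total = sum(hourly_mm)        # what a daily sum should give
naive_sum = sum(accumulated_mm)    # summing the accumulated band overestimates
max_value = max(accumulated_mm)    # the maximum recovers the total in this toy case

print(true_total, naive_sum, max_value)
print(deaccumulate(accumulated_mm) == hourly_mm)
```

Here the true total is 6 mm while the naive sum of the accumulated values is 18 mm. The maximum only matches the true total because the toy series never resets inside the window, which may not hold for the real band.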

lamahice attributes lat/lon values

Hi everyone!
Thanks a lot to all of you who contributed to the Caravan dataset and made it so large and helpful for large-sample studies!
I just realized that in the attributes_other_lamahice file, the latitude and longitude values seem to be switched.
I hope that this is the right place to address this issue.
Best wishes, Franziska
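A quick plausibility check for swapped coordinates could look like the sketch below. The bounding box for Iceland is an approximate assumption, the function names are hypothetical, and the example coordinates are illustrative rather than values from the attributes file.

```python
# Approximate bounding box for Iceland in degrees (an assumption for illustration).
ICELAND_BBOX = {"lat": (63.0, 67.0), "lon": (-25.0, -13.0)}


def in_bbox(lat, lon, bbox=ICELAND_BBOX):
    """Return True if the point lies inside the bounding box."""
    lat_min, lat_max = bbox["lat"]
    lon_min, lon_max = bbox["lon"]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


def looks_swapped(lat, lon, bbox=ICELAND_BBOX):
    """Return True if the point only fits the box after exchanging lat and lon."""
    return not in_bbox(lat, lon, bbox) and in_bbox(lon, lat, bbox)


# A gauge stored as (lat=-21.9, lon=64.1) only makes sense once swapped.
print(looks_swapped(-21.9, 64.1))
# The same gauge stored correctly is left alone.
print(looks_swapped(64.1, -21.9))
```

A range check alone would not catch this case, since both values are valid latitudes and longitudes; the regional bounding box is what makes the swap detectable.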
