I was comparing the output of the Caravan Part 1 code (HydroATLAS attribute aggregation) against my own version where I process everything locally without GEE. I have found some discrepancies and will explain them here.
Non-accumulated discrepancies
pop_ct_ssu: According to the BasinATLAS Catalog, this is a summed variable per basin, so its aggregation strategy should be a sum as well, but Caravan is taking the mean.
gdp_ud_ssu: Same as pop_ct_ssu, but there is also a gdp_ud_sav, so you could eliminate this one.
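To make the difference concrete, here is a minimal sketch of the two aggregation strategies for a summed attribute. The polygon values are made up, and weighting each polygon by the fraction of it that lies inside the watershed is an assumption made for this illustration, not something taken from the Caravan code.

```python
# Hypothetical per-polygon values for a summed attribute such as pop_ct_ssu.
# Each tuple: (value in polygon, fraction of polygon inside the watershed).
polygons = [
    (1200.0, 1.0),
    (800.0, 1.0),
    (400.0, 0.5),
]

def aggregate_sum(polys):
    """Watershed total: each polygon contributes its covered share."""
    return sum(value * frac for value, frac in polys)

def aggregate_mean(polys):
    """Coverage-weighted mean of the per-polygon values."""
    total_weight = sum(frac for _, frac in polys)
    return sum(value * frac for value, frac in polys) / total_weight

watershed_sum = aggregate_sum(polygons)    # 1200 + 800 + 200
watershed_mean = aggregate_mean(polygons)  # same numerator divided by 2.5
```

For a count-like variable the two results differ by roughly the number of polygons in the watershed, so the mean underestimates more strongly as watersheds get larger.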
Accumulated discrepancies
The largest discrepancies by far were for accumulated/pour point attributes. These are tricky to get right since many of the accumulated variables have large step-changes as we move downstream, so tributaries/junctions become problematic. If I understand correctly, the Caravan approach takes the HydroBasin with the largest up_area that meets some overlap threshold area as the "downstream basin." It's a reasonable approach, but I think it's important that users know the possible limitations. Here is an illustration for the reservoir volume attribute (rev_mc_usu):
The small polygons are HydroATLAS basins; the red polygon is a watershed boundary delineated from MERIT-Hydro. HydroATLAS polygons are colored by their rev_mc_usu value, which is 0 for most polygons and large for a handful. The target watershed should have a rev_mc_usu value of 0, but the Caravan code (with the default MIN_OVERLAP_THRESHOLD=5) returns 99068. The figure above shows why: the MERIT basin delineation overlaps one of the HydroATLAS basins that isn't part of the intended watershed just enough (i.e. more than 5 km^2) that it is sampled as the downstream basin.
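The failure mode can be reproduced with a toy example. The sketch below encodes the selection rule only as it is described above (largest up_area among basins whose overlap exceeds the threshold); it is not the actual Caravan/GEE code, and the basin records, the overlap_km2 field, and every value other than 99068 are invented.

```python
MIN_OVERLAP_THRESHOLD = 5  # km^2, the default mentioned above

# Hypothetical HydroATLAS basins intersecting one watershed polygon.
basins = [
    {"hybas_id": 1, "up_area": 150.0, "overlap_km2": 60.0, "rev_mc_usu": 0},
    {"hybas_id": 2, "up_area": 310.0, "overlap_km2": 95.0, "rev_mc_usu": 0},
    # Neighbouring main-stem basin clipped by a sliver of the watershed polygon.
    {"hybas_id": 3, "up_area": 48000.0, "overlap_km2": 6.0, "rev_mc_usu": 99068},
]

def pick_downstream_by_up_area(basins, min_overlap):
    """Return the basin with the largest up_area among those overlapping enough."""
    candidates = [b for b in basins if b["overlap_km2"] > min_overlap]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b["up_area"])

# The 6 km^2 sliver clears the 5 km^2 threshold, so basin 3 is picked.
picked = pick_downstream_by_up_area(basins, MIN_OVERLAP_THRESHOLD)

# Raising the threshold to 10 km^2 excludes the sliver for this watershed.
picked_strict = pick_downstream_by_up_area(basins, 10)
```

A higher threshold fixes this particular watershed, but the right value depends on how large the unintended overlap happens to be.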
The problem is that the MIN_OVERLAP_THRESHOLD thus needs to be specified for each basin to avoid this mistake for accumulated attributes. That's not really feasible, since a user would have to compare each of their basins with the HydroATLAS basins to determine if there are unintended overlaps and then set the threshold. I don't have a good solution to propose for the Caravan/GEE approach, but if it's helpful, I have a simple solution that works pretty well for my local (non-GEE) processing:
1. Compute the fraction of each HydroATLAS polygon that is within the watershed polygon.
2. Select the HydroATLAS polygon with the largest value from 1) as the first possible_basin. This ensures that we start with a basin that is definitely within the watershed polygon.
3. "Walk" downstream using the next_down attribute of the possible_basin.
4. For each "step" that is taken (i.e. each new candidate basin), check that its value from 1) is above some threshold (I suggest 0.75 based on my testing). This ensures that the new possible_basin is indeed part of the target watershed polygon.
5. Quit when the condition in 4) is not met. The last possible_basin is then the downstream-most basin from which to sample the accumulated attribute.
6. (Some handling of cases where no basins meet 2) is required.)
Here's my code that does this:
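```python
import numpy as np

# df: one row per HydroATLAS basin overlapping the watershed, with columns
# 'hybas_id', 'next_down' and 'frac_basin_cover' (fraction inside the watershed).
hybas_ids = df['hybas_id'].values
frac_cover = df['frac_basin_cover'].values

# Start from the basin with the largest fraction inside the watershed.
this_idx = np.argmax(frac_cover)
possible_basin = hybas_ids[this_idx]
while True:
    # Step to the next downstream basin.
    this_basin = df['next_down'].values[this_idx]
    matches = np.where(hybas_ids == this_basin)[0]
    if len(matches) == 0:
        break  # downstream basin does not overlap the watershed
    this_idx = matches[0]
    if frac_cover[this_idx] < 0.75:
        break  # downstream basin is mostly outside the watershed
    possible_basin = this_basin
```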
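For reuse, the same walk could be wrapped in a function. This is a sketch rather than the code above: it uses a plain dict instead of a DataFrame, the function name find_downstream_basin and the min_start_frac parameter are hypothetical, and returning None when no basin is sufficiently covered is only one possible way to handle that case.

```python
def find_downstream_basin(basins, min_frac=0.75, min_start_frac=0.75):
    """Walk downstream from the best-covered basin; return the last valid hybas_id.

    basins maps hybas_id -> {"next_down": id, "frac_basin_cover": float}.
    Returns None if no basin is covered by at least min_start_frac.
    """
    if not basins:
        return None
    # Start from the basin with the largest fraction inside the watershed.
    start = max(basins, key=lambda h: basins[h]["frac_basin_cover"])
    if basins[start]["frac_basin_cover"] < min_start_frac:
        return None
    current = start
    visited = {current}  # guards against accidental cycles in next_down
    while True:
        nxt = basins[current]["next_down"]
        if nxt not in basins or nxt in visited:
            break  # left the set of overlapping basins
        if basins[nxt]["frac_basin_cover"] < min_frac:
            break  # next basin is mostly outside the watershed
        current = nxt
        visited.add(current)
    return current

# Toy network: 10 -> 20 -> 30 -> 40, with 30 and 40 mostly outside the watershed.
toy = {
    10: {"next_down": 20, "frac_basin_cover": 1.0},
    20: {"next_down": 30, "frac_basin_cover": 0.9},
    30: {"next_down": 40, "frac_basin_cover": 0.2},
    40: {"next_down": 0, "frac_basin_cover": 0.05},
}
outlet = find_downstream_basin(toy)
```

In the toy network the walk stops at basin 20, since basin 30 falls below the 0.75 coverage threshold.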
I think something like this could be implemented in GEE. I have tested this for ~1000 watersheds across the Arctic against the Caravan implementation. Here's a snapshot of the largest discrepancies:
v_mean is my method (VotE), c_mean is Caravan. For these mean values, I have averaged the variable's value across all ~1000 watersheds. The errors are percentages, (v_mean - c_mean)/v_mean. While I can't say that all the attributes' discrepancies are due to the above issue, the four that I have looked at in detail are.
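As a sketch of this comparison, the snippet below computes the percent error of the means for one attribute from two sets of per-watershed values. The numbers are invented, and the names vote and caravan are placeholders.

```python
from statistics import mean

# Hypothetical per-watershed values of one attribute from each method.
vote = [0.0, 0.0, 12.0, 0.0]
caravan = [0.0, 0.0, 12.0, 99068.0]

def percent_error(v_values, c_values):
    """Percent difference of the means, relative to the first method's mean."""
    v_mean = mean(v_values)
    c_mean = mean(c_values)
    if v_mean == 0:
        return float("nan")  # undefined when the reference mean is zero
    return 100.0 * (v_mean - c_mean) / v_mean

err = percent_error(vote, caravan)

# Number of watersheds where the two methods agree exactly.
n_agree = sum(v == c for v, c in zip(vote, caravan))
```

Here three of the four watersheds agree exactly, yet a single mismatched watershed dominates the percent error of the means.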
Using percent errors as I have isn't a great method, since it scales with the range of the particular variable (e.g. in the above example, the range is from 0 to 99068, so the percent error is huge), but it highlights the variables that disagree between the two methods. To check this, here is a histogram of discrepancies for the rev_mc_usu attribute between VotE and Caravan:
The overwhelming majority of watersheds agree perfectly, but a non-negligible fraction have very large discrepancies. I manually checked five of these large discrepancies, and each one was due to the above issue.
Anyway, I was hoping to contribute Arctic watersheds to the Caravan collection, but I would rather use my local aggregation methods since they seem to find the downstream HydroATLAS basin more reliably. I think users of Caravan datasets should be aware of this possible issue in accumulated HydroATLAS attributes: while the issue is infrequent, it can make an enormous difference in the returned attribute value!