
gfw-raster-analysis-lambda's Introduction

GFW raster analysis in AWS Lambda

Functionality

Run zonal statistics on tree cover loss, GLAD alerts, or arbitrary contextual layers defined in our data lake.

See gfw-data-api for how to access these analyses through the API.

Query Parameters

All layers should be referred to by their standard data lake column name, e.g. umd_tree_cover_loss__year, or is__umd_regional_primary_forest_2001 for boolean layers.

See the data API for a full list of registered layers.

Parameter | Type | Description | Example
--------- | ---- | ----------- | -------
geostore_id (required) | String | A valid geostore ID containing the GeoJSON for the geometry of interest (see further specification in Assumptions and limitations). | cb64960710fd11925df3fda6a2005ca9
group_by | [String] | Rasters with categorical pixel values to group by. | umd_tree_cover_loss__year, umd_glad_alerts, tsc_tree_cover_loss_drivers__type
filters | [String] | Rasters to apply as a mask. Pixels with NoData values will be filtered out of the final result. For umd_tree_cover_density_2000/2010, you can put a threshold number in place of the data type suffix, and a filter for that threshold will be applied. | is__umd_regional_primary_forest_2001, umd_tree_cover_density_2000__30
sum | [String] | Pixel values will be summed based on intersection with group_by layers. If there are no group_by layers, all pixel values will be summed to a single number. Pixel values must be numerical. This field can also include area__ha or alert__count, which will give the pixel area or count. | area__ha, whrc_aboveground_co2_emissions__Mg
start | Date | Filters date group_by columns to this start date. Must be a year or a YYYY-MM-DD formatted date. | 2015, 2015-02-04
end | Date | Same format as start. Must come after start. | 2016, 2016-02-10

Examples

Request:

{
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[9, 4.1], [9.1, 4.1], [9.1, 4.2], [9, 4.2], [9, 4.1]]]
    },
    "group_by": ["umd_tree_cover_loss__year"],
    "filters": [
        "is__umd_regional_primary_forest_2001",
        "umd_tree_cover_density_2000__30"
    ],
    "sum": ["area__ha", "whrc_aboveground_co2_emissions__Mg"]
}

Response:

{
   "status":"success",
   "data":[
      {
         "umd_tree_cover_loss__year":2001,
         "area__ha":9.894410216509604,
         "whrc_aboveground_co2_emissions__Mg":3560.875476837158
      },
      {
         "umd_tree_cover_loss__year":2002,
         "area__ha":40.0378459923877,
         "whrc_aboveground_co2_emissions__Mg":14713.026161193848
      },
      {
         "umd_tree_cover_loss__year":2003,
         "area__ha":6.442871768889975,
         "whrc_aboveground_co2_emissions__Mg":2568.1107501983643
      },
      {
         "umd_tree_cover_loss__year":2005,
         "area__ha":3.2214358844449875,
         "whrc_aboveground_co2_emissions__Mg":1274.5636539459229
      },
      {
         "umd_tree_cover_loss__year":2006,
         "area__ha":22.01314521037408,
         "whrc_aboveground_co2_emissions__Mg":8167.388116836548
      },
      {
         "umd_tree_cover_loss__year":2007,
         "area__ha":0.23010256317464195,
         "whrc_aboveground_co2_emissions__Mg":136.68091201782227
      },
      {
         "umd_tree_cover_loss__year":2008,
         "area__ha":3.7583418651858187,
         "whrc_aboveground_co2_emissions__Mg":1579.5646076202393
      },
      {
         "umd_tree_cover_loss__year":2009,
         "area__ha":0.7670085439154732,
         "whrc_aboveground_co2_emissions__Mg":226.95782279968262
      },
      {
         "umd_tree_cover_loss__year":2010,
         "area__ha":108.37830725525636,
         "whrc_aboveground_co2_emissions__Mg":41855.43841171265
      },
      {
         "umd_tree_cover_loss__year":2011,
         "area__ha":12.88574353777995,
         "whrc_aboveground_co2_emissions__Mg":4887.8897132873535
      },
      {
         "umd_tree_cover_loss__year":2012,
         "area__ha":0.07670085439154732,
         "whrc_aboveground_co2_emissions__Mg":23.061389923095703
      },
      {
         "umd_tree_cover_loss__year":2013,
         "area__ha":1.6107179422224938,
         "whrc_aboveground_co2_emissions__Mg":601.4241733551025
      },
      {
         "umd_tree_cover_loss__year":2014,
         "area__ha":54.30420490921551,
         "whrc_aboveground_co2_emissions__Mg":22433.24832725525
      },
      {
         "umd_tree_cover_loss__year":2015,
         "area__ha":0.3068034175661893,
         "whrc_aboveground_co2_emissions__Mg":119.5254955291748
      },
      {
         "umd_tree_cover_loss__year":2016,
         "area__ha":5.752564079366049,
         "whrc_aboveground_co2_emissions__Mg":2075.9469604492188
      },
      {
         "umd_tree_cover_loss__year":2017,
         "area__ha":24.774375968469784,
         "whrc_aboveground_co2_emissions__Mg":9848.338472366333
      },
      {
         "umd_tree_cover_loss__year":2018,
         "area__ha":29.75993150392036,
         "whrc_aboveground_co2_emissions__Mg":11987.563570022583
      },
      {
         "umd_tree_cover_loss__year":2019,
         "area__ha":27.382205017782393,
         "whrc_aboveground_co2_emissions__Mg":10558.882364273071
      }
   ]
}

Endpoints

https://staging-data-api.globalforestwatch.org/analysis
https://data-api.globalforestwatch.org/analysis
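
A minimal sketch of calling the staging endpoint with the example request above. The exact path, request schema and any required authentication should be confirmed against gfw-data-api; this assumes the body can simply be POSTed as JSON.

import requests

payload = {
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[9, 4.1], [9.1, 4.1], [9.1, 4.2], [9, 4.2], [9, 4.1]]],
    },
    "group_by": ["umd_tree_cover_loss__year"],
    "sum": ["area__ha", "whrc_aboveground_co2_emissions__Mg"],
}

response = requests.post(
    "https://staging-data-api.globalforestwatch.org/analysis", json=payload
)
response.raise_for_status()
for row in response.json()["data"]:
    print(row["umd_tree_cover_loss__year"], row["area__ha"])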

Assumptions and limitations

GFW raster tiles are organized in 10 x 10 degree grids and have a pixel size of 0.00025 x 0.00025 degrees. They are saved as Cloud Optimized GeoTIFFs with 400 x 400 pixel blocks.
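
A quick back-of-envelope sketch of what that layout implies for tile and block dimensions:

# Implications of the grid layout described above.
PIXEL_SIZE = 0.00025   # degrees
TILE_SIZE_DEG = 10     # degrees
BLOCK_SIZE = 400       # pixels

pixels_per_tile_side = int(TILE_SIZE_DEG / PIXEL_SIZE)     # 40000 pixels
blocks_per_tile_side = pixels_per_tile_side // BLOCK_SIZE  # 100 blocks

print(pixels_per_tile_side, blocks_per_tile_side)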

Because we can scale up parallel processing with lambda, the size of the geometry shouldn't be an issue unless we get to massive scales (> 1 billion ha). But each lambda has an in-memory cap of 3 GB, so currently only so many rasters can be loaded into memory at once. The limit depends on the size of the raster values (e.g. binary uses far less memory than float), but a maximum of 4 or 5 raster layers is generally a good rule of thumb.
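
A rough estimate of why that rule of thumb holds; the per-pixel area here is an equatorial approximation, and in practice arrays are also duplicated at a few points during the analysis.

import numpy as np

PIXEL_AREA_HA = 0.0767             # ~27.8 m x 27.8 m pixel near the equator
aoi_ha = 5_000_000                 # 5 MHa AOI
n_pixels = aoi_ha / PIXEL_AREA_HA  # ~65 million pixels

for dtype in (np.uint8, np.float32):
    mb = n_pixels * np.dtype(dtype).itemsize / 1e6
    print(f"{np.dtype(dtype).name}: ~{mb:.0f} MB per layer")
# A handful of float layers plus intermediate copies approaches the 3 GB lambda cap.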

Deployment

Use terraform:

./scripts/infra plan
./scripts/infra apply
Runtime: Python 3.7
Handler: lambda_function.lambda_handler

Future Development

Data Lake

The GFW data lake is now in production, so this service will soon point to that instead of just test layers. Once it does, all data lake layers should be available for analysis. This currently includes:

  • aqueduct_baseline_water_stress
  • aqueduct_erosion_risk
  • birdlife_alliance_for_zero_extinction_site
  • birdlife_endemic_bird_areas
  • birdlife_key_biodiversity_area
  • bra_biomes
  • esa_land_cover_2015
  • gfw_aboveground_carbon_stock_2000
  • gfw_aboveground_carbon_stock_in_emissions_year
  • gfw_aboveground_carbon_stock_in_emissions_year__biomass_swap
  • gfw_aboveground_carbon_stock_in_emissions_year__legal_amazon_loss
  • gfw_aboveground_carbon_stock_in_emissions_year__no_primary_gain
  • gfw_aboveground_carbon_stock_in_emissions_year__us_removals
  • gfw_belowground_carbon_stock_2000
  • gfw_belowground_carbon_stock_in_emissions_year
  • gfw_deadwood_carbon_stock_2000
  • gfw_deadwood_carbon_stock_in_emissions_year
  • gfw_forest_age_category
  • gfw_gross_annual_removals_biomass
  • gfw_gross_cumul_removals_co2
  • gfw_gross_cumul_removals_co2__biomass_swap
  • gfw_gross_cumul_removals_co2__legal_amazon_loss
  • gfw_gross_cumul_removals_co2__maxgain
  • gfw_gross_cumul_removals_co2__no_primary_gain
  • gfw_gross_cumul_removals_co2__us_removals
  • gfw_gross_emissions_co2e_co2_only
  • gfw_gross_emissions_co2e_co2_only__biomass_swap
  • gfw_gross_emissions_co2e_co2_only__convert_to_grassland
  • gfw_gross_emissions_co2e_co2_only__legal_amazon_loss
  • gfw_gross_emissions_co2e_co2_only__no_primary_gain
  • gfw_gross_emissions_co2e_co2_only__no_shifting_ag
  • gfw_gross_emissions_co2e_co2_only__soil_only
  • gfw_gross_emissions_co2e_co2_only__us_removals
  • gfw_gross_emissions_co2e_non_co2
  • gfw_gross_emissions_co2e_non_co2__biomass_swap
  • gfw_gross_emissions_co2e_non_co2__convert_to_grassland
  • gfw_gross_emissions_co2e_non_co2__legal_amazon_loss
  • gfw_gross_emissions_co2e_non_co2__no_primary_gain
  • gfw_gross_emissions_co2e_non_co2__no_shifting_ag
  • gfw_gross_emissions_co2e_non_co2__soil_only
  • gfw_gross_emissions_co2e_non_co2__us_removals
  • gfw_intact_or_primary_forest_2000
  • gfw_land_rights
  • gfw_litter_carbon_stock_2000
  • gfw_litter_carbon_stock_in_emissions_year
  • gfw_managed_forests
  • gfw_mining
  • gfw_net_flux_co2e
  • gfw_net_flux_co2e__biomass_swap
  • gfw_net_flux_co2e__convert_to_grassland
  • gfw_net_flux_co2e__legal_amazon_loss
  • gfw_net_flux_co2e__maxgain
  • gfw_net_flux_co2e__no_primary_gain
  • gfw_net_flux_co2e__no_shifting_ag
  • gfw_net_flux_co2e__us_removals
  • gfw_oil_gas
  • gfw_oil_palm
  • gfw_peatlands
  • gfw_peatlands__flux
  • gfw_pixel_area
  • gfw_plantations
  • gfw_resource_rights
  • gfw_soil_carbon_stock_2000
  • gfw_soil_carbon_stock_in_emissions_year
  • gfw_soil_carbon_stock_in_emissions_year__biomass_swap
  • gfw_soil_carbon_stock_in_emissions_year__legal_amazon_loss
  • gfw_soil_carbon_stock_in_emissions_year__no_primary_gain
  • gfw_soil_carbon_stock_in_emissions_year__us_removals
  • gfw_tiger_landscapes
  • gfw_total_carbon_stock_2000
  • gfw_total_carbon_stock_in_emissions_year
  • gfw_wood_fiber
  • gmw_mangroves_1996
  • gmw_mangroves_2016
  • idn_forest_area
  • idn_forest_moratorium
  • idn_land_cover_2017
  • idn_primary_forest_2000
  • ifl_intact_forest_landscapes
  • jpl_mangrove_aboveground_biomass_stock_2000
  • jpl_tropics_abovegroundbiomass_2000
  • landmark_land_rights
  • mapbox_river_basins
  • mex_forest_zoning
  • mex_payment_ecosystem_services
  • mex_protected_areas
  • per_forest_concessions
  • per_permanent_production_forests
  • per_protected_areas
  • rspo_oil_palm
  • tnc_urban_water_intake
  • tsc_tree_cover_loss_drivers
  • umd_regional_primary_forest_2001
  • umd_tree_cover_density_2000
  • umd_tree_cover_density_2010
  • umd_tree_cover_gain
  • umd_tree_cover_loss
  • usfs_fia_regions
  • wdpa_protected_areas
  • whrc_aboveground_biomass_stock_2000
  • wwf_eco_regions

VIIRS/MODIS Alerts

These alerts are currently unsupported because we don't rasterize these layers. Instead, we store all enriched points in a document dataset. You can do on-the-fly analysis for these via SQL. (TBD: do we want to just forward that through here so there's only one endpoint?)

Aggregation

There will be a new String parameter called agg that will accept one of day | week | month | year and return results aggregated by that time interval.
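
As a rough illustration of what that aggregation amounts to (not the planned implementation; the column names here are hypothetical), rolling daily results up to years with pandas would look like:

import pandas as pd

df = pd.DataFrame({
    "umd_glad_alerts__date": ["2019-01-03", "2019-02-11", "2020-06-30"],
    "alert__count": [12, 7, 3],
})
df["umd_glad_alerts__date"] = pd.to_datetime(df["umd_glad_alerts__date"])

# agg=year would collapse daily rows into one row per year
by_year = df.groupby(df["umd_glad_alerts__date"].dt.year)["alert__count"].sum()
print(by_year)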

Whitelist

There will be a new endpoint that will return a whitelist of whether layers intersect the input geometry. Details TBD.

Misc Layers

Need to decide if/how we will support miscellaneous layers that are on GFW but not maintained in the data lake, like PRODES and Terra-i. TBD.

Lat/Lon Coordinates

A new analysis will be added to retrieve lat/lon coordinates of points (e.g. for GLAD alerts).

gfw-raster-analysis-lambda's People

Contributors

dmannarino, fmwest, jterry64, mappingvermont, mmcfarland, solomon-negusse


gfw-raster-analysis-lambda's Issues

Prototype with Step Function

Create a prototype using this lambda function with Step Functions' Map State to see if it can work as intended (calling 100s-1000s of lambdas in parallel with high performance) and to see if Step Functions could work to connect the components of this service.

It seems like the Map State feature was only released a week ago, so this will also let us see if it's ready to use at our scale yet.
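
A sketch of what such a Map state might look like in Amazon States Language, expressed as a Python dict; the lambda ARN, state names and input field names are placeholders.

import json

# Fan tiles out to the raster-analysis lambda via a Map state (placeholders throughout).
map_state = {
    "Type": "Map",
    "ItemsPath": "$.tiles",
    "MaxConcurrency": 1000,
    "Iterator": {
        "StartAt": "RasterAnalysis",
        "States": {
            "RasterAnalysis": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:raster-analysis",
                "End": True,
            }
        },
    },
    "End": True,
}
print(json.dumps({"StartAt": "AnalyzeTiles", "States": {"AnalyzeTiles": map_state}}, indent=2))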

Change to more SQL-like interface

We've discussed the idea of the API actually just accepting a SQL statement to make it consistent with other APIs. As I'm testing it on different widget scenarios, I think we might need something more like that to support all scenarios.

For example, one of the widgets just ranks certain areas by the amount of relative forest cover. This doesn't need any particular analysis layers, and rather than filtering the data, wants the area with and without the filter. Right now you'd just have to choose a dummy analysis layer, add the contextual layer you want to rank (e.g. wdpa or adm2) and apply a tcd filter, then extract this data from the summary table.

I think it'd be way cleaner to just ask for this like:
select=wdpa, (tcd > 30) as is__forest, area

This would just return a table like:
wdpa is__forest area
0 0 435.343
0 1 4435.343
1 0 2435.343
1 1 42335.343
2 0 423435.343
2 1 343435.343
...

Then the front end can aggregate on wdpa id for total area of protected area, and by wdpa id and is__forest to get tree cover.

Then, when the front end wants a widget where they get forest loss per year, and only care about areas that hit the tcd threshold and have loss, they can ask like:
select=loss, area
where=tcd_2000 > 30 and loss != NaN

So there's no special analysis layer; you just explicitly ask not to include NoData values.

We also need this for complex values like emissions per year, which is actually biomass * loss_area * co2_constant. We only have the biomass as data, and right now just have a weird hook parameter to calculate this. It'd be great to do this as:
select=loss, area, (sum(biomass) * area * co2_constant) as emissions

(although that'll need pretty much full SQL parsing)
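
As a rough illustration (not the planned implementation), the first query above could be answered from a flattened pixel table with pandas; all column names and values here are stand-ins.

import numpy as np
import pandas as pd

# Hypothetical flattened pixel table: one row per pixel.
pixels = pd.DataFrame({
    "wdpa": np.random.randint(0, 3, 1000),   # protected-area ID per pixel
    "tcd": np.random.randint(0, 100, 1000),  # tree cover density 2000
    "area": np.full(1000, 0.0767),           # per-pixel area in ha
})

# Equivalent of: select=wdpa, (tcd > 30) as is__forest, area
pixels["is__forest"] = pixels["tcd"] > 30
table = pixels.groupby(["wdpa", "is__forest"], as_index=False)["area"].sum()
print(table)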

Read all rasters in parallel upfront

After some initial performance tests with XRay, I can see that, unsurprisingly, IO consumes the most time. Currently all the rasters are being read one at a time. One easy optimization we could do is to read all the input rasters in parallel upfront.

A simple experiment on a 1 MHa geometry using the threading module showed it works pretty well. Before, each read took ~2 seconds, one after another. Multithreaded, as expected, it takes ~2 seconds total to read all at the same time.
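
A sketch of that threading approach with rasterio and ThreadPoolExecutor; the raster URIs and window bounds below are illustrative.

from concurrent.futures import ThreadPoolExecutor

import rasterio
from rasterio.windows import from_bounds

def read_window(uri, bounds):
    # Read one band of the window covering `bounds` from a single raster.
    with rasterio.open(uri) as src:
        window = from_bounds(*bounds, transform=src.transform)
        return src.read(1, window=window)

uris = [
    "s3://example-bucket/umd_tree_cover_loss/10N_000E.tif",          # placeholder paths
    "s3://example-bucket/umd_tree_cover_density_2000/10N_000E.tif",
]
bounds = (9.0, 4.1, 9.1, 4.2)  # (left, bottom, right, top)

# The reads are I/O-bound, so threads are enough to overlap the S3 latency.
with ThreadPoolExecutor(max_workers=len(uris)) as pool:
    arrays = list(pool.map(lambda uri: read_window(uri, bounds), uris))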

Change tiled-analysis lambda to use datacube tiling analysis

Currently we analyze large geometries by tiling them and running all tiles in parallel. Need to implement the datacube algorithm to break geometries apart into the least number of tiles, fetch those as precomputed results from the summary statistics data store, and run the edge tiles as raster-analysis lambda functions.

Investigate area calculation inaccuracies

After doing some more fine-grained testing, I've found that using only the mean pixel area for area calculation is sometimes too inaccurate.

Usually it evens out across the entire AOI, but for more granular calculations that might only look at the northern or southern end of the AOI, it can be off. For example, if you want to get loss in a specific protected area within the AOI, but the protected area only overlaps the top of the AOI.

Doing the full calculation with the area for each pixel latitude was too performance-intensive last time. Going to experiment with some faster ways to get a close approximation.

Add serialize decorator to lambda_handler

There should be a @serialize decorator for the lambda handler to handle serialization of the HTTP response. The function itself should either return a dictionary with the response or throw an exception. We might need custom error types for that.

The rest should be handled in the wrapper function.

Something along the lines of this:

import json
from functools import wraps


def serialize(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            body = func(*args, **kwargs)
            result = {"statusCode": 200,
                      "headers": {"Content-Type": "application/json"},
                      "body": json.dumps(body)}
        except (CustomError1, CustomError2) as e:
            # e.g. map custom errors to an appropriate client error response
            result = {"statusCode": 400,
                      "headers": {"Content-Type": "application/json"},
                      "body": json.dumps({"message": str(e)})}
        except Exception as e:
            result = {"statusCode": 500,
                      "headers": {"Content-Type": "application/json"},
                      "body": json.dumps({"message": str(e)})}
        return result

    return wrapper

and then

@serialize
def lambda_handler(event, context):
    ....

Test use of Pandas dataframes

We currently use typed numpy arrays for the calculations. It works fine but is a bit clunky. Pandas dataframes would be way more elegant. However, there are some limitations with lambda functions.

  • Pandas adds an additional ~20 MB to the package, and the file size of the ZIP package would exceed 50 MB.

I am not sure how strict the 50 MB limit is. We could try to add Pandas to the ZIP package and see if we can deploy it. If yes, we could refactor the code and see if it performs as well as or better than the current implementation.
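
For reference, a sketch of what the pandas-based aggregation could look like once the masked windows are flattened into columns; the data here is random stand-in values, and the column names follow the README examples.

import numpy as np
import pandas as pd

# Flatten masked raster windows into columns and let pandas do the group-by.
n = 10_000
df = pd.DataFrame({
    "umd_tree_cover_loss__year": np.random.randint(2001, 2020, n),
    "whrc_aboveground_co2_emissions__Mg": np.random.rand(n) * 5,
    "area__ha": np.full(n, 0.0767),
})

result = df.groupby("umd_tree_cover_loss__year", as_index=False)[
    ["area__ha", "whrc_aboveground_co2_emissions__Mg"]
].sum()
print(result.to_dict(orient="records"))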

Remove dependency on 10x10 tile grids

Currently these analyses depend on all the input rasters being aligned on the 10x10 degree grids. We should remove that dependency and allow for retrieving tiles at different zoom levels.

Two solution ideas to investigate:

  1. Using VRTs so input rasters don't need to be preprocessed into block alignment
  2. Forcing client to have rasters aligned to our own tile schema that work on different zoom levels

VRTs would allow much more flexibility, but we need to benchmark whether the on-read warping it does meets our performance requirements.
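
A sketch of option 1 using rasterio's WarpedVRT, which warps a non-aligned source onto a target grid at read time; the source path and target grid values are illustrative.

import rasterio
from rasterio.enums import Resampling
from rasterio.transform import from_origin
from rasterio.vrt import WarpedVRT

# Target grid: upper-left corner at (9.0, 4.2), 0.00025 degree pixels (illustrative).
dst_transform = from_origin(9.0, 4.2, 0.00025, 0.00025)

with rasterio.open("s3://example-bucket/unaligned_layer.tif") as src:
    with WarpedVRT(
        src,
        crs="EPSG:4326",
        transform=dst_transform,
        width=400,
        height=400,
        resampling=Resampling.nearest,
    ) as vrt:
        data = vrt.read(1)  # warped onto our grid as it is read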

Refactor API to accept generic filter/threshold parameter

Instead of accepting a threshold value just for the TCD dataset, the API should generically accept filter layer IDs and corresponding threshold values. The returned extent should be the total filtered extent. (Need to confirm whether separate TCD 2000 and 2010 extents are necessary in the same request.)

Compare performance and accuracy for using mean area and pixel area

When used for larger tiles, the mean area approach might reduce the accuracy of the results quite a bit. In this case it would be better to use the actual pixel area.

The link below shows an example of how you would do that:
https://github.com/wri/raster2points/blob/master/raster2points/raster2points.py#L208-L241

The input array would already be fully masked. You would only need to calculate the lat coordinate and return the area.

This will return a float32 array. If memory becomes an issue you could try to cast it to float16 or uint and see if it significantly affects accuracy. The function calculates area in m2 and we will later convert it to ha, so dropping some digits might not have too much of an effect.
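
A sketch of that per-row area calculation under a spherical-earth approximation; since the grid is geographic, area only varies with latitude, so one value per row is enough. The linked implementation may use slightly different constants.

import numpy as np

R = 6378137.0          # earth radius in meters (WGS84 semi-major axis)
PIXEL_SIZE = 0.00025   # degrees

def pixel_area_m2(lat_top):
    """Area of one pixel whose top edge is at lat_top (degrees), spherical approximation."""
    dlon = np.radians(PIXEL_SIZE)
    lat1, lat2 = np.radians(lat_top - PIXEL_SIZE), np.radians(lat_top)
    return (R ** 2) * dlon * (np.sin(lat2) - np.sin(lat1))

# One area value per row of a 400-pixel window whose top edge is at latitude 4.2
row_tops = 4.2 - np.arange(400) * PIXEL_SIZE
areas_ha = pixel_area_m2(row_tops).astype(np.float32) / 10_000  # m2 -> ha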

Performance Testing

Ticket just to keep track of results from performance testing. AWS XRay is now configured, and thoughts from a few initial tests:

  • S3 reads are the biggest bottleneck and are proportional to the size of the data in the window. Currently changing to do all reads in parallel, but we may also consider encoding multiple commonly used rasters into one.
  • S3 reads are significantly faster the second time. I'm curious if this is just for the section of the tile that was read, or if reading anywhere from the tile gets faster the more you read from it (that would have great performance implications for the full datacube algorithm).
  • Running on a 5 MHa area with 3 rasters hit the memory limits, so this might need to run on smaller areas than we thought. The code duplicates the data at a few points, so reducing that could get us larger areas (or if there's a way to force Python to release the memory at each stage?).
  • The pandas aggregation does actually take a significant amount of time for large areas, on the same scale as each S3 read. We can see what possibilities there are for speeding that up.

Add Github Actions for Automatic Testing/Code Coverage

Is your feature request related to a problem? Please describe.
No automated checks when merging from feature->develop or develop->master.

Describe the solution you'd like
Add GitHub Actions to automate running tests and code coverage when merging pull requests (copying the approach from wri/gfw_pixetl).

Potential rounding errors with window size

The window size calculated from the geometry bounds can be slightly off for different data sources if their VRTs aren't exactly aligned. This can cause the windows to be off by one pixel from each other, causing errors when trying to analyze them together.

Saw this issue because the dev data lake has data sources with only a few tiles, so the VRTs are offset from each other. Not sure if this will be an actual issue once all the VRTs encompass the global extent, but it's something to watch out for.
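
A sketch of where the off-by-one comes from and one way to guard against it: from_bounds() yields fractional offsets when grids aren't aligned, so rounding offsets and lengths the same way for every source keeps all windows the same shape. The URI handling here is illustrative.

import rasterio
from rasterio.windows import Window, from_bounds

bounds = (9.0, 4.1, 9.1, 4.2)  # (left, bottom, right, top)

def aligned_window(uri, bounds):
    with rasterio.open(uri) as src:
        w = from_bounds(*bounds, transform=src.transform)
        # Round identically for every source so all windows share a shape.
        return Window(round(w.col_off), round(w.row_off), round(w.width), round(w.height))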

Use new data lake URL format for raster URLs

We are still using a test URL for raster URLs; this should change to the new data lake URL format. It should also use an environment variable to determine whether to use the dev, staging or production data lake.

Clarify how primary raster works

From the perspective of the API, the input rasters are just a list and all treated the same. However, the code currently assumes the first raster ID references one of our primary data products (e.g. GLAD alerts, forest loss), and all the ones after are just contextual.

We should make this clear in the API if we want to treat the primary raster differently. Right now the only real difference is that a missing or empty primary raster source gives a completely empty result, whereas the code just ignores missing contextual sources and still does the rest of the analysis; otherwise the first raster is just arbitrarily chosen to create the geometry mask. So we could possibly treat all rasters the same and leave responsibility for raster source issues on the client.

Introduce basic logging

Log to stdout; we will later collect logs using CloudWatch.

Use logging.debug for individual steps and logging.info only for main events so that we don't create too much noise. Use logging.error/logging.warning for exceptions.
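
A minimal sketch of that setup, assuming the Lambda runtime forwards stdout/stderr to CloudWatch as usual:

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # switch to DEBUG locally if needed

def lambda_handler(event, context):
    logger.info("Starting analysis for geostore %s", event.get("geostore_id"))
    try:
        ...  # individual steps use logger.debug
    except Exception:
        logger.exception("Analysis failed")
        raise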

Expand unit tests

Currently there are only unit tests with mocked data. We should create some end-to-end black-box tests pointing to actual datasets and using sufficiently complex geometries. These can be run before committing.

Expand unit tests

Right now there are only a few basic unit tests; we should expand them to increase confidence.
