
gfw-raster-analysis-lambda's Introduction

GFW raster analysis in AWS Lambda

Functionality

Run zonal statistics on tree cover loss, GLAD alerts, or arbitrary contextual layers defined in our data lake.

See gfw-data-api for how to access these analyses through the API.

Query Parameters

All layers should be referred to by their standard data lake column name, e.g. umd_tree_cover_loss__year, or is__umd_regional_primary_forest_2001 for boolean layers.

See the data API for a full list of registered layers.

Parameter | Type | Description | Example
--------- | ---- | ----------- | -------
geostore_id (required) | String | A valid geostore ID containing the GeoJSON for the geometry of interest (see further specification in Assumptions and limitations). | cb64960710fd11925df3fda6a2005ca9
group_by | [String] | Rasters with categorical pixel values to group by. | umd_tree_cover_loss__year, umd_glad_alerts, tsc_tree_cover_loss_drivers__type
filters | [String] | Rasters to apply as a mask. Pixels with NoData values will be filtered out of the final result. For umd_tree_cover_density_2000/2010, you can put a threshold number in place of the data type suffix, and a filter for that threshold will be applied. | is__umd_regional_primary_forest_2001, umd_tree_cover_density_2000__30
sum | [String] | Pixel values will be summed based on intersection with group_by layers. If there are no group_by layers, all pixel values will be summed to a single number. Pixel values must be numerical. This field can also include area__ha or alert__count, which will give the pixel area or count. | area__ha, whrc_aboveground_co2_emissions__Mg
start | Date | Filters date group_by columns to this start date. Must be a year or a YYYY-MM-DD formatted date. | 2015, 2015-02-04
end | Date | Same format as start. Must come after start. | 2016, 2016-02-10

Examples

Request:

{
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[9, 4.1], [9.1, 4.1], [9.1, 4.2], [9, 4.2], [9, 4.1]]]
    },
    "group_by": ["umd_tree_cover_loss__year"],
    "filters": [
        "is__umd_regional_primary_forest_2001",
        "umd_tree_cover_density_2000__30"
    ],
    "sum": ["area__ha", "whrc_aboveground_co2_emissions__Mg"]
}

Response:

{
   "status":"success",
   "data":[
      {
         "umd_tree_cover_loss__year":2001,
         "area__ha":9.894410216509604,
         "whrc_aboveground_co2_emissions__Mg":3560.875476837158
      },
      {
         "umd_tree_cover_loss__year":2002,
         "area__ha":40.0378459923877,
         "whrc_aboveground_co2_emissions__Mg":14713.026161193848
      },
      {
         "umd_tree_cover_loss__year":2003,
         "area__ha":6.442871768889975,
         "whrc_aboveground_co2_emissions__Mg":2568.1107501983643
      },
      {
         "umd_tree_cover_loss__year":2005,
         "area__ha":3.2214358844449875,
         "whrc_aboveground_co2_emissions__Mg":1274.5636539459229
      },
      {
         "umd_tree_cover_loss__year":2006,
         "area__ha":22.01314521037408,
         "whrc_aboveground_co2_emissions__Mg":8167.388116836548
      },
      {
         "umd_tree_cover_loss__year":2007,
         "area__ha":0.23010256317464195,
         "whrc_aboveground_co2_emissions__Mg":136.68091201782227
      },
      {
         "umd_tree_cover_loss__year":2008,
         "area__ha":3.7583418651858187,
         "whrc_aboveground_co2_emissions__Mg":1579.5646076202393
      },
      {
         "umd_tree_cover_loss__year":2009,
         "area__ha":0.7670085439154732,
         "whrc_aboveground_co2_emissions__Mg":226.95782279968262
      },
      {
         "umd_tree_cover_loss__year":2010,
         "area__ha":108.37830725525636,
         "whrc_aboveground_co2_emissions__Mg":41855.43841171265
      },
      {
         "umd_tree_cover_loss__year":2011,
         "area__ha":12.88574353777995,
         "whrc_aboveground_co2_emissions__Mg":4887.8897132873535
      },
      {
         "umd_tree_cover_loss__year":2012,
         "area__ha":0.07670085439154732,
         "whrc_aboveground_co2_emissions__Mg":23.061389923095703
      },
      {
         "umd_tree_cover_loss__year":2013,
         "area__ha":1.6107179422224938,
         "whrc_aboveground_co2_emissions__Mg":601.4241733551025
      },
      {
         "umd_tree_cover_loss__year":2014,
         "area__ha":54.30420490921551,
         "whrc_aboveground_co2_emissions__Mg":22433.24832725525
      },
      {
         "umd_tree_cover_loss__year":2015,
         "area__ha":0.3068034175661893,
         "whrc_aboveground_co2_emissions__Mg":119.5254955291748
      },
      {
         "umd_tree_cover_loss__year":2016,
         "area__ha":5.752564079366049,
         "whrc_aboveground_co2_emissions__Mg":2075.9469604492188
      },
      {
         "umd_tree_cover_loss__year":2017,
         "area__ha":24.774375968469784,
         "whrc_aboveground_co2_emissions__Mg":9848.338472366333
      },
      {
         "umd_tree_cover_loss__year":2018,
         "area__ha":29.75993150392036,
         "whrc_aboveground_co2_emissions__Mg":11987.563570022583
      },
      {
         "umd_tree_cover_loss__year":2019,
         "area__ha":27.382205017782393,
         "whrc_aboveground_co2_emissions__Mg":10558.882364273071
      }
   ]
}

Endpoints

https://staging-data-api.globalforestwatch.org/analysis
https://data-api.globalforestwatch.org/analysis
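
A minimal sketch of calling the staging endpoint with the example request above. The exact path, request schema and any required authentication should be confirmed against gfw-data-api; this assumes the body can simply be POSTed as JSON.

import requests

payload = {
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[9, 4.1], [9.1, 4.1], [9.1, 4.2], [9, 4.2], [9, 4.1]]],
    },
    "group_by": ["umd_tree_cover_loss__year"],
    "sum": ["area__ha", "whrc_aboveground_co2_emissions__Mg"],
}

response = requests.post(
    "https://staging-data-api.globalforestwatch.org/analysis", json=payload
)
response.raise_for_status()
for row in response.json()["data"]:
    print(row["umd_tree_cover_loss__year"], row["area__ha"])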

Assumptions and limitations

GFW raster tiles are organized in 10 x 10 degree grids and have a pixel size of 0.00025 x 0.00025 degrees. They are saved as Cloud Optimized GeoTIFFs with 400 x 400 pixel blocks.
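
A quick back-of-envelope sketch of what that layout implies for tile and block dimensions:

# Implications of the grid layout described above.
PIXEL_SIZE = 0.00025   # degrees
TILE_SIZE_DEG = 10     # degrees
BLOCK_SIZE = 400       # pixels

pixels_per_tile_side = int(TILE_SIZE_DEG / PIXEL_SIZE)     # 40000 pixels
blocks_per_tile_side = pixels_per_tile_side // BLOCK_SIZE  # 100 blocks

print(pixels_per_tile_side, blocks_per_tile_side)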

Because we can scale up parallel processing with lambda, the size of the geometry shouldn't be an issue unless we get to massive scales (> 1 billion ha). But each lambda has an in-memory cap of 3 GB, so currently only so many rasters can be loaded into memory at once. The limit depends on the size of the raster values (e.g. binary uses far less memory than float), but a maximum of 4 or 5 raster layers is generally a good rule of thumb.
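
A rough estimate of why that rule of thumb holds; the per-pixel area here is an equatorial approximation, and in practice arrays are also duplicated at a few points during the analysis.

import numpy as np

PIXEL_AREA_HA = 0.0767             # ~27.8 m x 27.8 m pixel near the equator
aoi_ha = 5_000_000                 # 5 MHa AOI
n_pixels = aoi_ha / PIXEL_AREA_HA  # ~65 million pixels

for dtype in (np.uint8, np.float32):
    mb = n_pixels * np.dtype(dtype).itemsize / 1e6
    print(f"{np.dtype(dtype).name}: ~{mb:.0f} MB per layer")
# A handful of float layers plus intermediate copies approaches the 3 GB lambda cap.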

Deployment

Use terraform:

./scripts/infra plan
./scripts/infra apply
Runtime: Python 3.7
Handler: lambda_function.lambda_handler

Future Development

Data Lake

The GFW data lake is now in production, so this service will soon point to that instead of just test layers. Once it does, all data lake layers should be available for analysis. This currently includes:

  • aqueduct_baseline_water_stress
  • aqueduct_erosion_risk
  • birdlife_alliance_for_zero_extinction_site
  • birdlife_endemic_bird_areas
  • birdlife_key_biodiversity_area
  • bra_biomes
  • esa_land_cover_2015
  • gfw_aboveground_carbon_stock_2000
  • gfw_aboveground_carbon_stock_in_emissions_year
  • gfw_aboveground_carbon_stock_in_emissions_year__biomass_swap
  • gfw_aboveground_carbon_stock_in_emissions_year__legal_amazon_loss
  • gfw_aboveground_carbon_stock_in_emissions_year__no_primary_gain
  • gfw_aboveground_carbon_stock_in_emissions_year__us_removals
  • gfw_belowground_carbon_stock_2000
  • gfw_belowground_carbon_stock_in_emissions_year
  • gfw_deadwood_carbon_stock_2000
  • gfw_deadwood_carbon_stock_in_emissions_year
  • gfw_forest_age_category
  • gfw_gross_annual_removals_biomass
  • gfw_gross_cumul_removals_co2
  • gfw_gross_cumul_removals_co2__biomass_swap
  • gfw_gross_cumul_removals_co2__legal_amazon_loss
  • gfw_gross_cumul_removals_co2__maxgain
  • gfw_gross_cumul_removals_co2__no_primary_gain
  • gfw_gross_cumul_removals_co2__us_removals
  • gfw_gross_emissions_co2e_co2_only
  • gfw_gross_emissions_co2e_co2_only__biomass_swap
  • gfw_gross_emissions_co2e_co2_only__convert_to_grassland
  • gfw_gross_emissions_co2e_co2_only__legal_amazon_loss
  • gfw_gross_emissions_co2e_co2_only__no_primary_gain
  • gfw_gross_emissions_co2e_co2_only__no_shifting_ag
  • gfw_gross_emissions_co2e_co2_only__soil_only
  • gfw_gross_emissions_co2e_co2_only__us_removals
  • gfw_gross_emissions_co2e_non_co2
  • gfw_gross_emissions_co2e_non_co2__biomass_swap
  • gfw_gross_emissions_co2e_non_co2__convert_to_grassland
  • gfw_gross_emissions_co2e_non_co2__legal_amazon_loss
  • gfw_gross_emissions_co2e_non_co2__no_primary_gain
  • gfw_gross_emissions_co2e_non_co2__no_shifting_ag
  • gfw_gross_emissions_co2e_non_co2__soil_only
  • gfw_gross_emissions_co2e_non_co2__us_removals
  • gfw_intact_or_primary_forest_2000
  • gfw_land_rights
  • gfw_litter_carbon_stock_2000
  • gfw_litter_carbon_stock_in_emissions_year
  • gfw_managed_forests
  • gfw_mining
  • gfw_net_flux_co2e
  • gfw_net_flux_co2e__biomass_swap
  • gfw_net_flux_co2e__convert_to_grassland
  • gfw_net_flux_co2e__legal_amazon_loss
  • gfw_net_flux_co2e__maxgain
  • gfw_net_flux_co2e__no_primary_gain
  • gfw_net_flux_co2e__no_shifting_ag
  • gfw_net_flux_co2e__us_removals
  • gfw_oil_gas
  • gfw_oil_palm
  • gfw_peatlands
  • gfw_peatlands__flux
  • gfw_pixel_area
  • gfw_plantations
  • gfw_resource_rights
  • gfw_soil_carbon_stock_2000
  • gfw_soil_carbon_stock_in_emissions_year
  • gfw_soil_carbon_stock_in_emissions_year__biomass_swap
  • gfw_soil_carbon_stock_in_emissions_year__legal_amazon_loss
  • gfw_soil_carbon_stock_in_emissions_year__no_primary_gain
  • gfw_soil_carbon_stock_in_emissions_year__us_removals
  • gfw_tiger_landscapes
  • gfw_total_carbon_stock_2000
  • gfw_total_carbon_stock_in_emissions_year
  • gfw_wood_fiber
  • gmw_mangroves_1996
  • gmw_mangroves_2016
  • idn_forest_area
  • idn_forest_moratorium
  • idn_land_cover_2017
  • idn_primary_forest_2000
  • ifl_intact_forest_landscapes
  • jpl_mangrove_aboveground_biomass_stock_2000
  • jpl_tropics_abovegroundbiomass_2000
  • landmark_land_rights
  • mapbox_river_basins
  • mex_forest_zoning
  • mex_payment_ecosystem_services
  • mex_protected_areas
  • per_forest_concessions
  • per_permanent_production_forests
  • per_protected_areas
  • rspo_oil_palm
  • tnc_urban_water_intake
  • tsc_tree_cover_loss_drivers
  • umd_regional_primary_forest_2001
  • umd_tree_cover_density_2000
  • umd_tree_cover_density_2010
  • umd_tree_cover_gain
  • umd_tree_cover_loss
  • usfs_fia_regions
  • wdpa_protected_areas
  • whrc_aboveground_biomass_stock_2000
  • wwf_eco_regions

VIIRS/MODIS Alerts

These alerts are currently unsupported because we don't rasterize these layers. Instead, we store all enriched points in a document dataset. You can do on-the-fly analysis for these via SQL. (TBD: do we want to just forward that through here so there's only one endpoint?)

Aggregation

There will be a new String parameter called agg that will accept one of day | week | month | year and return results aggregated by that time interval.
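
As a rough illustration of what that aggregation amounts to (not the planned implementation; the column names here are hypothetical), rolling daily results up to years with pandas would look like:

import pandas as pd

df = pd.DataFrame({
    "umd_glad_alerts__date": ["2019-01-03", "2019-02-11", "2020-06-30"],
    "alert__count": [12, 7, 3],
})
df["umd_glad_alerts__date"] = pd.to_datetime(df["umd_glad_alerts__date"])

# agg=year would collapse daily rows into one row per year
by_year = df.groupby(df["umd_glad_alerts__date"].dt.year)["alert__count"].sum()
print(by_year)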

Whitelist

There will be a new endpoint that will return a whitelist of whether layers intersect the input geometry. Details TBD.

Misc Layers

Need to decide if/how we will support miscellaneous layers that are on GFW but not maintained in the data lake, like PRODES and Terra-i. TBD.

Lat/Lon Coordinates

A new analysis will be added to retrieve lat/lon coordinates of points (e.g. for GLAD alerts).

gfw-raster-analysis-lambda's People

Contributors

dmannarino, fmwest, jterry64, mappingvermont, mmcfarland, solomon-negusse


gfw-raster-analysis-lambda's Issues

Prototype with Step Function

Create a prototype using this lambda function with Step Functions' Map State to see if it can work as intended (calling 100s-1000s of lambdas in parallel with high performance) and to see if Step Functions could work to connect the components of this service.

It seems like the Map State feature was only released a week ago, so this will also let us see if it's ready to use at our scale yet.
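
A sketch of what such a Map state might look like in Amazon States Language, expressed as a Python dict; the lambda ARN, state names and input field names are placeholders.

import json

# Fan tiles out to the raster-analysis lambda via a Map state (placeholders throughout).
map_state = {
    "Type": "Map",
    "ItemsPath": "$.tiles",
    "MaxConcurrency": 1000,
    "Iterator": {
        "StartAt": "RasterAnalysis",
        "States": {
            "RasterAnalysis": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:raster-analysis",
                "End": True,
            }
        },
    },
    "End": True,
}
print(json.dumps({"StartAt": "AnalyzeTiles", "States": {"AnalyzeTiles": map_state}}, indent=2))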

Change to more SQL-like interface

We've discussed the idea of the API actually just accepting a SQL statement to make it consistent with other APIs. As I'm testing it on different widget scenarios, I think we might need something more like that to support all scenarios.

For example, one of the widgets just ranks certain areas by the amount of relative forest cover. This doesn't need any particular analysis layers, and rather than filtering the data, wants the area with and without the filter. Right now you'd just have to choose a dummy analysis layer, add the contextual layer you want to rank (e.g. wdpa or adm2) and apply a tcd filter, then extract this data from the summary table.

I think it'd be way cleaner to just ask for this like:
select=wdpa, (tcd > 30) as is__forest, area

This would just return a table like:
wdpa is__forest area
0 0 435.343
0 1 4435.343
1 0 2435.343
1 1 42335.343
2 0 423435.343
2 1 343435.343
...

Then the front end can aggregate on wdpa id for total area of protected area, and by wdpa id and is__forest to get tree cover.

Then, when the front end wants a widget where they get forest loss per year, and only care about areas that hit the tcd threshold and have loss, they can ask like:
select=loss, area
where=tcd_2000 > 30 and loss != NaN

So there's no special analysis layer; you just explicitly ask not to include NoData values.

We also need this for complex values like emissions per year, which is actually biomass * loss_area * co2_constant. We only have the biomass as data, and right now just have a weird hook parameter to calculate this. It'd be great to do this as:
select=loss, area, (sum(biomass) * area * co2_constant) as emissions

(although that'll need pretty much full SQL parsing)
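
As a rough illustration (not the planned implementation), the first query above could be answered from a flattened pixel table with pandas; all column names and values here are stand-ins.

import numpy as np
import pandas as pd

# Hypothetical flattened pixel table: one row per pixel.
pixels = pd.DataFrame({
    "wdpa": np.random.randint(0, 3, 1000),   # protected-area ID per pixel
    "tcd": np.random.randint(0, 100, 1000),  # tree cover density 2000
    "area": np.full(1000, 0.0767),           # per-pixel area in ha
})

# Equivalent of: select=wdpa, (tcd > 30) as is__forest, area
pixels["is__forest"] = pixels["tcd"] > 30
table = pixels.groupby(["wdpa", "is__forest"], as_index=False)["area"].sum()
print(table)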

Read all rasters in parallel upfront

After some initial performance tests with XRay, I can see that, unsurprisingly, IO consumes the most time. Currently all the rasters are being read one at a time. One easy optimization we could do is to read all the input rasters in parallel upfront.

A simple experiment on a 1 MHa geometry using the threading module showed it works pretty well. Before, each read took ~2 seconds, one after another. Multithreaded, as expected, it takes ~2 seconds total to read all at the same time.
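
A sketch of that threading approach with rasterio and ThreadPoolExecutor; the raster URIs and window bounds below are illustrative.

from concurrent.futures import ThreadPoolExecutor

import rasterio
from rasterio.windows import from_bounds

def read_window(uri, bounds):
    # Read one band of the window covering `bounds` from a single raster.
    with rasterio.open(uri) as src:
        window = from_bounds(*bounds, transform=src.transform)
        return src.read(1, window=window)

uris = [
    "s3://example-bucket/umd_tree_cover_loss/10N_000E.tif",          # placeholder paths
    "s3://example-bucket/umd_tree_cover_density_2000/10N_000E.tif",
]
bounds = (9.0, 4.1, 9.1, 4.2)  # (left, bottom, right, top)

# The reads are I/O-bound, so threads are enough to overlap the S3 latency.
with ThreadPoolExecutor(max_workers=len(uris)) as pool:
    arrays = list(pool.map(lambda uri: read_window(uri, bounds), uris))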

Change tiled-analysis lambda to use datacube tiling analysis

Currently we analyze large geometries by tiling them and running all tiles in parallel. Need to implement the datacube algorithm to break geometries apart into the least number of tiles, fetch those as precomputed results from the summary statistics data store, and run the edge tiles as raster-analysis lambda functions.

Investigate area calculation inaccuracies

After doing some more fine-grained testing, I've found that using only the mean pixel area for area calculation is sometimes too inaccurate.

Usually it evens out across the entire AOI, but for more granular calculations that might only look at the northern or southern end of the AOI, it can be off. For example, if you want to get loss in a specific protected area within the AOI, but the protected area only overlaps the top of the AOI.

Doing the full calculation with the area for each pixel latitude was too performance-intensive last time. Going to experiment with some faster ways to get a close approximation.

Add serialize decorator to lambda_handler

There should be a @serialize decorator for the lambda handler to handle serialization of the HTTP response. The function itself should either return a dictionary with the response or throw an exception. We might need custom error types for that.

The rest should be handled in the wrapper function.

Something along the lines of this:

import json
from functools import wraps


def serialize(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            body = func(*args, **kwargs)
            result = {"statusCode": 200,
                      "headers": {"Content-Type": "application/json"},
                      "body": json.dumps(body)}
        except (CustomError1, CustomError2) as e:
            # e.g. map custom errors to an appropriate client error response
            result = {"statusCode": 400,
                      "headers": {"Content-Type": "application/json"},
                      "body": json.dumps({"message": str(e)})}
        except Exception as e:
            result = {"statusCode": 500,
                      "headers": {"Content-Type": "application/json"},
                      "body": json.dumps({"message": str(e)})}
        return result

    return wrapper

and then

@serialize
def lambda_handler(event, context):
    ....

Test use of Pandas dataframes

We currently use typed numpy arrays for the calculations. It works fine but is a bit clunky. Pandas dataframes would be way more elegant. However, there are some limitations with lambda functions.

  • Pandas adds an additional ~20 MB to the package, and the file size of the ZIP package would exceed 50 MB.

I am not sure how strict the 50 MB limit is. We could try to add Pandas to the ZIP package and see if we can deploy it. If yes, we could refactor the code and see if it performs as well as or better than the current implementation.
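
For reference, a sketch of what the pandas-based aggregation could look like once the masked windows are flattened into columns; the data here is random stand-in values, and the column names follow the README examples.

import numpy as np
import pandas as pd

# Flatten masked raster windows into columns and let pandas do the group-by.
n = 10_000
df = pd.DataFrame({
    "umd_tree_cover_loss__year": np.random.randint(2001, 2020, n),
    "whrc_aboveground_co2_emissions__Mg": np.random.rand(n) * 5,
    "area__ha": np.full(n, 0.0767),
})

result = df.groupby("umd_tree_cover_loss__year", as_index=False)[
    ["area__ha", "whrc_aboveground_co2_emissions__Mg"]
].sum()
print(result.to_dict(orient="records"))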

Remove dependency on 10x10 tile grids

Currently these analyses depend on all the input rasters being aligned on the 10x10 degree grids. We should remove that dependency and allow for retrieving tiles at different zoom levels.

Two solution ideas to investigate:

  1. Using VRTs so input rasters don't need to be preprocessed into block alignment
  2. Forcing client to have rasters aligned to our own tile schema that work on different zoom levels

VRTs would allow much more flexibility, but we need to benchmark whether the on-read warping it does meets our performance requirements.
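
A sketch of option 1 using rasterio's WarpedVRT, which warps a non-aligned source onto a target grid at read time; the source path and target grid values are illustrative.

import rasterio
from rasterio.enums import Resampling
from rasterio.transform import from_origin
from rasterio.vrt import WarpedVRT

# Target grid: upper-left corner at (9.0, 4.2), 0.00025 degree pixels (illustrative).
dst_transform = from_origin(9.0, 4.2, 0.00025, 0.00025)

with rasterio.open("s3://example-bucket/unaligned_layer.tif") as src:
    with WarpedVRT(
        src,
        crs="EPSG:4326",
        transform=dst_transform,
        width=400,
        height=400,
        resampling=Resampling.nearest,
    ) as vrt:
        data = vrt.read(1)  # warped onto our grid as it is read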

Refactor API to accept generic filter/threshold parameter

Instead of accepting a threshold value just for the TCD dataset, the API should generically accept filter layer IDs and corresponding threshold values. The returned extent should be the total filtered extent. (Need to confirm whether separate TCD 2000 and 2010 extents are necessary in the same request.)

Compare performance and accuracy for using mean area and pixel area

When used for larger tiles, the mean area approach might reduce the accuracy of the results quite a bit. In this case it would be better to use the actual pixel area.

The link below shows an example of how you would do that:
https://github.com/wri/raster2points/blob/master/raster2points/raster2points.py#L208-L241

The input array would already be fully masked. You would only need to calculate the lat coordinate and return the area.

This will return a float32 array. If memory becomes an issue you could try to cast it to float16 or uint and see if it significantly affects accuracy. The function calculates area in m2 and we will later convert it to ha, so dropping some digits might not have too much of an effect.
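
A sketch of that per-row area calculation under a spherical-earth approximation; since the grid is geographic, area only varies with latitude, so one value per row is enough. The linked implementation may use slightly different constants.

import numpy as np

R = 6378137.0          # earth radius in meters (WGS84 semi-major axis)
PIXEL_SIZE = 0.00025   # degrees

def pixel_area_m2(lat_top):
    """Area of one pixel whose top edge is at lat_top (degrees), spherical approximation."""
    dlon = np.radians(PIXEL_SIZE)
    lat1, lat2 = np.radians(lat_top - PIXEL_SIZE), np.radians(lat_top)
    return (R ** 2) * dlon * (np.sin(lat2) - np.sin(lat1))

# One area value per row of a 400-pixel window whose top edge is at latitude 4.2
row_tops = 4.2 - np.arange(400) * PIXEL_SIZE
areas_ha = pixel_area_m2(row_tops).astype(np.float32) / 10_000  # m2 -> ha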

Performance Testing

Ticket just to keep track of results from performance testing. AWS XRay is now configured, and thoughts from a few initial tests:

  • S3 reads are the biggest bottleneck and are proportional to the size of the data in the window. Currently changing to do all reads in parallel, but we may also consider encoding multiple commonly used rasters into one.
  • S3 reads are significantly faster the second time. I'm curious if this is just for the section of the tile that was read, or if reading anywhere from the tile gets faster the more you read from it (that would have great performance implications for the full datacube algorithm).
  • Running on a 5 MHa area with 3 rasters hit the memory limits, so this might need to run on smaller areas than we thought. The code duplicates the data at a few points, so reducing that could get us larger areas (or if there's a way to force Python to release the memory at each stage?).
  • The pandas aggregation does actually take a significant amount of time for large areas, on the same scale as each S3 read. We can see what possibilities there are for speeding that up.

Add Github Actions for Automatic Testing/Code Coverage

Is your feature request related to a problem? Please describe.
No automated checks when merging from feature->develop or develop->master.

Describe the solution you'd like
Add GitHub Actions to automate running tests and code coverage when merging pull requests (copying the approach from wri/gfw_pixetl).

Potential rounding errors with window size

The window size calculated from the geometry bounds can be slightly off for different data sources if their VRTs aren't exactly aligned. This can cause the windows to be off by one pixel from each other, causing errors when trying to analyze them together.

Saw this issue because the dev data lake has data sources with only a few tiles, so the VRTs are offset from each other. Not sure if this will be an actual issue once all the VRTs encompass the global extent, but it's something to watch out for.
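
A sketch of where the off-by-one comes from and one way to guard against it: from_bounds() yields fractional offsets when grids aren't aligned, so rounding offsets and lengths the same way for every source keeps all windows the same shape. The URI handling here is illustrative.

import rasterio
from rasterio.windows import Window, from_bounds

bounds = (9.0, 4.1, 9.1, 4.2)  # (left, bottom, right, top)

def aligned_window(uri, bounds):
    with rasterio.open(uri) as src:
        w = from_bounds(*bounds, transform=src.transform)
        # Round identically for every source so all windows share a shape.
        return Window(round(w.col_off), round(w.row_off), round(w.width), round(w.height))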

Use new data lake URL format for raster URLs

We are still using a test URL for raster URLs; this should change to the new data lake URL format. It should also use an environment variable to determine whether to use the dev, staging or production data lake.

Clarify how primary raster works

From the perspective of the API, the input rasters are just a list and all treated the same. However, the code currently assumes the first raster ID references one of our primary data products (e.g. GLAD alerts, forest loss), and all the ones after are just contextual.

We should make this clear in the API if we want to treat the primary raster differently. Right now the only real difference is that a missing or empty primary raster source gives a completely empty result, whereas the code just ignores missing contextual sources and still does the rest of the analysis; otherwise the first raster is just arbitrarily chosen to create the geometry mask. So we could possibly treat all rasters the same and leave responsibility for raster source issues on the client.

Introduce basic logging

Log to stdout; we will later collect logs using CloudWatch.

Use logging.debug for individual steps and logging.info only for main events so that we don't create too much noise. Use logging.error/logging.warning for exceptions.
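
A minimal sketch of that setup, assuming the Lambda runtime forwards stdout/stderr to CloudWatch as usual:

import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # switch to DEBUG locally if needed

def lambda_handler(event, context):
    logger.info("Starting analysis for geostore %s", event.get("geostore_id"))
    try:
        ...  # individual steps use logger.debug
    except Exception:
        logger.exception("Analysis failed")
        raise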

Expand unit tests

Currently there are only unit tests with mocked data. We should create some end-to-end black-box tests pointing to actual datasets and using sufficiently complex geometries. These can be run before committing.

Expand unit tests

Right now there are only a few basic unit tests; we should expand them to increase confidence.
