
es-aggregation-sg's Introduction

es-aggregation-sg

Wranglers

Calculate Column (EntRef/County/..) Wrangler

The wrangler is responsible for preparing the data, invoking the lambda and then sending the data downstream along with the respective notification messages (SNS).

Steps performed:

- Retrieves data from an S3 bucket
- Invokes method lambda
- Puts the aggregated data in an S3 bucket
- Sends SNS message
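
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation. The clients are passed in so the flow can be exercised without AWS; the calls mirror boto3's `get_object`, `invoke`, `put_object` and `publish`. The function name, the bucket/key/topic parameters, passing the data inside `RuntimeVariables`, and the `"data"` key on the method's response are all assumptions made for the example.

```python
import json


def run_column_wrangler(s3, lambda_client, sns, *, bucket, in_key, out_key,
                        method_name, topic_arn, runtime_variables):
    """Read input from S3, invoke the method lambda, store and announce the result."""
    # 1. Retrieve the input data from S3.
    raw = s3.get_object(Bucket=bucket, Key=in_key)["Body"].read().decode("utf-8")

    # 2. Invoke the method lambda (placing the data under
    #    "RuntimeVariables" is an assumption of this sketch).
    event = {"RuntimeVariables": {**runtime_variables, "data": json.loads(raw)}}
    response = lambda_client.invoke(
        FunctionName=method_name,
        Payload=json.dumps(event).encode("utf-8"),
    )
    result = json.loads(response["Payload"].read().decode("utf-8"))
    if not result.get("success"):
        # Assumed failure shape: {"success": False, "error": "..."}
        raise RuntimeError(result.get("error", "method lambda failed"))

    # 3. Put the aggregated data in S3 ("data" is an assumed key name).
    s3.put_object(Bucket=bucket, Key=out_key, Body=json.dumps(result["data"]))

    # 4. Send the SNS notification for the next module.
    sns.publish(TopicArn=topic_arn,
                Message=json.dumps({"success": True, "key": out_key}))
    return out_key
```

In a deployed wrangler the three clients would typically come from `boto3.client("s3")`, `boto3.client("lambda")` and `boto3.client("sns")`.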

Calculate Top 2 Wrangler

The wrangler is responsible for preparing the data, invoking the method lambda and sending the data downstream along with the respective notification messages (SNS).

Steps performed:

- Retrieves data from S3 bucket
- Converts the data from JSON to a DataFrame
- Ensures the mandatory columns are present and correctly typed
- Appends the new output columns in zero state
- Sends the dataframe to the method
- Ensures the new columns are still present and correctly typed in the returned dataframe
- Serialises the dataframe back to json
- Saves the data in an S3 bucket 
- Notifies via SNS   
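
The validation steps in the list above (everything between the S3 read and the S3 write) might look something like this with pandas. It is a sketch under assumptions: the function names are hypothetical, "correctly typed" is interpreted as "numeric", only the total column is treated as mandatory, and the method is passed in as a plain callable rather than invoked as a lambda.

```python
import json

import pandas as pd

# Output column names as listed in the Top Two method's outputs.
NEW_COLUMNS = ["largest_contributor", "second_largest_contributor"]


def check_columns(df, columns):
    """Raise if any of the given columns is missing or not numeric."""
    for col in columns:
        if col not in df.columns:
            raise ValueError(f"Missing column: {col}")
        if not pd.api.types.is_numeric_dtype(df[col]):
            raise TypeError(f"Column {col} is not numeric")


def prepare_and_run(json_data, total_column, method):
    """Wrap a method call with the checks described above (S3/SNS omitted)."""
    df = pd.DataFrame(json.loads(json_data))   # JSON -> DataFrame
    check_columns(df, [total_column])          # mandatory column present and typed
    for col in NEW_COLUMNS:                    # append output columns in zero state
        df[col] = 0
    result = method(df)                        # send the DataFrame to the method
    check_columns(result, NEW_COLUMNS)         # outputs still present and typed
    return result.to_json(orient="records")    # DataFrame -> JSON
```

Checking the columns both before and after the call means a malformed input or a misbehaving method fails in the wrangler rather than in a downstream module.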

Methods

Calculate Enterprise Reference Count Method

Name of Lambda: aggregation_column_method

Summary: This method groups the data by a given column and by region. It then aggregates the specified column (e.g. enterprise_ref) to create a total (e.g. ent_ref_count) and renames the resulting column accordingly.

Inputs: event: {"RuntimeVariables":{
aggregated_column - A column to aggregate by. e.g. Enterprise_Reference.
additional_aggregated_column - A column to aggregate by. e.g. Region.
aggregation_type - How we wish to do the aggregation. e.g. sum, count, nunique.
total_columns - The names of the columns to produce aggregations for.
cell_total_column - Name of column to rename total_column.
}}

Outputs: A JSON dict which contains a success marker and the aggregated data with the column count/sum.
e.g. {"success": True/False, None/"error": NA/"Message"}
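
As an illustration of the grouping and renaming described in the summary, here is a minimal pandas sketch. It is not the lambda's actual code: the function name and sample data are made up, and it handles a single total column for brevity even though `total_columns` is described as plural.

```python
import pandas as pd


def aggregate_column(df, aggregated_column, additional_aggregated_column,
                     aggregation_type, total_column, cell_total_column):
    """Group by two columns, aggregate one column, and rename the result."""
    grouped = (
        df.groupby([aggregated_column, additional_aggregated_column],
                   as_index=False)
          .agg({total_column: aggregation_type})
    )
    return grouped.rename(columns={total_column: cell_total_column})


# Hypothetical sample data: count distinct enterprise references per cell.
data = pd.DataFrame({
    "county": [1, 1, 1, 2],
    "region": [9, 9, 9, 9],
    "enterprise_ref": [100, 100, 101, 102],
})
counts = aggregate_column(data, "county", "region", "nunique",
                          "enterprise_ref", "ent_ref_count")
print(counts.to_dict(orient="records"))
```

Passing `aggregation_type` straight to `agg` lets the same function cover `sum`, `count` and `nunique` without branching.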


Calculate Top Two Method

Name of Lambda: aggregation_top2_wrangler

Summary: Takes a DataFrame in JSON format and calculates the highest and second highest total within each unique combination of aggregated_column and additional_aggregated_column (column names are adjustable in the runtime variables). These are then appended as two new columns. The DataFrame is saved to S3 as JSON and a notification is sent on to the next module via SNS.

Inputs: event: {"RuntimeVariables":{
aggregated_column - A column to aggregate by. e.g. Enterprise_Reference.
additional_aggregated_column - A column to aggregate by. e.g. Region.
total_columns - The names of the columns to produce aggregations for.
}}

Outputs: A JSON dict which contains a success marker and the input DataFrame with the following two columns appended: "largest_contributor" and "second_largest_contributor"
e.g. {"success": True/False, None/"error": NA/"Message"}
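
One way to compute the two new columns with pandas is sketched below. The function name is hypothetical, and using 0 as the second-largest value for a group with only one row is an assumption; the source does not say how that case is handled.

```python
import pandas as pd


def calc_top_two(df, aggregated_column, additional_aggregated_column,
                 total_column):
    """Append the largest and second largest totals of each group to every row."""
    totals = df.groupby(
        [aggregated_column, additional_aggregated_column]
    )[total_column]
    out = df.copy()
    out["largest_contributor"] = totals.transform("max")
    # A group with a single row has no runner-up; 0 is an assumed placeholder.
    out["second_largest_contributor"] = totals.transform(
        lambda s: s.nlargest(2).iloc[-1] if len(s) > 1 else 0
    )
    return out
```

Using `transform` keeps the result aligned with the input rows, so the two values are repeated on every row of the group they belong to.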


Combiner

The combiner is used to join the outputs from the 3 aggregations back onto the original data. It is assumed that the imputed (or original, if it didn't need imputing) data is stored in an S3 bucket by the imputation module, and that each of the 3 aggregation processes writes its output to S3.
The combiner picks up the imputation data and the 3 files from the other aggregation stages from S3, joins them all together and sends the result onwards. As a result, the next module (disclosure) has the granular input data with the aggregations merged on.
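
The join itself can be sketched as a chain of left merges in pandas. This is only an illustration: the function name and the choice of join columns are assumptions, and the S3 reads and the onward send are left out.

```python
from functools import reduce

import pandas as pd


def combine(imputed, aggregations, join_columns):
    """Left-join each aggregation output onto the imputed data in turn."""
    return reduce(
        lambda left, right: left.merge(right, on=join_columns, how="left"),
        aggregations,
        imputed,
    )
```

A left join keeps every row of the granular data. Any cell that is missing from one of the aggregation outputs ends up with nulls in that aggregation's columns.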

*The exact column can be provided as a runtime variable.

es-aggregation-sg's People

Contributors

piwington, kingmushroom, krisrogos, jordancooke, dom-ford, thomashenson, glanvl, lukeglanville, mkeating, joelclemence, dependabot[bot]


es-aggregation-sg's Issues

Period Column + Top2

During testing it was remembered that imputation apply only contains one period. Therefore all the wranglers and methods that use the period column are redundant, as there will never be more than one period.

Combiner includes a temporary fix 15/11

    # !temporary due to the size of our test data.
    # Cells that didn't have any responders to produce
    # aggregations from would otherwise end up with null
    # aggregations (breaking things).
    third_merge.fillna(1, inplace=True, axis=1)

The combiner includes the above fillna because, due to the size of our test data, certain cells were getting nulls after aggregation (caused by there being no responders in that cell to aggregate).

I think that once the data is large enough this would be redundant and never triggered, but it would be good practice to remove it when it's no longer needed.
