
quickstart-aws-utility-meter-data-analytics-platform's Introduction

[Defunct] Utility Meter Data Analytics on AWS

This repository for Utility Meter Data Analytics is no longer maintained or supported. For the currently supported repository, refer to Guidance for Meter Data Analytics on AWS.

quickstart-aws-utility-meter-data-analytics-platform's People

Contributors

andrew-glenn, annaone, aws-ia-ci, censullo, crobison-slalom, delfingala, dependabot[bot], jaymccon, jdebo-slalom, johnmousa, juayu, lucius-aws-quickstart, marciarieferjohnston, saschajanssen, sshvans, tbulding, vsnyc


quickstart-aws-utility-meter-data-analytics-platform's Issues

New VPC Deploy all defaults/london data, workflow job transform_raw_to_clean-us-east-1 dies

Hi - Thanks for doing this utility-focused quick start! I decided to give it a try as I'm involved with MDM. When deploying the quick start, pointing out the password requirements for Redshift would have been helpful: I had two failed deploys before I thought of choosing a password with uppercase, lowercase, and special characters.

I've deployed in a new VPC, with all defaults.

Once the workflow is triggered, it gets as far as the transform_raw_to_clean-us-east-1 job, which dies with a TypeError: sequence item 0: expected str instance, NoneType found. Looking at the job's "Error Logs" gives no useful clue as to the cause. The "Logs" are full of output, though the only error I see is

20/10/28 23:20:58 WARN YarnClient: The GET request failed for the URL http://0.0.0.0:8088/ws/v1/cluster/apps/application_1603926903796_0001
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.HttpHostConnectException: Connect to 0.0.0.0:8088 [/0.0.0.0] failed: Connection refused (Connection refused)

The IP address 0.0.0.0 on port 8088 looks strange to me. Any suggestions on where to look to resolve the issue?
Thanks

Ken

State machine fails at the step to create SageMaker model

The execution role of the state machine "MachineLearningPipelineModelTrainingStateMachine" should allow the action "sagemaker:AddTags" in its inline policy, because it looks like Step Functions now wants to add metadata tags to whatever processes it runs. Adding the permission let the state machine complete without throwing an error.
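An inline-policy statement of the kind described might look like the sketch below; the wildcard resource is an assumption for brevity, and scoping it to the relevant SageMaker training-job/model/endpoint ARNs would be tighter:

```json
{
    "Effect": "Allow",
    "Action": "sagemaker:AddTags",
    "Resource": "*"
}
```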

Unable to deploy MDA quickstart v2 due to S3 errors

Deploying the MDA v2 stack results in rollback due to multiple S3 errors:

DataStorageStack: S3 error: The specified key does not exist. For more information check http://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html

IntegrationStack: S3 error: The specified key does not exist.

CopyGlueScriptsStack: S3 error: The specified key does not exist.

The above errors occur with the MeterDataGenerator parameter set to either 'Disabled' or 'Enabled'.

Please advise on resolution.

python runtime 3.6 deprecated for Lambda function GenID

I tried to deploy the CloudFormation stack and received a failure in the Redshift stack deployment. The root cause is a Lambda function created by a resource named "GenID" in the CloudFormation template that uses the python3.6 runtime, which is now deprecated.
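Assuming the template is YAML, the likely fix is to bump the Runtime property on the GenID function's resource to a currently supported version (python3.12 here is an assumption; any supported runtime the handler code is compatible with would do):

```yaml
GenID:
  Type: AWS::Lambda::Function
  Properties:
    Runtime: python3.12  # was python3.6, now deprecated
    # ... remaining properties unchanged
```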

Additional issues in sample notebook mda_demos

I resolved the pyathena dependency mentioned in Issue #55 with the following command:
!pip install pyathena

  1. I ran into some issues in the cell following the header "Get forecast from pre-calculation result, can be one or many meters":
    a. The SageMaker execution role needs permission to start an Athena query.
    b. The variable "athena-output-bucket" needs to be updated to match the output bucket in S3.
    c. In the query, the schema and table need to be updated; however, I did not find a table named "forecast".

  2. The cell under the header "Get Outage" failed with the error below:

KeyError Traceback (most recent call last)
in
13 data = json.loads(resp)
14
---> 15 if data['Items']:
16 df = pd.DataFrame(data['Items'])
17 df_result = df[['meter_id', 'error_value', 'lat', 'long']].drop_duplicates()

KeyError: 'Items'
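A defensive rewrite of that cell would avoid the KeyError when the response payload has no 'Items' key. This is a sketch; the function name and the assumption that resp is a JSON string shaped like the notebook's API response are mine, not the notebook's:

```python
import json

import pandas as pd


def outage_frame(resp):
    """Return a DataFrame of outage rows; empty frame when 'Items' is missing."""
    data = json.loads(resp)
    items = data.get("Items", [])  # .get avoids the KeyError on empty responses
    if not items:
        return pd.DataFrame(columns=["meter_id", "error_value", "lat", "long"])
    df = pd.DataFrame(items)
    return df[["meter_id", "error_value", "lat", "long"]].drop_duplicates()
```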

London sample meter data is no longer available to download

The dataset mentioned for this implementation is currently not available for download. Is there an alternative source for this dataset? The instructions also don't mention the schema of this sample dataset, so I'm not sure what to look for on other websites.


generate_meter_data.py format

Hi,

Is there a reason the meter data is output in the format that it is? It doesn't seem to match any of the formats in the ETL pipeline.

The dataset for the London data has been removed, but looking at the copy on Kaggle you can see that the London data has a structure like:

LCLid,stdorToU,DateTime,KWH/hh (per half hour) ,Acorn,Acorn_grouped
MAC000036,Std,2012-11-08 10:30:00.0000000, 0.003 ,ACORN-E,Affluent

Whereas the generated data has something more like:

46001|2.8.3|2019010100240000|45.218|AGG|spid|rq|ansi|smult|DST|AccNum|sqcode
46001|1.8.1|2019010100240000|26.492|INT|spid|rq|ansi|smult|DST|AccNum|sqcode
46001|1.8.0|2019010100240000|31.191|INT|spid|rq|ansi|smult|DST|AccNum|sqcode

Is there a reason the extra fields are appended at the end? Is this some other meter data format?
Just curious, as the ability to generate meter data like this would be invaluable for our testing.
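The generated rows above can be split on the pipe delimiter. The field names in this sketch are guesses from the sample rows (the second column looks OBIS-like, the fifth like a reading type, and the trailing columns appear to be literal placeholder tokens); they are not taken from the project's documentation:

```python
from collections import namedtuple

# Field names below are guesses from the sample rows, not a documented schema.
Reading = namedtuple("Reading", [
    "meter_id", "obis_code", "timestamp", "value", "reading_type",
    "spid", "rq", "ansi", "smult", "dst", "acc_num", "sq_code",
])


def parse_generated_row(line):
    """Split one pipe-delimited generated record into named fields."""
    parts = line.strip().split("|")
    if len(parts) != 12:
        raise ValueError(f"expected 12 pipe-delimited fields, got {len(parts)}")
    rec = Reading(*parts)
    return rec._replace(value=float(rec.value))  # only the reading is numeric
```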

Issues encountered running sample notebook mda_demos

There are a few issues in the sample Jupyter notebook mda_demos:

  1. The Deployment Guide does not specify which kernel to use; the notebook opens with no kernel. I used the conda_python3 kernel for these tests.

  2. The solution requires the altair module, but there is no cell to install it. This command worked for me:
    !pip install altair vega_datasets

  3. The solution requires pyathena, but there is no cell to install it. I tried the command below but ran into dependency problems:
    !conda install -c conda-forge pyathena

  4. There are a few lines of code below the caption "Get forecast from pre-calculation result, can be one or many meters", but this code is not enclosed in a cell and cannot be executed unless the user creates a cell and pastes the code into it.

Documentation question - Geolocation schema

In the Quick Start documentation, Table 2 shows the geolocation schema. The first column is labeled "meter id"; however, all of the other schemas use "meter_id" (with an underscore). Please verify which is correct and update the documentation if necessary.

Issue in workflow

I get the error below in the workflow. The ETL Glue job shows the following TypeError in the raw-to-clean data transform:
distinctDatesStrList = ','.join(value['date_str'] for value in distinctDates)
TypeError: sequence item 0: expected str instance, NoneType found

Before I do anything on my side, I wanted to check whether this is expected.
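A likely fix for the failing line, assuming some rows reach the clean transform with a NULL date_str, is to skip the None entries before joining. The helper name is mine; the join expression mirrors the one in the error above:

```python
def join_distinct_dates(distinct_dates):
    """Comma-join date strings, skipping rows whose date_str is missing/None."""
    return ",".join(
        value["date_str"]
        for value in distinct_dates
        if value.get("date_str") is not None  # avoids the NoneType TypeError
    )
```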

VPC fails to create when creating the stack in CloudFormation

I attempted to use the CloudFormation template to "Deploy Utility Meter Data Analytics into a new VPC on AWS", and the stack fails to create on the VPC. On the logical ID "VPC" I get an "S3 Error Access Denied", but unfortunately no more information than that is provided in the log. The role has admin access, so I would assume S3 access denied should not be a problem. Any suggestions for debugging this would be great.


Steps to retrain the model not complete

We found that retraining the model requires the following steps; the JSON provided in the quick start was not complete:

  1. Open Step Functions → State Machines → InitialTrainingStateMachine.
  2. Click Start Execution and paste the following into the input. Change ModelName and ML_endpoint_name to match the ML model name, and provide a unique Training_job_name, e.g. training-job-nick-0001 (if you need to run the job again, you must increment the number to keep it unique).

{
"ModelName": "ml-model-fafa9336-6f1d-4fa4-bbfe-331afa384dc4",
"ML_endpoint_name": "ml-endpoint-a2544bb9-4d4d-4fd8-bed5-81eab8b0cf31",
"Training_job_name": "training-job-nick-0001",
"Data_start": "2013-01-01",
"Data_end": "2013-10-01",
"Meter_start": 1,
"Training_samples": 50,
"Forecast_period": 7,
"Endpoint_instance_type": "ml.m5.xlarge",
"Training_instance_type": "ml.c5.2xlarge",
"Meter_end": 100,
"Batch_size": 25
}
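The "increment the number" step can be avoided by generating a unique Training_job_name per run, e.g. from a timestamp. This sketch builds the same input shown above (the helper name is mine, and the boto3 call is shown only as commented usage since it needs a real state-machine ARN):

```python
import json
import time


def build_retrain_input(model_name, endpoint_name, **overrides):
    """Build the Step Functions input JSON with a unique training-job name."""
    payload = {
        "ModelName": model_name,
        "ML_endpoint_name": endpoint_name,
        "Training_job_name": f"training-job-{int(time.time())}",  # unique per run
        "Data_start": "2013-01-01",
        "Data_end": "2013-10-01",
        "Meter_start": 1,
        "Meter_end": 100,
        "Training_samples": 50,
        "Forecast_period": 7,
        "Endpoint_instance_type": "ml.m5.xlarge",
        "Training_instance_type": "ml.c5.2xlarge",
        "Batch_size": 25,
    }
    payload.update(overrides)
    return json.dumps(payload)


# Usage against a deployed state machine (not run here):
# import boto3
# boto3.client("stepfunctions").start_execution(
#     stateMachineArn=state_machine_arn,
#     input=build_retrain_input("my-model", "my-endpoint"))
```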

It's too easy to mis-configure the required parameters

Step 4 of Launch the Quick Start

"On the Specify stack details page, change the stack name if needed. Review the parameters for the template. Provide values for the parameters that require input. For all other parameters, review the default settings and customize them as necessary. For details on each parameter, see the Parameter reference section of this guide."

There are 4 'gotchas' we tripped up on that are only revealed by deploying once, watching it fail and roll back, configuring the rollback options, deploying again, and finally seeing the error. I suggest adding validation earlier in the CFN template and making the documentation clearer about which options are required.

The 'london' transformer is not documented clearly in the quick start; we only found it by tracking down a GitHub issue.

Parameter label (name) | Default value | Description
Master user name (MasterUsername) | Requires input | Master user name for the Amazon Redshift cluster. The user name must be lowercase, begin with a letter, contain only alphanumeric characters, '_', '+', '.', '@', or '-', and be less than 128 characters.
Master user password (MasterUserPassword) | Requires input | Master user password for the Amazon Redshift cluster. The password must be 8 - 64 characters, contain at least one uppercase letter, at least one lowercase letter, and at least one number. It can only contain ASCII characters (ASCII codes 33-126), except ' (single quotation mark), " (double quotation mark), /, \, or @.
Availability Zones (AvailabilityZones) | Requires input | List of Availability Zones to use for the subnets in the VPC. The logical order is preserved. You must provide two zones, according to AWS best practices. If a third Availability Zone parameter is specified, you must also provide that zone.
Transformer that reads the landing-zone data (LandingzoneTransformer) | london | Defines the transformer for the input data in the landing zone.
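The password rule in the parameter table above can be checked locally before deploying, which would have caught the failed-deploy-and-rollback loop earlier. A sketch of a validator matching the documented constraints (the function name is mine):

```python
import re

# Characters the documentation explicitly forbids: ' " / \ @
FORBIDDEN = set("'\"/\\@")


def valid_redshift_password(pw):
    """Check the documented MasterUserPassword constraints."""
    if not 8 <= len(pw) <= 64:
        return False
    if not (re.search(r"[A-Z]", pw) and re.search(r"[a-z]", pw)
            and re.search(r"[0-9]", pw)):
        return False
    # printable ASCII 33-126 only, minus the forbidden characters
    return all(33 <= ord(c) <= 126 and c not in FORBIDDEN for c in pw)
```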
