awsdocs / amazon-emr-management-guide

The open source version of the Amazon EMR Management Guide. You can submit feedback & requests for changes by submitting issues in this repo or by making proposed changes & submitting a pull request.

License: Other

amazon-emr-management-guide's Introduction

NOTICE

This repository is archived, read-only, and no longer updated. For more information, read the announcement on the AWS News Blog.

You can find up-to-date AWS technical documentation on the AWS Documentation website, where you can also submit feedback and suggestions for improvement.

amazon-emr-management-guide's Issues

The link to CloudWatch events type is broken.

There is a broken link, Amazon EMR Events, at the bottom of this page: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-cloudwatch-events.html

The source: https://github.com/awsdocs/amazon-emr-management-guide/blob/master/doc_source/emr-manage-cloudwatch-events.md

Please update that link.

Write data to an Amazon S3 bucket you don't own using Spark

In https://github.com/awsdocs/amazon-emr-management-guide/blob/master/emr-s3-acls.md you discuss how to work with canned ACLs. I'm facing this issue when working with Spark. Can you explain in the documentation how to set fs.s3.canned.acl (to BucketOwnerFullControl for example) from a Spark application written in Java?
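One possible approach, offered as a sketch rather than a confirmed answer: Spark copies any property named `spark.hadoop.<name>` into the Hadoop configuration as `<name>`, so the ACL could be passed at submit time instead of in application code. The helper name and the `my-app.jar` file below are hypothetical, and whether EMRFS honors the setting in every release is an assumption. Inside a Java application, the same property could presumably be set on the object returned by `JavaSparkContext.hadoopConfiguration()`.

```python
def canned_acl_submit_args(acl="BucketOwnerFullControl"):
    """Return spark-submit arguments that forward the ACL to the Hadoop configuration."""
    # Spark strips the "spark.hadoop." prefix and sets the remainder
    # ("fs.s3.canned.acl") on the Hadoop Configuration used by the job.
    return ["--conf", f"spark.hadoop.fs.s3.canned.acl={acl}"]


if __name__ == "__main__":
    # "my-app.jar" is a placeholder for the application being submitted.
    print(" ".join(["spark-submit", *canned_acl_submit_args(), "my-app.jar"]))
```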

HA Supported Applications and Features Page Contains an en-dash instead of a normal dash

I would PR this in, but the docs haven't been ported here yet :)

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-applications.html#emr-plan-ha-applications-HDFS states:

If you need to find out which NameNode is active, you can use SSH to connect to any master node in the cluster and run the following command:

hdfs haadmin –getAllServiceState

If you look closely, the character before "getAllServiceState" is an en-dash (–) rather than a normal dash (-). If you copy and paste it into a command-line window, you get this:

$ hdfs haadmin –getAllServiceState
Bad command '–getAllServiceState': expected command starting with '-'
...

If you replace the – with a - it works just fine:

$ hdfs haadmin -getAllServiceState
ip-XX-XX-XX-XX.ec2.internal:8020                  active    
ip-XX-XX-XX-XX.ec2.internal:8020                  standby
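Until the page is fixed, a pasted command can be cleaned up before it is run. The following is a minimal sketch, assuming the only damage is a dash look-alike in place of an ASCII hyphen; the function name and the set of characters handled are illustrative choices, not part of any EMR or Hadoop tooling.

```python
# Unicode characters that rendered documentation sometimes substitutes for "-".
DASH_LOOKALIKES = "\u2012\u2013\u2212"  # figure dash, en dash, minus sign


def normalize_dashes(command: str) -> str:
    """Replace dash look-alikes with an ASCII hyphen-minus."""
    return command.translate({ord(ch): "-" for ch in DASH_LOOKALIKES})


# prints: hdfs haadmin -getAllServiceState
print(normalize_dashes("hdfs haadmin \u2013getAllServiceState"))
```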

Canned ACLs not working for Hive if you don't restart the metastore

hive> set fs.s3.canned.acl=BucketOwnerFullControl;
create table acl (n int) location 's3://acltestbucket/acl/';
insert overwrite table acl select count(n) from acl;

I believe s3:HeadBucket is not a valid IAM action

In emr-iam-role-for-ec2.md, s3:HeadBucket is suggested.

According to the IAM User Guide action list for S3 and the S3 HeadBucket API documentation, the corresponding IAM action needed here is s3:ListBucket.
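For anyone who already copied the policy, the substitution can be applied mechanically. This is a sketch under the assumption that the policy is available as a parsed JSON document; the `replace_action` helper and the sample policy are hypothetical and are not taken from emr-iam-role-for-ec2.md.

```python
import json


def replace_action(policy: dict, old: str, new: str) -> dict:
    """Return a copy of the policy with action `old` swapped for `new`, without duplicates."""
    fixed = json.loads(json.dumps(policy))  # deep copy via a JSON round trip
    statements = fixed["Statement"]
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for statement in statements:
        if "Action" not in statement:  # e.g. statements that use NotAction
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        swapped = [new if action == old else action for action in actions]
        statement["Action"] = list(dict.fromkeys(swapped))  # dedupe, keep order
    return fixed


# Hypothetical policy fragment used only to exercise the helper.
sample = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:HeadBucket", "s3:ListBucket", "s3:GetObject"],
            "Resource": "*",
        }
    ],
}
print(json.dumps(replace_action(sample, "s3:HeadBucket", "s3:ListBucket"), indent=2))
```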

Document required connectivity for LocalDiskEncryptionKeyProvider type AwsKms

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-create-security-configuration.html and https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html#emr-awskms-keys discuss using KMS CMKs for EMR encryption. However, there is no mention that the main EC2 instances themselves require network connectivity to KMS when using AwsKms for the local disk encryption (either over the internet or over a VPC Endpoint). Having this spelled out explicitly would be helpful.

Amazon EMR-Managed Security Groups terminology

https://github.com/awsdocs/amazon-emr-management-guide/blob/master/doc_source/emr-man-sec-groups.md references master and slave terminology which must be replaced by manager and worker.

access denied spark emr -> s3

The instructions here lead to an access-denied error when calling S3.

I assume we need to add a bucket or IAM policy for the S3 bucket to make the call succeed.

I've been reading through docs on IAM and bucket policies for a couple of hours with little progress.

Any pointers?

GitHub docs are out of date compared to published AWS docs

The last commit was in January 2020. The GitHub docs are far out of date compared with the published EMR documentation.

Has GitHub been deprecated as a repo for documentation?

Slight inconsistency in DynamoDB read capacity

There seems to be a slight inconsistency in the read capacity allocated in DynamoDB. One document says the RCU is 500, the other says 400.

From emr-plan-consistent-view.md:

the DynamoDB database has 500 read capacity and 100 write capacity

From emrfs-metadata.md:

EMRFS sets default throughput capacity limits on the metadata for its read and write operations at 400 and 100 units, respectively

Wrong explanation

"consider using an m4.xlarge, because vCores in m4.xlarge are twice that of m5.xlarge" on http://docs.aws.amazon.com/en_us/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html

This should be changed to "consider using an m5.xlarge, because vCores in m5.xlarge are twice that of m5.large".

More in-depth security discussion of EMRFS AssumeRole support

It's unclear to me in the documentation today what the security model for EMRFS AssumeRole is. As far as I've been able to gather, the node running a task will transparently call AssumeRole on your behalf if you request a certain prefix or match a user pattern.

However, this seems more like a convenience mechanism than a strong security one (i.e., there's no mechanism in place to stop a task from hitting an S3 bucket with the instance credentials or another assumable role). Is that correct? It seems worth spelling out more explicitly in the documentation to make sure people don't use it in an attempt to create security boundaries.

Information on essential EC2 node IAM permissions

The management guide has a fairly extensive guide to IAM permissions (including sub-pages of that), but as far as I can tell, seems to be lacking a fairly important piece of information: what EMR nodes actually need to do their job, independent of tasks running on top of them.

Right now the guidance seems to be roughly, "use the arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role managed policy, or you can customize your permissions, especially as it pertains to a security configuration that configures AssumeRole for EMRFS".

But arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role is actually pretty powerful:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "cloudwatch:*",
                "dynamodb:*",
                "ec2:Describe*",
                "elasticmapreduce:Describe*",
                "elasticmapreduce:ListBootstrapActions",
                "elasticmapreduce:ListClusters",
                "elasticmapreduce:ListInstanceGroups",
                "elasticmapreduce:ListInstances",
                "elasticmapreduce:ListSteps",
                "kinesis:CreateStream",
                "kinesis:DeleteStream",
                "kinesis:DescribeStream",
                "kinesis:GetRecords",
                "kinesis:GetShardIterator",
                "kinesis:MergeShards",
                "kinesis:PutRecord",
                "kinesis:SplitShard",
                "rds:Describe*",
                "s3:*",
                "sdb:*",
                "sns:*",
                "sqs:*",
                "glue:CreateDatabase",
                "glue:UpdateDatabase",
                "glue:DeleteDatabase",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:UpdatePartition",
                "glue:DeletePartition",
                "glue:BatchDeletePartition",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition",
                "glue:CreateUserDefinedFunction",
                "glue:UpdateUserDefinedFunction",
                "glue:DeleteUserDefinedFunction",
                "glue:GetUserDefinedFunction",
                "glue:GetUserDefinedFunctions"
            ]
        }
    ]
}

Basically, it can do anything to S3, SNS, SQS, SDB, DynamoDB, and several other potentially scary things.
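The service-wide grants can be picked out of a policy programmatically. The sketch below uses an abbreviated copy of the action list above; the `service_wildcards` helper is a hypothetical name, and it only looks at `Allow` statements with a literal `<service>:*` action.

```python
def service_wildcards(policy: dict) -> list:
    """Return the sorted service prefixes that an Allow statement grants as '<service>:*'."""
    found = set()
    statements = policy["Statement"]
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            service, _, name = action.partition(":")
            if name == "*":  # full access to every action of the service
                found.add(service)
    return sorted(found)


# Abbreviated version of the managed policy quoted above.
managed_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Resource": "*",
            "Action": [
                "cloudwatch:*", "dynamodb:*", "ec2:Describe*",
                "kinesis:PutRecord", "s3:*", "sdb:*", "sns:*", "sqs:*",
                "glue:GetTable",
            ],
        }
    ],
}
# prints: ['cloudwatch', 'dynamodb', 's3', 'sdb', 'sns', 'sqs']
print(service_wildcards(managed_policy))
```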

So if you don't fully trust your EMR tasks not to delete/corrupt all your S3 buckets and Dynamo tables, you probably want to customize that policy. But the documentation doesn't make a clear distinction between what EMR itself needs and some speculative permissions on what tasks running on top of it might want.

As far as I've been able to tell, these are completely unused by EMR itself:

DynamoDB
SDB
Kinesis

And S3 is at least partially used to upload logs to the configured logging bucket. Of course, if I tell my EMR job to fetch from s3://foo/bar, I'll need to also include permissions for that in my policy, but that separation is not very crisp right now.

It's also very hard for me to assess whether SNS/SQS is used internally by EMR today because both services have cross-account support so even if I see no relevant queues or topics in my account, I can't say with confidence that I'm not hobbling some uncommon EMR feature by not granting EMR access to those services.

The best experiment I've been able to run is to put the whole thing in a private subnet with no internet access and an S3 VPCE to send logs to S3. The EMR cluster seems quite content in that scenario, which suggests to me that everything but S3 is optional. But obviously if I were to tell an EMR package to fetch from Glue, that would break.

Ultimately, it would be nice to have a broken-down table in the documentation saying things like:

You always need S3 PutObject and ListBucket powers over your configured logging prefix.
If you want to use our Glue integration, you need permissions X, Y, Z on the instance IAM role
If you want to use our EMRFS AssumeRole powers, you need to grant AssumeRole powers to the instance IAM role
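As an illustration of the first row, a policy scoped to the logging prefix might look like the sketch below. The bucket name, prefix, `Sid` values, and helper name are hypothetical, and the choice of `s3:PutObject` plus `s3:ListBucket` follows the proposal in this issue; it is not a confirmed statement of what EMR requires.

```python
import json


def logging_policy(bucket: str, prefix: str) -> dict:
    """Build a policy that allows writing and listing only under the logging prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteClusterLogs",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                # Object-level actions target keys under the prefix.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "ListLogPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                # ListBucket targets the bucket itself; the condition narrows it to the prefix.
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


# "my-emr-logs" and "clusters" are placeholder values.
print(json.dumps(logging_policy("my-emr-logs", "clusters"), indent=2))
```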

Or absent that (but this isn't a documentation thing), a cleaner separation between "task powers" and "EMR machinery powers" like what we have in ECS.

Incorrect instance type recommendation

The documentation at https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html does not match the content in this repo and there seems to be a typo in the live documentation around the recommended instance type.

"The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m5.xlarge instance. For clusters of more than 50 nodes, consider using an m4.xlarge."

I think it should have been
"The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m5.large instance. For clusters of more than 50 nodes, consider using an m5.xlarge."

EMR `Consistent View` not relevant anymore

S3 is now strongly consistent by default across regions.

Amazon S3 Update – Strong Read-After-Write Consistency
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/

The below EMR documentation is not relevant anymore.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html

The EMR documentation needs to be updated to reflect this.

Document Required KMS Permissions

This is a bit similar to #9 but not fully included in it -- https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-encryption-enable.html states:

The role for the Amazon EC2 instance profile must have permissions to use the CMK you specify.

However, it doesn't actually specify what those permissions are. It then states:

You can use the AWS Management Console to add your instance profile or EC2 instance profile to the list of key users for the specified AWS KMS CMK, or you can use the AWS CLI or an AWS SDK to attach an appropriate key policy.

It then walks through going through the AWS console to add the role as a "Key User" but it doesn't actually specify what the required permissions are, nor does it ever describe how one would use the AWS CLI or an AWS SDK to grant appropriate permissions. Can the required KMS permissions please be documented so we can more easily manage them in code?

Thanks!
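As a starting point for managing this in code, the sketch below builds the statement that the AWS KMS console adds to a key policy for "key users" by default. Whether EMR needs every action in that list is an assumption this issue asks the documentation to settle. The helper name, the account ID, and the role ARN are placeholders.

```python
import json


def key_user_statement(role_arn: str) -> dict:
    """Build a key policy statement that mirrors the console's default 'key users' grant."""
    return {
        "Sid": "Allow use of the key",
        "Effect": "Allow",
        "Principal": {"AWS": [role_arn]},
        "Action": [
            "kms:Encrypt",
            "kms:Decrypt",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:DescribeKey",
        ],
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
    }


# Placeholder account ID and role name for the EC2 instance profile role.
statement = key_user_statement("arn:aws:iam::111122223333:role/EMR_EC2_DefaultRole")
print(json.dumps(statement, indent=2))
```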

Guide to creating cross-realm trust between EMR and AWS Managed AD

I'm having a lot of trouble getting the finicky details working properly to connect an EMR in my account to an AWS Managed Microsoft AD in the same account. In theory all the various knobs are in place, but a step-by-step guide would be pretty nice, especially if it included an overview of aws-cli or the relevant API calls, to ease automation.

It's complicated a bit by the fact that the managed AD doesn't let you run the commands described here on it, like netdom trust EC2.INTERNAL /Domain:ad.domain.com /add /realm /passwordt:MyVeryStrongPassword, and instead exposes the trust machinery through an AWS API.

Basic user session policy contains invalid actions

The example role:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-studio-user-role.html#emr-studio-basic-session-policy

https://github.com/awsdocs/amazon-emr-management-guide/blob/main/doc_source/emr-studio-user-role.md#example-basic-user-session-policy

It contains invalid actions. According to the console and the documentation, the following don't exist:

AttachEditor
DetachEditor
CreatePersistentAppUI
DescribePersistentAppUI
GetPersistentAppUIPresignedURL
GetOnClusterAppUIPresignedURL
CreateAccessTokenForManagedEndpoint

https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonelasticmapreduce.html

Encryption At Rest Options Seem to Work on HA Masters

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-applications.html#emr-plan-ha-features-unsupported states:

The following EMR features are currently not available in an EMR cluster with multiple master nodes:
...
* At-rest and in-transit encryption options

However, I successfully spun up a multi-master EMR cluster with encryption-at-rest options. I even terminated the active master node (at least, the active YARN RM node), and a new one replaced it and re-applied the encryption-at-rest options.

How to update the emr master dns whenever the cluster terminates

Hi,
When we terminate an EMR cluster and spin up a new one within a VPC, the IP address and DNS name of the master node, the resource manager, and everything else change. To get a friendly name that always points to the currently running EMR cluster, I found the document below:
https://aws.amazon.com/blogs/big-data/dynamically-create-friendly-urls-for-your-amazon-emr-web-interfaces/
We are using our on-prem DNS. Apart from this, is there no other way? If we create a VPC endpoint to the EMR cluster and point a DNS alias at the VPC endpoint, would that not serve the purpose?