Amazon EMR cluster

Amazon EMR enables rapid setup of scalable clusters for big data processing using frameworks such as Apache Spark

Before launching an EMR cluster, you must prepare your application, input data, and storage (typically an S3 bucket in the same AWS Region as the cluster)

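EMR reads and writes data in Amazon S3 through EMRFS, and bucket names must follow the S3 bucket naming rules (for example, 3 to 63 characters drawn from lowercase letters, digits, hyphens, and periods)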
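
The naming rules can be checked before any bucket is created. The following is a minimal sketch that covers only a subset of the general-purpose bucket rules (length, allowed characters, adjacent periods, the reserved xn-- prefix and -s3alias suffix, and IP-address-like names); the function name is_valid_bucket_name and the sample bucket names are made up for illustration

```python
import ipaddress
import re

# 3-63 characters, lowercase letters/digits/periods/hyphens,
# starting and ending with a letter or digit.
_BUCKET_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes a subset of the S3 bucket naming rules."""
    if not _BUCKET_PATTERN.fullmatch(name):
        return False
    if ".." in name:
        return False  # two adjacent periods are not allowed
    if name.startswith("xn--") or name.endswith("-s3alias"):
        return False  # reserved prefix and suffix
    try:
        ipaddress.IPv4Address(name)
    except ValueError:
        return True   # not formatted like an IP address, so acceptable
    return False      # names formatted like 192.168.5.4 are rejected


# Hypothetical bucket names, used only to exercise the checks.
for candidate in ("my-emr-logs-123", "My_Bucket", "192.168.5.4"):
    print(candidate, is_valid_bucket_name(candidate))
```

A check like this only catches malformed names early; S3 still decides whether the name is available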

You can launch an EMR cluster via the AWS Management Console or AWS CLI, specifying Spark as the application

Cluster configuration includes choosing the EMR release version, the instance type and count, IAM permissions, and the S3 location where logs are stored
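
These settings map onto options of the aws emr create-cluster command. The sketch below only assembles the argument list and prints it, so nothing is launched; the helper name create_cluster_command and every value passed to it (cluster name, release label, instance type, key pair, log bucket) are placeholders to be replaced with your own

```python
import shlex


def create_cluster_command(name, release_label, instance_type,
                           instance_count, key_name, log_uri):
    """Build the argument list for `aws emr create-cluster` with Spark installed."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,         # EMR release version
        "--applications", "Name=Spark",           # install Spark on the cluster
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",                    # EMR_DefaultRole / EMR_EC2_DefaultRole
        "--ec2-attributes", f"KeyName={key_name}",
        "--log-uri", log_uri,                     # S3 location for cluster logs
    ]


# Placeholder values: substitute your own release, key pair, and bucket.
cmd = create_cluster_command(
    name="spark-demo",
    release_label="emr-6.15.0",
    instance_type="m5.xlarge",
    instance_count=3,
    key_name="my-key-pair",
    log_uri="s3://my-emr-logs-123/logs/",
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to hand to subprocess.run once the values have been reviewed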

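The default IAM roles (EMR_DefaultRole as the service role, EMR_EC2_DefaultRole for the cluster's EC2 instances) are required for cluster operation and can be created with the AWS CLI command aws emr create-default-roles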
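
Role creation and a follow-up existence check can be scripted with the AWS CLI. The sketch below lists the commands and, by default, only prints them; the helper names role_commands and run_all and the dry_run flag are illustrative, and actually running the commands assumes the AWS CLI is installed and configured with credentials

```python
import subprocess

DEFAULT_ROLES = ("EMR_DefaultRole", "EMR_EC2_DefaultRole")


def role_commands():
    """Commands that create the default EMR roles, then confirm each one exists."""
    commands = [["aws", "emr", "create-default-roles"]]
    for role in DEFAULT_ROLES:
        commands.append(["aws", "iam", "get-role", "--role-name", role])
    return commands


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if the CLI call fails


run_all(role_commands())  # dry run: prints the three commands
```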

After creation, monitor the cluster status as it transitions from STARTING to RUNNING to WAITING; WAITING means the cluster is up and ready to accept work
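
Monitoring can be automated with a small polling loop. This is a sketch under assumptions: wait_until_waiting and describe_state_command are hypothetical helper names, the polling interval and limit are arbitrary, and the demonstration feeds in a canned sequence of states instead of calling AWS

```python
import time

TERMINAL_STATES = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def describe_state_command(cluster_id):
    """CLI call that prints only the cluster state, e.g. STARTING or WAITING."""
    return ["aws", "emr", "describe-cluster", "--cluster-id", cluster_id,
            "--query", "Cluster.Status.State", "--output", "text"]


def wait_until_waiting(get_state, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll get_state() until WAITING; return the distinct states observed in order."""
    seen = []
    for _ in range(max_polls):
        state = get_state()
        if not seen or seen[-1] != state:
            seen.append(state)               # record each state change once
        if state == "WAITING":
            return seen
        if state in TERMINAL_STATES:
            raise RuntimeError(f"cluster ended in state {state}")
        sleep(poll_seconds)
    raise TimeoutError("cluster did not reach WAITING in time")


# Demonstration with canned states; a real get_state would run
# describe_state_command(...) via subprocess and strip its output.
fake_states = iter(["STARTING", "STARTING", "RUNNING", "WAITING"])
print(wait_until_waiting(lambda: next(fake_states), sleep=lambda seconds: None))
# prints ['STARTING', 'RUNNING', 'WAITING']
```

Passing the state lookup in as a function keeps the loop testable without a live cluster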

Security groups must be configured to allow inbound SSH (TCP port 22) to the master node, ideally restricted to trusted IPs
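
Restricting SSH to trusted addresses can be enforced when the ingress rule is built. The sketch below assembles an aws ec2 authorize-security-group-ingress command and refuses a rule that would cover every IPv4 address; the helper name ssh_ingress_command, the security group ID, and the address (taken from a documentation-only range) are placeholders

```python
import ipaddress


def ssh_ingress_command(group_id, cidr):
    """Build a CLI command that allows inbound SSH from one trusted IPv4 range."""
    network = ipaddress.IPv4Network(cidr)    # raises ValueError on malformed input
    if network.prefixlen == 0:
        raise ValueError("refusing to open SSH to every address (0.0.0.0/0)")
    return [
        "aws", "ec2", "authorize-security-group-ingress",
        "--group-id", group_id,
        "--protocol", "tcp",
        "--port", "22",
        "--cidr", str(network),
    ]


# Placeholder group ID and address: use the master node's security group
# and your own public IP with a /32 suffix.
print(" ".join(ssh_ingress_command("sg-0123456789abcdef0", "203.0.113.7/32")))
```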

SSH into the master node to access logs or submit jobs; for security, avoid opening SSH to all addresses (0.0.0.0/0)
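
The AWS CLI can open the SSH session itself through aws emr ssh. As a sketch, the helper below (emr_ssh_command is a made-up name) builds that command, optionally with a single remote command to run; the cluster ID, key file path, and the spark-submit call are placeholders

```python
import shlex


def emr_ssh_command(cluster_id, key_file, remote_command=None):
    """Build an `aws emr ssh` command for the cluster's master node."""
    cmd = ["aws", "emr", "ssh",
           "--cluster-id", cluster_id,
           "--key-pair-file", key_file]
    if remote_command:
        cmd += ["--command", remote_command]  # run one command, then disconnect
    return cmd


# Placeholder cluster ID and key path.
interactive = emr_ssh_command("j-XXXXXXXXXXXXX", "my-key-pair.pem")
one_shot = emr_ssh_command("j-XXXXXXXXXXXXX", "my-key-pair.pem",
                           remote_command="spark-submit --version")
print(shlex.join(interactive))
print(shlex.join(one_shot))
```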

Use the AWS CLI for advanced management, including cluster creation, status checks, and SSH connections