Amazon EMR

Amazon EMR simplifies running big data frameworks like Apache Hadoop and Spark on AWS for business intelligence and analytics

Jobs can be submitted as steps during cluster creation, added later through the EMR console, API, or CLI, or run interactively over SSH on the primary node
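
As a rough illustration of what a submitted step looks like, the sketch below builds one step definition as a plain Python dict. The field names follow the step structure accepted by the EMR API (as exposed by boto3); the helper name `spark_step`, the step name, and the S3 path are hypothetical and exist only for this example.

```python
def spark_step(name, script_uri, action_on_failure="CONTINUE"):
    """Build one step entry that runs spark-submit through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


# Hypothetical script location, for illustration only.
step = spark_step("daily-aggregation", "s3://example-bucket/jobs/aggregate.py")
print(step["HadoopJarStep"]["Args"])
```

With boto3, a list of such dicts would typically be passed as `Steps` to `run_job_flow` at cluster creation, or to `add_job_flow_steps` for a cluster that is already running.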

Data is processed in sequential steps, with each step performing specific tasks like data manipulation or querying

Data is typically stored in HDFS or Amazon S3, processed through steps, and output to a designated location like an S3 bucket

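Steps transition through states (PENDING, RUNNING, then COMPLETED, FAILED, or CANCELLED); pending steps can be cancelled, and when a step fails EMR can continue with the next step, cancel the remaining steps and wait, or terminate the cluster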
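
The failure handling described above can be modelled with a small simulation. This is a simplified sketch, not EMR's actual scheduler: it assumes the three failure actions `CONTINUE`, `CANCEL_AND_WAIT`, and `TERMINATE_CLUSTER`, and the function name `run_steps` and the tuple layout are made up for the example.

```python
def run_steps(steps):
    """Simulate sequential steps.

    steps: list of (name, succeeds, action_on_failure) tuples.
    Returns (states, terminate_cluster), where states maps each step
    name to its final state in this simplified model.
    """
    states = {name: "PENDING" for name, _, _ in steps}
    for index, (name, succeeds, action) in enumerate(steps):
        states[name] = "RUNNING"
        if succeeds:
            states[name] = "COMPLETED"
            continue
        states[name] = "FAILED"
        if action == "CONTINUE":
            continue  # later steps still run
        if action not in ("CANCEL_AND_WAIT", "TERMINATE_CLUSTER"):
            raise ValueError(f"unknown action on failure: {action}")
        # Both remaining actions cancel every step that has not started yet.
        for later_name, _, _ in steps[index + 1:]:
            states[later_name] = "CANCELLED"
        return states, action == "TERMINATE_CLUSTER"
    return states, False


states, terminate = run_steps([
    ("load", True, "CONTINUE"),
    ("transform", False, "CANCEL_AND_WAIT"),
    ("export", True, "CONTINUE"),
])
print(states, terminate)
```

In this run the failed `transform` step cancels `export`, and the cluster is left running rather than terminated.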

Clusters progress through states: STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATING, and finally TERMINATED (or TERMINATED_WITH_ERRORS if the cluster shut down because of a failure)
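
One way to reason about the cluster lifecycle is as a transition table. The table below is a simplified model inferred from the state list, not an official specification; it also includes `TERMINATED_WITH_ERRORS`, the state EMR reports for a cluster that ended because of an error. The names `TRANSITIONS`, `is_valid_path`, and `is_terminal` are hypothetical helpers.

```python
# Simplified, assumed transition table for cluster states.
TRANSITIONS = {
    "STARTING": {"BOOTSTRAPPING", "TERMINATING"},
    "BOOTSTRAPPING": {"RUNNING", "TERMINATING"},
    "RUNNING": {"WAITING", "TERMINATING"},
    "WAITING": {"RUNNING", "TERMINATING"},
    "TERMINATING": {"TERMINATED", "TERMINATED_WITH_ERRORS"},
    "TERMINATED": set(),
    "TERMINATED_WITH_ERRORS": set(),
}


def is_terminal(state):
    """A state is terminal when the model allows no further transitions."""
    return not TRANSITIONS[state]


def is_valid_path(path):
    """Check that each consecutive pair of states is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


lifecycle = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING",
             "TERMINATING", "TERMINATED"]
print(is_valid_path(lifecycle), is_terminal(lifecycle[-1]))
```

A helper like `is_terminal` is useful when polling a cluster's status, since it tells the caller when to stop waiting.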

Custom scripts, called bootstrap actions, run on each instance during cluster setup, before the applications are installed, to add software or configure instances
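
A setup script of this kind is declared as a small structure that points at a script in S3. The sketch below builds one such entry using the field names from the EMR API's bootstrap action configuration; the helper name, the script path, and its arguments are hypothetical.

```python
def bootstrap_action(name, script_uri, args=()):
    """Build one bootstrap action entry pointing at a setup script in S3."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_uri,
            "Args": list(args),
        },
    }


# Hypothetical script and arguments, for illustration only.
install_libs = bootstrap_action(
    "install-python-libs",
    "s3://example-bucket/bootstrap/install_libs.sh",
    args=["pandas", "pyarrow"],
)
print(install_libs["ScriptBootstrapAction"]["Path"])
```

With boto3, a list of these entries would be passed as `BootstrapActions` when creating the cluster with `run_job_flow`.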

EMR supports pre-installed applications like Hive, Hadoop, and Spark for data processing

Clusters can auto-terminate after completing steps or remain in a WAITING state for manual shutdown

EMR supports custom AMIs and configurable instance hardware, and termination protection guards a cluster against accidental shutdown so that data on its instances can be recovered first
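
The application, termination, and AMI options above all end up as fields of the cluster creation request. The sketch below assembles such a request as a plain dict, using parameter names from the EMR `RunJobFlow` API; the release label, instance types, instance count, cluster name, and AMI ID are example values chosen for illustration, and `build_cluster_request` is a hypothetical helper.

```python
def build_cluster_request(name, keep_alive, termination_protected,
                          custom_ami_id=None):
    """Assemble keyword arguments for a cluster creation call."""
    request = {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # example release, adjust as needed
        "Applications": [{"Name": app} for app in ("Hadoop", "Hive", "Spark")],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # example hardware choice
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False: terminate once all steps finish; True: stay in WAITING.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
            "TerminationProtected": termination_protected,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
    if custom_ami_id is not None:
        request["CustomAmiId"] = custom_ami_id
    return request


# Hypothetical cluster name and AMI ID.
request = build_cluster_request(
    "nightly-etl",
    keep_alive=False,
    termination_protected=False,
    custom_ami_id="ami-0123456789abcdef0",
)
print(sorted(request))
```

With boto3 installed and credentials configured, the dict could be unpacked into `boto3.client("emr").run_job_flow(**request)`; steps and bootstrap actions would be added under the `Steps` and `BootstrapActions` keys.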