Amazon EMR architecture

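HDFS distributes data across the instances in a cluster and keeps multiple copies on different instances, so no data is lost if a single instance fails; it is useful for caching intermediate results during MapReduce processing and for workloads with significant random I/O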
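
To illustrate why redundant copies protect against the loss of one instance, here is a toy sketch in Python. It is not HDFS's real placement policy (which is rack-aware and managed by the NameNode); the round-robin scheme, the function names, and the instance IDs are all assumptions made for illustration.

```python
def place_replicas(num_blocks, instances, replication):
    """Toy round-robin placement: each block gets `replication` distinct instances."""
    if replication > len(instances):
        raise ValueError("replication factor exceeds instance count")
    placement = {}
    for block in range(num_blocks):
        placement[block] = {
            instances[(block + offset) % len(instances)]
            for offset in range(replication)
        }
    return placement


def lost_blocks(placement, failed):
    """Return the blocks whose every replica lived on a failed instance."""
    failed = set(failed)
    return [block for block, holders in placement.items() if holders <= failed]


# Hypothetical instance IDs, used only for this illustration.
instances = ["i-a", "i-b", "i-c"]

two_copies = place_replicas(6, instances, replication=2)
one_copy = place_replicas(6, instances, replication=1)

print(lost_blocks(two_copies, ["i-a"]))  # [] : every block still has a surviving copy
print(lost_blocks(one_copy, ["i-a"]))    # [0, 3] : blocks stored only on i-a are gone
```

With two copies per block, losing one instance leaves every block readable; with a single copy, the blocks that lived on the failed instance are lost.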

Amazon EMR also supports open-source applications that bring their own cluster management instead of relying on YARN, which offers flexibility for specific use cases

Spark supports libraries like Spark SQL, MLlib, and GraphX, while MapReduce works with Java, Hive, and Pig for data processing

Amazon EMR supports applications like Hive, Pig, and Spark Streaming for tasks like data warehousing, machine learning, and stream processing

Apache Spark is a cluster framework that caches datasets in memory and uses directed acyclic graphs for its execution plans, which typically makes it faster than MapReduce
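
The two ideas in this description, a lazily built execution graph and in-memory caching, can be sketched with a toy class in plain Python. This is not the Spark API; the `Dataset` class and its methods are hypothetical stand-ins, and the lineage here is a simple linear chain, the simplest kind of directed acyclic graph.

```python
class Dataset:
    """Toy lazy dataset: transformations record lineage, actions execute it."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source        # concrete data for a root node
        self._parent = parent        # upstream node in the lineage graph
        self._fn = fn                # function applied to the parent's rows
        self._cache_enabled = False
        self._cached = None
        self.compute_count = 0       # how often this node was actually computed

    def map(self, f):
        # Transformation: nothing runs yet, we only add a node to the graph.
        return Dataset(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return Dataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Ask for this node's result to be kept in memory once computed.
        self._cache_enabled = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._fn(self._parent._compute())
        if self._cache_enabled:
            self._cached = rows
        return rows

    def collect(self):
        # Action: walks the lineage and returns the rows.
        return list(self._compute())

    def count(self):
        # Action: reuses the cached rows when they are available.
        return len(self._compute())


numbers = Dataset(source=range(10))
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n).cache()

print(even_squares.collect())      # [0, 4, 16, 36, 64]
print(even_squares.count())        # 5
print(even_squares.compute_count)  # 1 : the second action was served from memory
```

Without the `cache()` call, each action would recompute the whole chain from the source, which is the kind of repeated work that in-memory caching avoids.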

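Amazon EMR uses the built-in YARN node labels feature to assign the CORE label to core nodes, so that application masters are scheduled only on those more stable nodes; this is the default in the 5.x series from release 5.19.0 and must be enabled explicitly in the 6.x series and later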
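
As a hedged sketch, the snippet below shows how the node label behavior could be requested explicitly on a release where it is not enabled by default. It builds an Amazon EMR configuration object for the `yarn-site` classification in Python; the helper name `node_label_configuration` is hypothetical, and whether these properties are needed at all depends on the release, so check the release guide before relying on it.

```python
import json


def node_label_configuration(enable=True, am_label="CORE"):
    """Build a yarn-site configuration object (hypothetical helper).

    The two property names are YARN node label settings; the defaults
    chosen here are assumptions for illustration.
    """
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                # Turn the YARN node labels feature on or off.
                "yarn.node-labels.enabled": "true" if enable else "false",
                # Label that application masters are scheduled on by default.
                "yarn.node-labels.am.default-node-label-expression": am_label,
            },
        }
    ]


# JSON in this shape can be supplied as the cluster's software configuration.
print(json.dumps(node_label_configuration(), indent=2))
```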

Hadoop MapReduce is a distributed computing framework that simplifies parallel application development: you supply a Map function that emits intermediate key-value pairs and a Reduce function that combines them into the final output
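
The Map and Reduce roles can be sketched with a single-process word count in plain Python. This is only an illustration of the programming model, not Hadoop code: the function names are made up, and the grouping step that a real framework performs across the cluster is simulated here with a dictionary.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """Reduce: combine all values for one key into a final (key, total) pair."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    intermediate = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(key, values) for key, values in shuffle(intermediate).items())


print(run_job(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In a real cluster the map calls run in parallel on different nodes and the framework handles the grouping, scheduling, and retries; the author of the job writes only the two functions.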

Amazon EMR protects running jobs by restricting application master processes to core nodes, so that a job does not fail when task nodes running on Spot Instances are terminated

By default, Amazon EMR uses YARN (Yet Another Resource Negotiator) to centrally manage cluster resources and schedule data processing jobs across multiple frameworks

Locally attached instance store volumes are ephemeral: their data persists only for the lifetime of the Amazon EC2 instance