👩‍💻 Loi Nguyen Second Brain

❯

❯

1. Why Apache Spark RDD is immutable?

1. Why Apache Spark RDD is immutable?

227 words, 2 min read
Published on 2024-11-30T23:42:00
🌟 Edit This Page! 🗓️ History

data-engineer
spark

What is RDD?

RDD stands for Resilient Distributed Dataset.
An abstraction Spark represent a large collection of data distributed across a node cluster.
RDD stored in partitions on different nodes across the cluster.
A partition is essentially a logical chunk of a large distributed dataset.
Partitions are primary unit of Spark.

Why are RDD immutable?

Important

RDDs are immutable - They can not changed once created and distributed across the cluster’s memory.

Functional Programming Influence

Immutability ensures that once an RDD created, it can not be changed.
Any operation on an RDD, it will create a new an RDD.
So, enhance performance and maintain consistency in distributed system.

Support for concurrent consumption

RDD designed to support concurrent processing in distributed system
Immutability ensures that data remains consistent across threads.
Eliminating the need for complex synchronization mechanisms and reducing the risk of race conditions.

In-Memory Computing

Immutable data structures eliminate the need for frequent cache invalidation, making it easier to maintain consistency and reliability in a high-performance computing environment.

Lineage and Fault Tolerance

RDDs provide fault tolerance through a lineage graph - a record of the series of transformations that have been applied.
Lineage allows Spark to reconstruct a lost or corrupted RDD by tracing back through its transformation history.
Immutability ensures that the lineage information remains intact and allows Spark to recompute lost data reliably.

Further Exploration: Grokking Concurrency

Graph View

What is RDD?
Why are RDD immutable?
Further Exploration: Grokking Concurrency

Backlinks

2. Spark Introduction
9. Resilient Distributed Datasets (RDDs)

Created with Quartz v4.5.1 © 2025

GitHub