Overview

To perform work in parallel, Spark splits the data into chunks called partitions. A partition is a collection of rows that sits on one physical machine in the cluster.
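The idea can be sketched in plain Python (this simulates the concept; it is not Spark's API): a dataset is split into chunks, and each chunk stands in for a partition that would live on one machine.

```python
# Minimal sketch: split rows round-robin into a fixed number of "partitions".
def partition(rows, num_partitions):
    chunks = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        chunks[i % num_partitions].append(row)
    return chunks

rows = list(range(10))
parts = partition(rows, 3)
# Each inner list stands in for a partition living on one machine.
```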

The core data structures are immutable: Spark does not carry out any transformation until we call an action. Each transformation returns a new object and does not modify the previous one.
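A tiny sketch of this lazy, immutable behavior (plain Python, not Spark's actual classes): transformations only record a plan on a new object; the action is what finally executes it.

```python
# Sketch of lazy evaluation: transformations build a plan, actions run it.
class LazyDataset:
    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan              # recorded transformations, not yet run

    def map(self, fn):                 # transformation: returns a NEW object
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, pred):            # transformation: previous object untouched
        return LazyDataset(self._data, self._plan + (("filter", pred),))

    def collect(self):                 # action: now the plan actually executes
        out = self._data
        for op, fn in self._plan:
            out = [fn(x) for x in out] if op == "map" else [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4])
ds2 = ds.map(lambda x: x * 10)         # nothing computed yet
result = ds2.collect()                 # action triggers the work
```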

There are two types of transformations:

  • narrow dependencies: each input partition contributes to only one output partition
  • wide dependencies: input partitions contribute to many output partitions (a shuffle)
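The difference between the two dependency types can be simulated in plain Python (a conceptual sketch, not Spark code): a narrow transformation runs on each partition independently, while a wide one must regroup rows across all partitions.

```python
partitions = [[1, 2], [3, 4], [5, 6]]

# Narrow: output partition i depends only on input partition i,
# so each partition can be transformed independently.
narrow = [[x * 2 for x in p] for p in partitions]

# Wide: a groupBy-style regrouping by key (here, parity); every output
# group may need rows from every input partition -- this is the shuffle.
wide = {}
for p in partitions:
    for x in p:
        wide.setdefault(x % 2, []).append(x)
```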

Shuffle

A shuffle is the process whereby Spark exchanges partitions across the cluster.
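The exchange step can be sketched as hash partitioning (plain Python, an illustrative simplification of what Spark does): each row is routed to a target partition by hashing its key, so rows with the same key land together.

```python
# Sketch: route every row to an output partition by key hash.
def shuffle(partitions, num_out, key=lambda row: row):
    out = [[] for _ in range(num_out)]
    for p in partitions:
        for row in p:
            out[hash(key(row)) % num_out].append(row)
    return out

shuffled = shuffle([[1, 2], [3, 4], [5, 6]], 2)
# All rows survive the exchange; equal keys end up in the same partition.
```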

How do narrow and wide transformations work?

  • For narrow transformations, Spark automatically performs an operation called pipelining: a sequence of narrow transformations is fused and executed in-memory in a single pass.
  • For wide transformations, Spark needs to write intermediate results to disk.
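Pipelining can be sketched in plain Python (conceptual only, not Spark internals): several narrow steps are applied to each row in one in-memory pass, with no intermediate collection materialized between them.

```python
# Sketch: fuse a chain of narrow transformations into one pass per row.
def pipelined(partition, fns):
    out = []
    for x in partition:
        for fn in fns:          # all narrow steps applied to the row at once
            x = fn(x)
        out.append(x)
    return out

result = pipelined([1, 2, 3], [lambda x: x + 1, lambda x: x * 10])
```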

In Spark, reading data is also a transformation, and is therefore a lazy operation.
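A Python generator gives a rough analogue of this lazy read (a sketch, not Spark's reader): defining the read does no work; the file is touched only when something consumes rows.

```python
import tempfile, os

# Sketch of a lazy read: the generator body (including open) runs only
# once iteration begins, not when read_rows() is called.
def read_rows(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# Demo with a temporary file standing in for a real dataset.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("a\nb\n")

rows = read_rows(path)      # lazy: the file has not been opened yet
collected = list(rows)      # "action": iteration actually reads the file
os.remove(path)
```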

Sort

Sort does not modify the DataFrame; it is a transformation that returns a new DataFrame by transforming the previous one. Let's illustrate what happens when we call take on that resulting DataFrame.
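The flow can be mirrored in plain Python (an analogy, not the DataFrame API): `sorted()` returns a new sequence and leaves the original untouched, just as sort returns a new DataFrame without modifying its parent, and slicing plays the role of the `take` action that materializes the first rows.

```python
df = [3, 1, 2]
sorted_df = sorted(df)       # "transformation": new object, df unchanged
taken = sorted_df[:2]        # "take(2)": materialize the first two rows
```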

Spark uses an engine called Catalyst that maintains its own type information through the planning and processing of work. Spark types map directly to the different language APIs that Spark maintains and there exists a lookup table for each of these in Scala, Java, Python, SQL, and R.

References