Apache Spark: DataFrame vs. RDD

Sasha Solomon
2 min read · Mar 8, 2017

In Spark code, you may have seen DataFrame and RDD used similarly and wondered “What’s the actual difference between the two?”

While they're used similarly, there are some important differences between DataFrames and RDDs. DataFrames require a schema, and you can think of them as "tables" of data. RDDs are less structured and closer to Scala collections or lists.
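For example, here's the same data as an RDD and as a DataFrame (a minimal sketch; the pokemon values and column names are made up for illustration):

import spark.implicits._

// An RDD is just a distributed collection of objects, with no schema attached
val pokemonRDD = sc.parallelize(Seq(("pikachu", 25), ("bulbasaur", 1)))

// A DataFrame has named, typed columns, like a table
val pokemonDF = Seq(("pikachu", 25), ("bulbasaur", 1)).toDF("name", "id")

pokemonDF.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- id: integer (nullable = false)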

However, the biggest difference between DataFrames and RDDs is that operations on DataFrames are optimizable by Spark, whereas operations on RDDs are imperative: Spark runs the transformations and actions in exactly the order you wrote them.

An RDD (Resilient Distributed Dataset) is a lazily evaluated, distributed collection of data: in effect, a sequence of operations to be executed in a distributed manner.

For example, if we create an RDD and do some operations on it:

val rdd1 = sc.parallelize(data1)  // data1 and data2 are collections of (name, value) pairs
val rdd2 = sc.parallelize(data2)

rdd1
  .join(rdd2)
  .filter { case (name, _) => name == "pikachu" }
  .cache()

Our RDD will actually be a sequence of operations performed in the order specified.
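You can see that lineage for yourself with toDebugString, which prints the chain of transformations Spark has recorded (a minimal sketch, reusing rdd1 and rdd2 from above):

// Print the RDD's lineage: the transformations Spark will run, in order
val result = rdd1
  .join(rdd2)
  .filter { case (name, _) => name == "pikachu" }

println(result.toDebugString)
// Shows the dependency chain built up by parallelize, join, and filter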

A DataFrame is implemented as an RDD under the hood: it also boils down to a list of operations to be executed. The main difference is that this list of operations is optimized.

The operations you choose to perform on a DataFrame are actually run through a query optimizer (Catalyst), which applies a list of rewrite rules to your query, and the data is laid out in a specialized format for CPU and memory efficiency (Tungsten). With both of these, the resulting query plan is highly optimized.

For example, if we create a DataFrame with some operations:

val dataframe1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data1.csv")

val dataframe2 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data2.csv")

dataframe1
  .join(dataframe2, dataframe1("id") === dataframe2("id"))
  .filter("name = 'pikachu'")
  .cache()

Our DataFrame will actually be run through an optimizer, which will create a query plan with the operations rearranged to be more efficient without changing the results.
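You can inspect that plan with explain (a minimal sketch, continuing the example above; passing true also prints the logical plans):

// Print the plans Catalyst produced for the query above
dataframe1
  .join(dataframe2, dataframe1("id") === dataframe2("id"))
  .filter("name = 'pikachu'")
  .explain(true)
// Output includes the parsed, analyzed, and optimized logical plans,
// plus the physical plan Spark will actually execute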

In general, DataFrames should be used over RDDs because they are highly optimized. You might use an RDD instead of a DataFrame (e.g. via dataframe.rdd) if you need control over exactly how the operations are executed rather than leaving that to the query planner.
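Dropping down to the RDD API might look like this (a minimal sketch; it gives you back an RDD of Row objects):

// Convert the DataFrame into an RDD of Rows for low-level, imperative control
val rowRDD = dataframe1.rdd

rowRDD
  .map(row => row.getAs[String]("name"))
  .filter(_ == "pikachu")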

Perhaps the most common reason RDDs are used in older code is because DataFrames are relatively new (April 2016). If this is the case, switching to DataFrames may be quite beneficial!
