Spark Datasets vs DataFrames vs RDDs:


You may have asked yourself why you should use Datasets rather than the foundation of all Spark: RDDs of case classes.

This document collects the advantages of Dataset over RDD[CaseClass] to answer the question Dan asked on Twitter:

"In #Spark, what is the advantage of a DataSet over an RDD[CaseClass]?"

Loading from or Saving to Data Sources:

With the Dataset API, loading data from a data source or saving it to one is as simple as using the SparkSession.read or Dataset.write methods, respectively.
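As a rough illustration of the read and write entry points, here is a minimal sketch that saves a small Dataset as Parquet and loads it back. The `Person` case class, the `ReadWriteSketch` object, the sample rows and the temporary directory are all made up for this example, and it assumes Spark SQL is on the classpath and runs in local mode.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}

object ReadWriteSketch {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Long)

  // Writes a small Dataset as Parquet and reads it back as a typed Dataset.
  def roundTrip(spark: SparkSession, dir: String): Dataset[Person] = {
    import spark.implicits._ // brings encoders and toDS() into scope

    val people = Seq(Person("Ann", 34L), Person("Bob", 27L)).toDS()

    // Dataset.write: save to a data source (Parquet here).
    people.write.mode(SaveMode.Overwrite).parquet(dir)

    // SparkSession.read returns a DataFrame; as[Person] turns it into a Dataset[Person].
    spark.read.parquet(dir).as[Person]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("read-write-sketch")
      .getOrCreate()
    // Assumed scratch location; any writable path would do.
    val dir = Files.createTempDirectory("people").resolve("parquet").toString
    roundTrip(spark, dir).show()
    spark.stop()
  }
}
```

Swapping `parquet` for `json` or `csv` on both sides should work the same way, although CSV needs extra options (such as a header) to map columns back onto the case class.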

Accessing Fields / Columns:

You select columns in a Dataset by name, without worrying about the positions of the columns.

With an RDD, you have to take an additional hop through the case class instance and access its fields by name.
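To make the difference concrete, the sketch below pulls the same field out of a Dataset and out of an RDD. `Person`, `FieldAccessSketch` and the sample rows are hypothetical names chosen for the example.

```scala
import org.apache.spark.sql.SparkSession

object FieldAccessSketch {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Long)

  private val people = Seq(Person("Ann", 34L), Person("Bob", 27L))

  // Dataset: select a column by name; as[String] keeps the result typed.
  def namesViaDataset(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    people.toDS().select($"name").as[String].collect().toSeq
  }

  // RDD: map over the case class objects and read the field from each one.
  def namesViaRdd(spark: SparkSession): Seq[String] =
    spark.sparkContext.parallelize(people).map(_.name).collect().toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("field-access-sketch")
      .getOrCreate()
    println(namesViaDataset(spark))
    println(namesViaRdd(spark))
    spark.stop()
  }
}
```

Because the Dataset version names the column in an expression, Spark knows which field the query needs, while the RDD lambda is opaque to it.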

Before starting the comparison between Spark RDD, DataFrame and Dataset, let us look at what each of them is:

  • Spark RDD APIs: RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and the fundamental data structure of Spark. It allows a programmer to perform in-memory computations on large clusters in a fault-tolerant manner, which speeds up the task.
  • Spark DataFrame APIs: Unlike an RDD, the data is organized into named columns, like a table in a relational database. It is an immutable distributed collection of data. A DataFrame allows developers to impose a structure onto a distributed collection of data, which gives a higher-level abstraction.
  • Spark Dataset APIs: Datasets in Apache Spark are an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. A Dataset takes advantage of Spark's Catalyst optimizer by exposing expressions and data fields to the query planner.
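The three abstractions can hold the same data, and moving between them is a one-liner. The following sketch builds all three from one collection. `Person` and `ThreeApisSketch` are hypothetical names and the rows are invented.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object ThreeApisSketch {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Long)

  def build(spark: SparkSession): (RDD[Person], DataFrame, Dataset[Person]) = {
    import spark.implicits._

    // RDD: a distributed collection of plain JVM objects.
    val rdd: RDD[Person] =
      spark.sparkContext.parallelize(Seq(Person("Ann", 34L), Person("Bob", 27L)))

    // DataFrame: the same data as rows with named columns (name, age).
    val df: DataFrame = rdd.toDF()

    // Dataset: named columns plus the compile-time type Person.
    val ds: Dataset[Person] = df.as[Person]

    (rdd, df, ds)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("three-apis-sketch")
      .getOrCreate()
    val (_, df, ds) = build(spark)
    df.printSchema()
    ds.show()
    spark.stop()
  }
}
```

Going the other way, `ds.rdd` gives back an `RDD[Person]` and `ds.toDF()` drops the static type again.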

Data Representation:

  • RDD: An RDD is a distributed collection of data elements spread across many machines in the cluster. RDDs are a set of Java or Scala objects representing data.
  • DataFrame: A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database.
  • DataSet: It is an extension of the DataFrame API that combines the type-safe, object-oriented programming interface of the RDD API with the performance benefits of the Catalyst query optimizer and the off-heap storage mechanism of the DataFrame API.

Data Formats:

  • RDD: It can easily and efficiently process structured as well as unstructured data. But unlike DataFrames and Datasets, an RDD does not infer the schema of the ingested data and requires the user to specify it.
  • DataFrame: It works only on structured and semi-structured data. It organizes the data into named columns and lets Spark manage the schema.
  • DataSet: It also efficiently processes structured and unstructured data. It represents data as JVM objects, either a single row or a collection of row objects, which encoders map to a tabular form.

Data Sources API:

  • RDD: An RDD can come from any data source, e.g. a text file or a database via JDBC, and it easily handles data with no predefined structure.
  • DataFrame: The data source API allows data processing in different formats (AVRO, CSV, JSON) and storage systems (HDFS, Hive tables, MySQL). It can read from and write to all of these data sources.
  • DataSet: The Dataset API of Spark also supports data from different sources.

Optimization:

  • RDD: RDDs have no built-in optimization engine. When working with structured data, RDDs cannot take advantage of Spark's advanced optimizers, such as the Catalyst optimizer and the Tungsten execution engine. Developers have to optimize each RDD themselves, based on its attributes.
  • DataFrame: Optimization takes place using the Catalyst optimizer. DataFrames use the Catalyst tree transformation framework in four phases:

a) Analyzing a logical plan to resolve references.

b) Logical plan optimization.

c) Physical planning.

d) Code generation to compile parts of the query to Java byte code.
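One way to watch these phases is to ask Spark for the plans of a query. The sketch below, with a hypothetical `Person` type and `CatalystPlanSketch` object, builds a small filter-and-select query and prints its plans with `explain(true)`, which shows the parsed, analyzed and optimized logical plans followed by the physical plan.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CatalystPlanSketch {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Long)

  // A small query for Catalyst to analyze, optimize and plan.
  def adults(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(Person("Ann", 34L), Person("Bob", 27L)).toDS()
      .filter($"age" > 30)
      .select($"name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalyst-plan-sketch")
      .getOrCreate()

    val query = adults(spark)

    // Prints the logical plans (parsed, analyzed, optimized) and the physical plan.
    query.explain(true)

    // The same plans are also reachable as objects through queryExecution.
    println(query.queryExecution.optimizedPlan)

    spark.stop()
  }
}
```

The exact text of the plans differs between Spark versions, so treat the printed output as something to read, not something to match against.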

Serialization:

  • RDD: Whenever Spark needs to distribute the data within the cluster or write the data to disk, it does so using Java serialization by default. Serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes.
  • DataFrame: A Spark DataFrame can serialize the data into off-heap storage (in memory) in binary format and then perform many transformations directly on this off-heap memory, because Spark understands the schema. There is no need to use Java serialization to encode the data. It relies on the Tungsten physical execution backend, which explicitly manages memory and dynamically generates bytecode for expression evaluation.
  • DataSet: When it comes to serializing data, the Dataset API has the concept of an encoder, which handles conversion between JVM objects and the tabular representation. The tabular representation is stored in Spark's internal Tungsten binary format. A Dataset can operate on serialized data, which improves memory use, and it allows on-demand access to an individual attribute without deserializing the entire object.
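To see what an encoder knows about a class, the sketch below derives one for a hypothetical `Person` case class and prints the tabular schema it maps to. `EncoderSketch` is an illustrative name, and no SparkSession is needed for this part.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.types.StructType

object EncoderSketch {
  // Hypothetical record type, used only for this illustration.
  case class Person(name: String, age: Long)

  // An encoder describes how Person objects map to Spark's tabular format.
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  // The schema derived from the case class fields.
  val personSchema: StructType = personEncoder.schema

  def main(args: Array[String]): Unit =
    personSchema.printTreeString()
}
```

In everyday code the encoder is usually picked up implicitly through `import spark.implicits._`; building it explicitly here only makes the mapping visible.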

Garbage Collection:

  • RDD: Creating and destroying individual objects adds garbage collection overhead.
  • DataFrame: It avoids the garbage collection cost of constructing an individual object for each row in the dataset.
  • DataSet: There is also no need for the garbage collector to destroy objects, because serialization takes place through Tungsten, which uses off-heap data serialization.

Schema Projection:

  • RDD: With the RDD API, schema projection is explicit, so we need to define the schema manually.
  • DataFrame: The schema is auto-discovered from the files and exposed as tables through the Hive metastore. This lets standard SQL clients connect to the engine and explore the dataset without defining the schema of the files.
  • DataSet: The schema of the files is also auto-discovered, because a Dataset uses the Spark SQL engine.
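The contrast between a discovered and a hand-written schema might look like the sketch below. It infers a schema from two JSON strings and, separately, attaches an explicit `StructType` to an RDD of rows. `SchemaSketch`, the column names and the sample values are assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object SchemaSketch {
  // DataFrame: Spark infers column names and types from the JSON itself.
  def inferred(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val jsonLines = Seq(
      """{"name":"Ann","age":34}""",
      """{"name":"Bob","age":27}"""
    ).toDS()
    spark.read.json(jsonLines)
  }

  // RDD: rows carry no schema, so it has to be spelled out by hand.
  def explicit(spark: SparkSession): DataFrame = {
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", LongType, nullable = false)
    ))
    val rows = spark.sparkContext.parallelize(Seq(Row("Ann", 34L), Row("Bob", 27L)))
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("schema-sketch")
      .getOrCreate()
    inferred(spark).printSchema()
    explicit(spark).printSchema()
    spark.stop()
  }
}
```

Inference costs a pass over the data, so for large inputs it can still be worth supplying the schema explicitly through `spark.read.schema(...)`.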

Aggregation:

  • RDD: The RDD API is slower at simple grouping and aggregation operations.
  • DataFrame: The DataFrame API is very easy to use. It is faster for exploratory analysis and for creating aggregated statistics on large data sets.
  • DataSet: With a Dataset, aggregation operations on large data sets are also faster.
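For comparison, here is the same per-key sum written against the RDD API and against the Dataset/DataFrame API. The `Sale` case class, the `AggregationSketch` object and the figures are invented for the example, and it makes no attempt to measure speed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object AggregationSketch {
  // Hypothetical record type and sample data, used only for this illustration.
  case class Sale(region: String, amount: Long)

  private val sales = Seq(Sale("east", 10L), Sale("west", 5L), Sale("east", 7L))

  // RDD: build key/value pairs by hand, then reduce per key.
  def totalsViaRdd(spark: SparkSession): Map[String, Long] =
    spark.sparkContext.parallelize(sales)
      .map(s => (s.region, s.amount))
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // Dataset/DataFrame: declare the grouping and let Catalyst plan the aggregation.
  def totalsViaDataset(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._
    sales.toDS()
      .groupBy($"region")
      .agg(sum($"amount").as("total"))
      .as[(String, Long)]
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregation-sketch")
      .getOrCreate()
    println(totalsViaRdd(spark))
    println(totalsViaDataset(spark))
    spark.stop()
  }
}
```

Both versions return the same totals; the second one expresses the aggregation as column expressions, which is what lets the optimizer work on it.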

Conclusion

The comparison between RDD, DataFrame and Dataset makes it clear when to use each of them. RDD offers low-level functionality and control. DataFrame and Dataset allow a custom view and structure; they offer high-level domain-specific operations, save space, and execute at high speed. Pick whichever of the RDD, DataFrame or Dataset APIs meets your needs and play with Spark.

Note: Congrats! You just learned how to use RDD/DF/DS in your Spark code. If you liked this, please follow me, Upendra Nallabolu.
