Difference between RDD, DF and DS in Spark

Knoldus Blogs

In this blog I try to cover the difference between RDD, DataFrame (DF) and Dataset (DS). Many of you may be a little confused about the three, but don't worry: after this blog everything will be clear.

With the Spark 2.0 release, Spark officially provides three types of data abstraction: RDD, DataFrame and Dataset.

So let's start the discussion.

Resilient Distributed Datasets (RDDs) – An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
With an RDD, we can perform operations on data in parallel across the different nodes of the same cluster, which helps increase performance.

How we can create an RDD

The Spark context (sc) is used to create RDDs in Spark. It can create an RDD from:

  1. external storage system like HDFS, HBase, or any data source offering a Hadoop InputFormat.
  2. parallelizing an…
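To make the two creation routes a bit more concrete, here is a minimal sketch in Scala. It is only an illustration under some assumptions: the second list item is cut off in this excerpt, so the sketch assumes it refers to `SparkContext.parallelize`; the object name `RddCreationSketch`, the helper `doubledViaRdd`, the `local[*]` master and the optional file path argument are all hypothetical and not taken from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical object name, used only for this illustration.
object RddCreationSketch {

  // Builds an RDD from an in-memory collection with parallelize,
  // doubles every element in parallel, and collects the result to the driver.
  def doubledViaRdd(sc: SparkContext, xs: Seq[Int]): Seq[Int] = {
    val numbers: RDD[Int] = sc.parallelize(xs)
    numbers.map(_ * 2).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally using all available cores.
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Route assumed for item 2: parallelizing an in-memory collection.
      println(doubledViaRdd(sc, Seq(1, 2, 3, 4, 5)).mkString(", "))

      // Route 1: an external storage system. The path is supplied by the
      // caller (for example an hdfs:// URI); nothing is read if it is absent.
      args.headOption.foreach { path =>
        val lines: RDD[String] = sc.textFile(path)
        println(s"Line count: ${lines.count()}")
      }
    } finally {
      sc.stop() // always release the Spark context
    }
  }
}
```

Both routes return an ordinary RDD, so the same transformations (`map`, `filter` and so on) apply no matter where the data came from.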
