How to make a DataFrame from an RDD in PySpark?

Method 1

  • rdd = sc.parallelize([(1,2,3),(4,5,6),(7,8,9)])
  • df = rdd.toDF(["a","b","c"])

When you create the RDD with the parallelize function, wrap the elements that belong to the same DataFrame row in parentheses, so that each row is a tuple. You can then name the columns with toDF by passing the column names as strings inside square brackets, that is, as a list.

Method 2

  • from pyspark.sql import Row
  • rdd = sc.parallelize([Row(a=1,b=2,c=3),Row(a=4,b=5,c=6),Row(a=7,b=8,c=9)])
  • df = rdd.toDF()

It also works, but I think it is rather verbose. Watch out for the column names in each Row when you create the RDD: they are keyword arguments, so they are written as bare names and not as quoted strings. Once every element in every Row has a name, you can convert the RDD into a DataFrame by calling toDF with no arguments, because the column names are taken from the Rows.

I recommend Method 1 because it is simpler and easier to read.

