PySpark: Inserting a row into an Apache Spark DataFrame


In PySpark, you can insert a row into a DataFrame by first converting the DataFrame to an RDD (Resilient Distributed Dataset), then adding the new row to the RDD, and finally converting the RDD back to a DataFrame.

Here is an example:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Insert Row").getOrCreate()
# Create a DataFrame
df = spark.createDataFrame([
    (1, "Michele John", 25),
    (2, "Barry Berner", 30),
    (3, "Jacob Jim", 35)],
    ["id", "name", "age"])
df.show()
Input data:
+---+------------+---+
| id|        name|age|
+---+------------+---+
|  1|Michele John| 25|
|  2|Barry Berner| 30|
|  3|   Jacob Jim| 35|
+---+------------+---+
# Convert the DataFrame to an RDD
rdd = df.rdd
# Add the new row to the RDD
new_row = (4, "Elaine Berer Lee", 22)
rdd = rdd.union(spark.sparkContext.parallelize([new_row]))
# Convert the RDD back to a DataFrame
df = spark.createDataFrame(rdd, df.schema)
# Show the DataFrame
df.show()
Result:
+---+----------------+---+
| id|            name|age|
+---+----------------+---+
|  1|    Michele John| 25|
|  2|    Barry Berner| 30|
|  3|       Jacob Jim| 35|
|  4|Elaine Berer Lee| 22|
+---+----------------+---+

This code creates a DataFrame with three rows and three columns, then converts it to an RDD. It then creates a tuple with the values for the new row and adds it to the RDD using the union() method. Finally, it converts the RDD back to a DataFrame using the same schema as the original and shows the result. The new row is appended at the bottom of the DataFrame.

It’s worth noting that this round trip through an RDD is not efficient for large DataFrames. If you need to insert many rows, it is better to build a DataFrame of the new rows and append it with the DataFrame API or Spark SQL.
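For example, here is a minimal sketch of the DataFrame API approach using union(). It reuses the spark session and df from above; the new name and values are purely illustrative:

# Build a one-row DataFrame with the same schema as the original
new_rows = spark.createDataFrame([(5, "Sam Peter", 28)], df.schema)
# Append it directly; no RDD conversion is needed
df = df.union(new_rows)
df.show()

Because union() works on DataFrames directly, Spark can plan the append without converting every existing row to and from RDD form.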
