PySpark: Inserting a row into an Apache Spark DataFrame

user January 29, 2023

Spark DataFrames are immutable, so "inserting" a row really means building a new DataFrame that contains the extra row. One way to do this in PySpark is to convert the DataFrame to an RDD (Resilient Distributed Dataset), add the new row to the RDD, and then convert the RDD back to a DataFrame.

Here is an example:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Insert Row").getOrCreate()
# Create a DataFrame
df = spark.createDataFrame([
(1, "Michele John", 25), 
(2, "Barry Berner", 30), 
(3, "Jacob Jim", 35)], 
["id", "name", "age"])
df.show()

Input data

+---+------------+---+
| id|        name|age|
+---+------------+---+
|  1|Michele John| 25|
|  2|Barry Berner| 30|
|  3|   Jacob Jim| 35|
+---+------------+---+

# Convert the DataFrame to an RDD
rdd = df.rdd
# Add the new row to the RDD
new_row = (4, "Elaine Berer Lee", 22)
rdd = rdd.union(spark.sparkContext.parallelize([new_row]))
# Convert the RDD back to a DataFrame
df = spark.createDataFrame(rdd, df.schema)
# Show the DataFrame
df.show()

Result

+---+----------------+---+
| id|            name|age|
+---+----------------+---+
|  1|    Michele John| 25|
|  2|    Barry Berner| 30|
|  3|       Jacob Jim| 35|
|  4|Elaine Berer Lee| 22|
+---+----------------+---+

This code creates a DataFrame with three rows and three columns, then converts it to an RDD. It then creates a tuple with the values for the new row and adds it to the RDD using the union() method. Finally, it converts the RDD back to a DataFrame using the same schema as the original DataFrame, and shows the result. The values in the tuple must be in the same order as the columns in the schema. In this example the new row appears at the bottom of the resulting DataFrame.

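It’s worth noting that this method is not efficient for large DataFrames, because converting to an RDD and back moves every row through Python objects. If you need to insert a large number of rows, it’s better to stay in the DataFrame API or Spark SQL: build a second DataFrame from the new rows and combine it with the original using union().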


Author: user
