How to add a new column in PySpark using withColumn


withColumn

Syntax:
DataFrame.withColumn(colName, col)

withColumn is commonly used to add a column to an existing DataFrame. It returns a new DataFrame by adding a column, or by replacing an existing column that has the same name. The column expression must be an expression over this DataFrame; attempting to reference a column from some other DataFrame raises an error. Internally, withColumn introduces a projection, so calling it many times to add multiple columns can generate large query plans and cause performance issues. Avoid calling it inside a loop, especially a large one; instead, build all the derived columns in a single select.
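As a minimal sketch of that advice (using a hypothetical DataFrame df and made-up column names), the snippet below derives several columns in one select call instead of chaining withColumn inside a loop:

#Avoiding repeated withColumn calls
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
spark = SparkSession.builder.appName("withcolumn_demo").getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 20)], ["id", "amount"])
#Anti-pattern: every withColumn call adds another projection to the plan
looped = df
for i in range(3):
    looped = looped.withColumn("flag_" + str(i), lit(i))
#Preferred: build all derived columns in a single select
single_projection = df.select(
    "*",
    (col("amount") * 2).alias("amount_doubled"),
    lit("batch_001").alias("batch_no"),
)
single_projection.show()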

#Sample Code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import current_date, concat_ws, col, lit

spark = SparkSession.builder.appName("withColumn example").getOrCreate()

emp_data = [
    ("Finance", ("Sam", "James", "John"), 101),
    ("Technology", ("Michael", "David", "William"), 201),
    ("Human_Resource", ("Richard", "", "John"), 301),
    ("Administrative", ("Thomas", "Christopher", "Daniel"), 401),
    ("Contractors", ("Donald", "Kenneth", "Kevin"), 501),
]
emp_data_schema = StructType([
    StructField("dept", StringType(), True),
    StructField("name", StructType([
        StructField("first_name", StringType(), True),
        StructField("middle_name", StringType(), True),
        StructField("last_name", StringType(), True)])),
    StructField("dept_id", IntegerType(), True),
])
emp_df_1 = spark.createDataFrame(data=emp_data, schema=emp_data_schema)
emp_df_1.printSchema()
"""
root
 |-- dept: string (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- middle_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |-- dept_id: integer (nullable = true)
"""
emp_df_1.show(20,False)
"""
+--------------+-----------------------------+-------+
|dept          |name                         |dept_id|
+--------------+-----------------------------+-------+
|Finance       |[Sam, James, John]           |101    |
|Technology    |[Michael, David, William]    |201    |
|Human_Resource|[Richard, , John]            |301    |
|Administrative|[Thomas, Christopher, Daniel]|401    |
|Contractors   |[Donald, Kenneth, Kevin]     |501    |
+--------------+-----------------------------+-------+
"""
#Adding the current date as a new column and concatenating first name and last name
emp_df_2 = emp_df_1.withColumn("processed_date", current_date()) \
                   .withColumn("emp_full_name", concat_ws(" ", emp_df_1.name.first_name, emp_df_1.name.last_name))
emp_df_2.show(20,False)
"""
+--------------+-----------------------------+-------+--------------+---------------+
|dept          |name                         |dept_id|processed_date|emp_full_name  |
+--------------+-----------------------------+-------+--------------+---------------+
|Finance       |[Sam, James, John]           |101    |2022-02-16    |Sam John       |
|Technology    |[Michael, David, William]    |201    |2022-02-16    |Michael William|
|Human_Resource|[Richard, , John]            |301    |2022-02-16    |Richard John   |
|Administrative|[Thomas, Christopher, Daniel]|401    |2022-02-16    |Thomas Daniel  |
|Contractors   |[Donald, Kenneth, Kevin]     |501    |2022-02-16    |Donald Kevin   |
+--------------+-----------------------------+-------+--------------+---------------+
"""
#Concatenating two column values with a separator and adding a constant value as a new column
emp_df_3 = emp_df_2.withColumn("dept_with_id", concat_ws("_", emp_df_2.dept, emp_df_2.dept_id)) \
                   .withColumn("batch_no", lit("batch_001"))
emp_df_3.show(20,False)
"""
+--------------+-----------------------------+-------+--------------+---------------+------------------+---------+
|dept          |name                         |dept_id|processed_date|emp_full_name  |dept_with_id      |batch_no |
+--------------+-----------------------------+-------+--------------+---------------+------------------+---------+
|Finance       |[Sam, James, John]           |101    |2022-02-16    |Sam John       |Finance_101       |batch_001|
|Technology    |[Michael, David, William]    |201    |2022-02-16    |Michael William|Technology_201    |batch_001|
|Human_Resource|[Richard, , John]            |301    |2022-02-16    |Richard John   |Human_Resource_301|batch_001|
|Administrative|[Thomas, Christopher, Daniel]|401    |2022-02-16    |Thomas Daniel  |Administrative_401|batch_001|
|Contractors   |[Donald, Kenneth, Kevin]     |501    |2022-02-16    |Donald Kevin   |Contractors_501   |batch_001|
+--------------+-----------------------------+-------+--------------+---------------+------------------+---------+
"""
