What is the difference between concat and concat_ws in PySpark?


concat vs concat_ws

Syntax:
pyspark.sql.functions.concat(*cols)
pyspark.sql.functions.concat_ws(sep, *cols)

concat : concatenates multiple input columns into a single column. It works with string, binary, and compatible array columns.
concat_ws : concatenates multiple input string columns into a single string column, using the given separator.
concat returns null when any of its input columns is null, so the values from the other columns are lost for that row.
concat_ws skips null columns and joins the remaining non-null values with the separator.
In short, in most scenarios concat_ws is recommended over concat.

# Sample code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import concat, concat_ws, col, lit

spark = SparkSession.builder.appName("concat_vs_concat_ws").getOrCreate()

car_emp = [
    ("Finance", ("Tom", "Jack", "Mark"), 101),
    ("Technology", ("Barry", "Mike", "Jacob"), None),
    ("Human_Resource", ("Tim", "", "More"), 301),
    ("Administrative", ("Edisson", "", "Kevin"), None),
    ("Contractors", ("Trunp", "Mark", "Elvin"), 501),
]
car_emp_schema = StructType([
    StructField("dept", StringType(), True),
    StructField("name", StructType([
        StructField("first_name", StringType(), True),
        StructField("middle_name", StringType(), True),
        StructField("last_name", StringType(), True)])),
    StructField("dept_id", IntegerType(), True),
])
car_emp_df_1 = spark.createDataFrame(data=car_emp, schema=car_emp_schema)
car_emp_df_1.printSchema()
car_emp_df_1.show(20,False)
"""
Note: Here dept_id is null for the second and fourth rows.
+--------------+--------------------+-------+
|dept          |name                |dept_id|
+--------------+--------------------+-------+
|Finance       |[Tom, Jack, Mark]   |101    |
|Technology    |[Barry, Mike, Jacob]|null   |
|Human_Resource|[Tim, , More]       |301    |
|Administrative|[Edisson, , Kevin]  |null   |
|Contractors   |[Trunp, Mark, Elvin]|501    |
+--------------+--------------------+-------+
"""
car_emp_df_1.withColumn("dept_with_id",concat(car_emp_df_1.dept,car_emp_df_1.dept_id)).show(20,False)
"""
Using concat, the result is null for the second and fourth rows because dept_id is null there; no concatenation happens for those rows.
+--------------+--------------------+-------+-----------------+
|dept          |name                |dept_id|dept_with_id     |
+--------------+--------------------+-------+-----------------+
|Finance       |[Tom, Jack, Mark]   |101    |Finance101       |
|Technology    |[Barry, Mike, Jacob]|null   |null             |
|Human_Resource|[Tim, , More]       |301    |Human_Resource301|
|Administrative|[Edisson, , Kevin]  |null   |null             |
|Contractors   |[Trunp, Mark, Elvin]|501    |Contractors501   |
+--------------+--------------------+-------+-----------------+
"""
car_emp_df_1.withColumn("dept_with_id",concat_ws("_",car_emp_df_1.dept,car_emp_df_1.dept_id)).show(20,False)
"""
Using concat_ws, the non-null column values are joined. When dept_id is null, it returns dept alone, which is still meaningful.
+--------------+--------------------+-------+------------------+
|dept          |name                |dept_id|dept_with_id      |
+--------------+--------------------+-------+------------------+
|Finance       |[Tom, Jack, Mark]   |101    |Finance_101       |
|Technology    |[Barry, Mike, Jacob]|null   |Technology        |
|Human_Resource|[Tim, , More]       |301    |Human_Resource_301|
|Administrative|[Edisson, , Kevin]  |null   |Administrative    |
|Contractors   |[Trunp, Mark, Elvin]|501    |Contractors_501   |
+--------------+--------------------+-------+------------------+
"""
