How to drop nulls in a dataframe : PySpark

PySpark @ Freshers.in

For most of the data cleansing the first thing that you may need to do drop the nulls in the data set . In Pyspark we have “drop” function in class pyspark.sql.DataFrameNaFunctions. Lets see how to use this 

class pyspark.sql.DataFrameNaFunctions

Syntax : drop(how=’any’, thresh=None, subset=None)

The drop will return new dataframe by discarding the nulls in the rows.

how :- How do you want to drop the row
“any” : Drop a row if it contains any nulls.
“all” : Drop a row only if all its values are null.

thresh :- Value should be an integer int.  Default will be None . If thresh is specified, then the dataframe will drop rows that have less than thresh non-null values. This thresh value will overwrites the “how” parameter. For example if we give df.na.drop(thresh=1) , then this iterates through all the rows and then will keeps each row that has at least 1 non-null values. Which means all rows have at least one non-null value.

subset :- If you give subset , then in a list you can give the column names that need to consider.

Lets give examples for all the above

#Create sample data 
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField
from pyspark.sql.types import IntegerType,StringType,FloatType,BooleanType
veh_data = ([
(1,"Cadillac",2022,28.0,"USA",True),\
(2,"Tesla",2022,80.0,"USA",False),\
(3,"Lincoln",2021,25.5,None,True),\
(None,None,None,None,None,None),\
(4,"Ford",2022,None,"USA",None),\
(5,"Benz",2020,20.0,"USA",False)])
veh_schema=StructType([
StructField("si_no",IntegerType(),True),
StructField("vehicle_make",StringType(),True),
StructField("make_year",IntegerType(),True),
StructField("price",FloatType(),True),
StructField("county",StringType(),True),
StructField("available_statue",BooleanType(),True)])
veh_df = spark.createDataFrame(data=veh_data,schema=veh_schema)
veh_df.show(20,False)
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|1    |Cadillac    |2022     |28.0 |USA   |true            |
|2    |Tesla       |2022     |80.0 |USA   |false           |
|3    |Lincoln     |2021     |25.5 |null  |true            |
|null |null        |null     |null |null  |null            |
|4    |Ford        |2022     |null |USA   |null            |
|5    |Benz        |2020     |20.0 |USA   |false           |
+-----+------------+---------+-----+------+----------------+

We are going to drop all rows having null value

veh_df.na.drop().show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|    1|    Cadillac|     2022| 28.0|   USA|            true|
|    2|       Tesla|     2022| 80.0|   USA|           false|
|    5|        Benz|     2020| 20.0|   USA|           false|
+-----+------------+---------+-----+------+----------------+

We will use how = “any”, if any rows having null will be removed, the result will be same as above.

veh_df.na.drop(how='any').show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|    1|    Cadillac|     2022| 28.0|   USA|            true|
|    2|       Tesla|     2022| 80.0|   USA|           false|
|    5|        Benz|     2020| 20.0|   USA|           false|
+-----+------------+---------+-----+------+----------------+

We need to drop if all the rows are null

veh_df.na.drop(how='all').show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|    1|    Cadillac|     2022| 28.0|   USA|            true|
|    2|       Tesla|     2022| 80.0|   USA|           false|
|    3|     Lincoln|     2021| 25.5|  null|            true|
|    4|        Ford|     2022| null|   USA|            null|
|    5|        Benz|     2020| 20.0|   USA|           false|
+-----+------------+---------+-----+------+----------------+

If we need to drop a row if any particular column contains null. In the below, if “price” column is null , then that rows will be dropped.

veh_df.na.drop(subset=["price"]).show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|    1|    Cadillac|     2022| 28.0|   USA|            true|
|    2|       Tesla|     2022| 80.0|   USA|           false|
|    3|     Lincoln|     2021| 25.5|  null|            true|
|    5|        Benz|     2020| 20.0|   USA|           false|
+-----+------------+---------+-----+------+----------------+

Same as the above, but here we mention two column names.

veh_df.na.drop(subset=["price","county"]).show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|    1|    Cadillac|     2022| 28.0|   USA|            true|
|    2|       Tesla|     2022| 80.0|   USA|           false|
|    5|        Benz|     2020| 20.0|   USA|           false|
+-----+------------+---------+-----+------+----------------+

Thresh : We need to drop if the total number of not null is less that the limit mentioned. If we give thresh=5, then if the row contains less than 5 not null , that row will be 

veh_df.na.drop(thresh=5).show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|    1|    Cadillac|     2022| 28.0|   USA|            true|
|    2|       Tesla|     2022| 80.0|   USA|           false|
|    3|     Lincoln|     2021| 25.5|  null|            true|
|    5|        Benz|     2020| 20.0|   USA|           false|
+-----+------------+---------+-----+------+----------------+
veh_df.na.drop(thresh=6).show()
+-----+------------+---------+-----+------+----------------+
|si_no|vehicle_make|make_year|price|county|available_statue|
+-----+------------+---------+-----+------+----------------+
|    1|    Cadillac|     2022| 28.0|   USA|            true|
|    2|       Tesla|     2022| 80.0|   USA|           false|
|    5|        Benz|     2020| 20.0|   USA|           false|
+-----+------------+---------+-----+------+----------------+

FYI : This kind of things or the inverse can also be accomplished by isNotNull() or/and isNull()

See the below similar posts

How to replace null value for all columns or for each column separately ?

How to replace a value with another value in a column in Pyspark Dataframe ?

Author: user

Leave a Reply