PySpark : Casting the data type of a series to a specified type

Understanding Series.astype(dtype)

The Series.astype(dtype) method in Pandas-on-Spark casts a series to a specified data type (dtype). This is useful in data processing tasks where data types need to be made consistent or transformed for further analysis.

Syntax:

Series.astype(dtype)

Where:

  • dtype: The data type to which the series will be cast. This can be a Python type (for example int or float), a NumPy dtype, or a string alias such as 'int64' or 'category' (see the sketch below).
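
For instance, the same series can typically be cast using any of these dtype forms. The following is a minimal sketch; the Series name psser and its sample values are illustrative assumptions, not part of the examples below:

import numpy as np
import pyspark.pandas as ps

# A hypothetical Pandas-on-Spark Series holding numbers as strings
psser = ps.Series(['1', '2', '3'])

psser.astype(int)         # Python built-in type, cast to int64
psser.astype(np.float64)  # NumPy dtype, cast to float64
psser.astype('int32')     # string alias, cast to int32
psser.astype('category')  # pandas extension dtype name, cast to category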

Examples:

Let’s dive into some examples to understand how Series.astype(dtype) works in practice.

Casting Series to Numeric Data Type

Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the float data type.

# Importing necessary libraries
from pyspark.sql import SparkSession
import pyspark.pandas as ps
import pandas as pd

# Creating a SparkSession
spark = SparkSession.builder \
    .appName("Pandas-on-Spark @ Freshers.in") \
    .getOrCreate()

# Creating a Pandas DataFrame
data = {'numbers': ['10.5', '20.7', '30.9', '40.2']}
pdf = pd.DataFrame(data)

# Converting the Pandas DataFrame to a Pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)

# Casting the 'numbers' column to the float data type
psdf['numbers'] = psdf['numbers'].astype(float)

# Displaying the result
print(psdf)

Output:

   numbers
0     10.5
1     20.7
2     30.9
3     40.2
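
To confirm that the cast actually changed the type, you can inspect the dtype of the series or DataFrame. The following is a minimal, self-contained sketch that rebuilds the same data with pyspark.pandas; the variable name psdf mirrors the example above:

import pyspark.pandas as ps

psdf = ps.DataFrame({'numbers': ['10.5', '20.7', '30.9', '40.2']})
psdf['numbers'] = psdf['numbers'].astype(float)

# The Series dtype now reflects the cast
print(psdf['numbers'].dtype)   # float64
print(psdf.dtypes)             # dtypes of every column in the DataFrame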

Casting Series to Categorical Data Type

Suppose we have a Pandas-on-Spark series containing categorical data, and we want to convert it to the category data type.

# Creating a Pandas DataFrame with categorical data
data = {'categories': ['A', 'B', 'C', 'A', 'B', 'C']}
pdf = pd.DataFrame(data)

# Converting the Pandas DataFrame to a Pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)

# Casting the 'categories' column to the category data type
psdf['categories'] = psdf['categories'].astype('category')

# Displaying the result
print(psdf)

Output:

   categories
0            A
1            B
2            C
3            A
4            B
5            C
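
Once a series has the category data type, Pandas-on-Spark exposes the categorical accessor, so you can look at the distinct categories and the integer codes that back them. A short sketch, assuming the same sample values as above (the Series name psser is illustrative):

import pyspark.pandas as ps

psser = ps.Series(['A', 'B', 'C', 'A', 'B', 'C']).astype('category')

# Categories are stored once; each row is represented by an integer code
print(psser.cat.categories)        # expected: Index(['A', 'B', 'C'], dtype='object')
print(psser.cat.codes.to_numpy())  # expected: [0 1 2 0 1 2]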

Casting Series to Integer Data Type

Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the integer data type.

# Creating a Pandas DataFrame with numerical data in string format
data = {'numbers': ['10', '20', '30', '40']}
pdf = pd.DataFrame(data)

# Converting the Pandas DataFrame to a Pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)

# Casting the 'numbers' column to the integer data type
psdf['numbers'] = psdf['numbers'].astype(int)

# Displaying the result
print(psdf)

Output:

   numbers
0       10
1       20
2       30
3       40
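
Note that astype(int) casts to a 64-bit integer. If a narrower integer type is needed, a NumPy dtype or string alias can be passed instead. A minimal sketch with an illustrative Series name psser:

import numpy as np
import pyspark.pandas as ps

psser = ps.Series(['10', '20', '30', '40'])

print(psser.astype(int).dtype)       # int64
print(psser.astype(np.int32).dtype)  # int32
print(psser.astype('int8').dtype)    # int8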