PySpark: Casting the data type of a series to a specified type


Understanding Series.astype(dtype)

The Series.astype(dtype) method in Pandas-on-Spark casts a series to a specified data type (dtype) and returns the result as a new series. This is useful when column types must be made consistent or converted before further analysis, for example when numeric values arrive as strings.




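  • dtype: The data type to which the series will be cast, given as a Python type (such as float or int), a NumPy type, or a type name string (such as 'category').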


Let’s dive into some examples to understand how Series.astype(dtype) works in practice.

Casting Series to Numeric Data Type

Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the float data type.

# Importing necessary libraries
from pyspark.sql import SparkSession
import pandas as pd

# Creating a SparkSession
spark = SparkSession.builder \
    .appName("Pandas-on-Spark") \
    .getOrCreate()

# Creating a Pandas DataFrame
data = {'numbers': ['10.5', '20.7', '30.9', '40.2']}
pdf = pd.DataFrame(data)

# Converting Pandas DataFrame to Spark DataFrame
sdf = spark.createDataFrame(pdf)

# Converting Spark DataFrame to Pandas-on-Spark DataFrame (Spark 3.2+),
# because Series.astype belongs to the Pandas-on-Spark API
psdf = sdf.pandas_api()

# Converting the 'numbers' column to float data type
psdf['numbers'] = psdf['numbers'].astype(float)

# Displaying the result
psdf.to_spark().show()

+-------+
|numbers|
+-------+
|   10.5|
|   20.7|
|   30.9|
|   40.2|
+-------+

Casting Series to Categorical Data Type

Suppose we have a Pandas-on-Spark series containing categorical data, and we want to convert it to the category data type.

# Creating a Pandas DataFrame with categorical data
data = {'categories': ['A', 'B', 'C', 'A', 'B', 'C']}
pdf = pd.DataFrame(data)

# Converting Pandas DataFrame to Spark DataFrame
sdf = spark.createDataFrame(pdf)

# Converting Spark DataFrame to Pandas-on-Spark DataFrame
psdf = sdf.pandas_api()

# Converting the 'categories' column to category data type
psdf['categories'] = psdf['categories'].astype('category')

# Displaying the result (pandas-style output)
print(psdf)

   categories
0           A
1           B
2           C
3           A
4           B
5           C

Casting Series to Integer Data Type

Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the integer data type.

# Creating a Pandas DataFrame with numerical data in string format
data = {'numbers': ['10', '20', '30', '40']}
pdf = pd.DataFrame(data)

# Converting Pandas DataFrame to Spark DataFrame
sdf = spark.createDataFrame(pdf)

# Converting Spark DataFrame to Pandas-on-Spark DataFrame
psdf = sdf.pandas_api()

# Converting the 'numbers' column to integer data type
psdf['numbers'] = psdf['numbers'].astype(int)

# Displaying the result
psdf.to_spark().show()

+-------+
|numbers|
+-------+
|     10|
|     20|
|     30|
|     40|
+-------+

Author: user