PySpark : Casting the data type of a series to a specified type


Understanding Series.astype(dtype)

The Series.astype(dtype) method in Pandas-on-Spark casts a Series to a specified data type (dtype). This is useful in data processing tasks where data types need to be made consistent or transformed for further analysis.

Syntax:

Series.astype(dtype)

Where:

  • dtype: The data type to which the Series will be cast. It can be a Python type (such as int or float), a NumPy dtype, or a string alias such as 'int64' or 'category', as illustrated in the short sketch below.
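
As a quick, minimal sketch (assuming pyspark.pandas is available, i.e. Spark 3.2 or later, and using a small hypothetical Series s), the same cast can be expressed with any of these equivalent dtype forms:

import numpy as np
import pyspark.pandas as ps

# A hypothetical pandas-on-Spark Series holding numbers as strings
s = ps.Series(['1', '2', '3'])

s_float = s.astype(float)        # Python built-in type
s_int64 = s.astype('int64')      # string alias of a NumPy dtype
s_int32 = s.astype(np.int32)     # NumPy dtype object
s_cat = s.astype('category')     # pandas categorical dtype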

Examples:

Let’s dive into some examples to understand how Series.astype(dtype) works in practice.

Casting Series to Float Data Type

Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the float data type.

# Importing necessary libraries
from pyspark.sql import SparkSession
import pyspark.pandas as ps
import pandas as pd

# Creating a SparkSession
spark = SparkSession.builder \
    .appName("Pandas-on-Spark @ Freshers.in") \
    .getOrCreate()

# Creating a Pandas DataFrame with numbers stored as strings
data = {'numbers': ['10.5', '20.7', '30.9', '40.2']}
pdf = pd.DataFrame(data)

# Converting the Pandas DataFrame to a pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)

# Casting the 'numbers' Series to the float data type
psdf['numbers'] = psdf['numbers'].astype(float)

# Displaying the result
print(psdf['numbers'])

Output:

0    10.5
1    20.7
2    30.9
3    40.2
Name: numbers, dtype: float64

Casting Series to Categorical Data Type

Suppose we have a Pandas-on-Spark series containing categorical data, and we want to convert it to the category data type.

# Creating a Pandas DataFrame with categorical data
data = {'categories': ['A', 'B', 'C', 'A', 'B', 'C']}
pdf = pd.DataFrame(data)

# Converting the Pandas DataFrame to a pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)

# Casting the 'categories' Series to the category data type
psdf['categories'] = psdf['categories'].astype('category')

# Displaying the result
print(psdf['categories'])

Output:

0    A
1    B
2    C
3    A
4    B
5    C
Name: categories, dtype: category
Categories (3, object): ['A', 'B', 'C']
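
Because the values print the same before and after the cast, it helps to confirm what the category dtype actually recorded. Here is a short sketch (assuming the psdf DataFrame from the example above) using the .cat accessor:

# Inspecting the categorical Series produced above
print(psdf['categories'].dtype)           # e.g. CategoricalDtype(categories=['A', 'B', 'C'], ordered=False)
print(psdf['categories'].cat.categories)  # Index(['A', 'B', 'C'], dtype='object')
print(psdf['categories'].cat.codes)       # the integer codes backing each value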

Casting Series to Integer Data Type

Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the integer data type.

# Creating a Pandas DataFrame with numerical data in string format
data = {'numbers': ['10', '20', '30', '40']}
pdf = pd.DataFrame(data)

# Converting the Pandas DataFrame to a pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)

# Casting the 'numbers' Series to the integer data type
psdf['numbers'] = psdf['numbers'].astype(int)

# Displaying the result
print(psdf['numbers'])

Output:

0    10
1    20
2    30
3    40
Name: numbers, dtype: int64
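
The same cast can also be applied at the DataFrame level. As a brief sketch (assuming the pdf DataFrame from the example above), DataFrame.astype accepts a dict mapping column names to target dtypes:

# Casting at the DataFrame level with a column-to-dtype mapping
psdf = ps.from_pandas(pdf).astype({'numbers': int})
print(psdf.dtypes)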