Understanding Series.astype(dtype)
The Series.astype(dtype) method in Pandas-on-Spark casts a series to the specified data type (dtype). This is useful in data processing tasks where data types need to be made consistent or converted before further analysis.
Syntax:
Series.astype(dtype)
Where:
dtype: The data type to which the series will be cast.
Examples:
Let’s dive into some examples to understand how Series.astype(dtype) works in practice.
Casting Series to Numeric Data Type
Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the float data type.
# Importing necessary libraries
from pyspark.sql import SparkSession
import pandas as pd
import pyspark.pandas as ps

# Creating a SparkSession
spark = SparkSession.builder \
    .appName("Pandas-on-Spark @ Freshers.in") \
    .getOrCreate()

# Creating a Pandas DataFrame
data = {'numbers': ['10.5', '20.7', '30.9', '40.2']}
pdf = pd.DataFrame(data)

# Converting Pandas DataFrame to a Pandas-on-Spark DataFrame
# (a plain Spark DataFrame does not support column assignment)
psdf = ps.from_pandas(pdf)

# Converting the 'numbers' column to float data type
psdf['numbers'] = psdf['numbers'].astype(float)

# Displaying the result as a Spark table
psdf.to_spark().show()
Output:
+-------+
|numbers|
+-------+
|   10.5|
|   20.7|
|   30.9|
|   40.2|
+-------+
Casting Series to Categorical Data Type
Suppose we have a Pandas-on-Spark series containing categorical data, and we want to convert it to the category data type.
# Imports repeated here so the snippet runs on its own
import pandas as pd
import pyspark.pandas as ps

# Creating a Pandas DataFrame with categorical data
data = {'categories': ['A', 'B', 'C', 'A', 'B', 'C']}
pdf = pd.DataFrame(data)

# Converting Pandas DataFrame to a Pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)

# Converting the 'categories' column to category data type
psdf['categories'] = psdf['categories'].astype('category')

# Displaying the resulting dtype and its categories
print(psdf['categories'].dtype)
print(psdf['categories'].cat.categories.tolist())
Output:
category
['A', 'B', 'C']
Casting Series to Integer Data Type
Suppose we have a Pandas-on-Spark series containing numerical data in string format, and we want to convert it to the integer data type.
# Imports repeated here so the snippet runs on its own
import pandas as pd
import pyspark.pandas as ps

# Creating a Pandas DataFrame with numerical data in string format
data = {'numbers': ['10', '20', '30', '40']}
pdf = pd.DataFrame(data)

# Converting Pandas DataFrame to a Pandas-on-Spark DataFrame
psdf = ps.from_pandas(pdf)

# Converting the 'numbers' column to integer data type
psdf['numbers'] = psdf['numbers'].astype(int)

# Displaying the result as a Spark table
psdf.to_spark().show()
Output:
+-------+
|numbers|
+-------+
|     10|
|     20|
|     30|
|     40|
+-------+