PySpark : A Comprehensive Guide to Converting Expressions to Fixed-Point Numbers in PySpark

Among PySpark’s numerous features, one that stands out is its ability to convert input expressions into fixed-point numbers. This feature comes in handy when dealing with data that requires a high level of precision or when we want to control the decimal places of numbers to maintain consistency across datasets.

In this article, we will walk you through a detailed explanation of how to convert input expressions to fixed-point numbers using PySpark. Note that the conversion follows PySpark’s usual rule for casts: a NULL input produces a NULL output.

Understanding Fixed-Point Numbers

Before we get started, it’s essential to understand what fixed-point numbers are. A fixed-point number has a specific number of digits before and after the decimal point. Unlike floating-point numbers, where the decimal point can ‘float’, in fixed-point numbers, the decimal point is ‘fixed’.
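
As a quick illustration of the idea (using Python’s standard decimal module rather than PySpark, purely to show the concept), a fixed-point value always carries the same number of decimal places, while a floating-point value does not:

from decimal import Decimal, ROUND_HALF_UP

# A floating-point number keeps as many digits as its binary representation allows
x = 10.123456

# A fixed-point style value is constrained to exactly two digits after the decimal point
fixed = Decimal(str(x)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(x)      # 10.123456
print(fixed)  # 10.12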

PySpark’s Fixed-Point Conversion

PySpark converts an expression to a fixed-point number by casting it to the DecimalType data type with the cast function. DecimalType lets you specify the total number of digits (the precision) as well as the number of digits after the decimal point (the scale).

Here is the syntax for converting an expression to a fixed-point number:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
df.withColumn("fixed_point_column", col("input_column").cast(DecimalType(precision, scale)))

In the above code:

df is the DataFrame.
fixed_point_column is the new column with the fixed-point number.
input_column is the column you want to convert.
precision is the total number of digits.
scale is the number of digits after the decimal point (see the short illustration after this list).
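
To make precision and scale concrete: DecimalType(5, 2), which is used later in this article, holds at most three digits before the decimal point and exactly two after it, so the largest representable value is 999.99. The sketch below assumes a DataFrame df with a numeric column named amount (both hypothetical names) and shows two typical choices. In Spark’s default (non-ANSI) configuration, a value that does not fit the chosen precision is cast to null rather than raising an error.

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

# 5 total digits, 2 after the decimal point -> values up to 999.99
df = df.withColumn("amount_2dp", col("amount").cast(DecimalType(5, 2)))

# 10 total digits, 4 after the decimal point -> values up to 999999.9999
df = df.withColumn("amount_4dp", col("amount").cast(DecimalType(10, 4)))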

A Practical Example

Let’s work through an example to demonstrate this.

Firstly, let’s initialize a PySpark session and create a DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
spark = SparkSession.builder.appName("FixedPointNumbers").getOrCreate()
data = [("Sachin", 10.123456), ("James", 20.987654), ("Smitha ", 30.111111), (None, None)]
df = spark.createDataFrame(data, ["Name", "Score"])
df.show()
+-------+---------+
|   Name|    Score|
+-------+---------+
| Sachin|10.123456|
|  James|20.987654|
|Smitha |30.111111|
|   null|     null|
+-------+---------+

Next, let’s convert the ‘Score’ column to a fixed-point number with a total of 5 digits, 2 of which are after the decimal point:

df = df.withColumn("Score", col("Score").cast(DecimalType(5, 2)))
df.show()
+-------+-----+
|   Name|Score|
+-------+-----+
| Sachin|10.12|
|  James|20.99|
|Smitha |30.11|
|   null| null|
+-------+-----+

The Score column values are now fixed-point numbers with two decimal places. Notice that the NULL value remained NULL after the conversion, consistent with the rule stated earlier: a NULL input produces a NULL output.
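
If you want to confirm that the cast changed the column’s data type (and not just the displayed values), printSchema is a quick check. With the example above, the output should look roughly like this:

df.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Score: decimal(5,2) (nullable = true)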
