Handle precise numeric data in PySpark: DecimalType


When numeric values must be stored and computed exactly, PySpark’s DecimalType data type becomes indispensable. In this guide, we’ll explore DecimalType, its applications, use cases, and best practices for handling precise numeric data.

The Need for DecimalType

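In data analysis and financial applications, maintaining precision is paramount. Binary floating-point types such as FloatType and DoubleType cannot represent most decimal fractions (0.1, for example) exactly, so small rounding errors appear and can accumulate across calculations. DecimalType stores exact base-10 values, which makes it a valuable tool for ensuring accuracy.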
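The rounding problem described above can be seen without Spark at all. The following is a minimal sketch in plain Python, using the standard-library decimal module (the same Decimal class the PySpark example later in this guide passes into a DecimalType column). The variable names are illustrative only.

```python
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so their sum drifts
# slightly away from 0.3.
float_total = 0.1 + 0.2

# Decimals built from strings keep the exact base-10 digits.
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```

Note that the Decimal values are constructed from strings. Building a Decimal from a float literal would carry the float's rounding error along with it.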

Understanding PySpark’s DecimalType

The DecimalType data type in PySpark represents decimal numbers with fixed precision and scale. It allows you to work with financial data, currency amounts, and other numeric values that require exact precision.

Key Attributes of DecimalType

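  • Precision: The total number of digits (both integer and fractional) in a decimal value. PySpark supports a precision of up to 38.
  • Scale: The number of digits to the right of the decimal point. The scale cannot exceed the precision. For example, DecimalType(10, 2) holds up to 8 integer digits and 2 fractional digits.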
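To make the relationship between precision and scale concrete, here is a small sketch in plain Python that checks whether a value can be represented exactly for a given precision and scale. The helper name fits is hypothetical and is not part of PySpark. It only illustrates the arithmetic behind the two attributes and makes no claim about how Spark itself validates or handles out-of-range values.

```python
from decimal import Decimal, localcontext


def fits(value: Decimal, precision: int, scale: int) -> bool:
    """Return True if a finite `value` is exactly representable with
    `precision` total digits, `scale` of them after the decimal point.

    Hypothetical helper for illustration; not a PySpark API.
    """
    with localcontext() as ctx:
        # Give the arithmetic plenty of headroom so quantize() never fails
        # because of the default 28-digit context.
        ctx.prec = 2 * precision + 2
        # Integer part must have at most (precision - scale) digits.
        limit = Decimal(10) ** (precision - scale)
        # Smallest representable step, e.g. 0.01 for scale 2.
        quantum = Decimal(1).scaleb(-scale)
        return abs(value) < limit and value.quantize(quantum) == value


print(fits(Decimal("8900.25"), 10, 2))       # True: 4 integer digits, 2 fractional
print(fits(Decimal("123456789.00"), 10, 2))  # False: 9 integer digits, only 8 allowed
print(fits(Decimal("1.005"), 10, 2))         # False: needs 3 fractional digits
```

A check like this can help when choosing a precision and scale for a column: pick the scale from the smallest unit you need to store, then add enough integer digits to cover the largest expected value.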

Example: Handling Financial Transactions

Let’s consider a real-world scenario where you need to work with financial transaction amounts using DecimalType:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType
from decimal import Decimal
# Initialize SparkSession
spark = SparkSession.builder.appName("DecimalType @ Freshers.in Learning Example").getOrCreate()
# Create a sample dataframe
data = [("Transaction 1", "USD", Decimal("125.75")),
        ("Transaction 2", "EUR", Decimal("340.95")),
        ("Transaction 3", "GBP", Decimal("55.50")),
        ("Transaction 4", "JPY", Decimal("8900.25")),
        ("Transaction 5", "AUD", Decimal("1234.55"))]
# Define a DecimalType with precision 10 and scale 2
decimal_type = DecimalType(10, 2)
schema = StructType([StructField("TransactionName", StringType(), True),
                     StructField("Currency", StringType(), True),
                     StructField("Amount", decimal_type, True)])
df = spark.createDataFrame(data, schema)
# Show the dataframe
df.show()
Output
+---------------+--------+-------+
|TransactionName|Currency| Amount|
+---------------+--------+-------+
|  Transaction 1|     USD| 125.75|
|  Transaction 2|     EUR| 340.95|
|  Transaction 3|     GBP|  55.50|
|  Transaction 4|     JPY|8900.25|
|  Transaction 5|     AUD|1234.55|
+---------------+--------+-------+
Author: user