PySpark : Converting arguments to numeric types

In PySpark, the Pandas API on Spark (pyspark.pandas) provides many of pandas' familiar functions, including to_numeric(), which converts its argument to a numeric type. This article covers the syntax of to_numeric() and walks through practical examples of its use.

Understanding to_numeric()

The to_numeric() function in the Pandas API on Spark converts its argument to a numeric type, which is useful when cleaning string-typed data before analysis. It accepts scalars, lists, tuples, 1-d arrays, and Series, and lets you control how parsing errors are handled during conversion.
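
As a quick illustration, here is a minimal sketch, assuming a PySpark 3.2+ environment where the Pandas API ships as pyspark.pandas:

import pyspark.pandas as ps

# A scalar string is converted to a single numeric value
print(ps.to_numeric('42'))  # 42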

Syntax

In the Pandas API on Spark, the syntax for to_numeric() is as follows:

pyspark.pandas.to_numeric(arg, errors='raise')

Here, arg is the value to convert (a scalar, list, tuple, 1-d array, or Series), and the optional errors parameter controls how parsing failures are handled: 'raise' (the default) raises an exception, while 'coerce' replaces invalid values with NaN.
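
For instance, with the default errors='raise', an unparsable value aborts the conversion. A small sketch (the string 'oops' is just an illustrative bad input):

import pyspark.pandas as ps

data = ['10', '20', 'oops']
try:
    # errors='raise' is the default: an unparsable value raises an exception
    ps.to_numeric(data)
except ValueError as e:
    print(f"Conversion failed: {e}")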

Examples

Let’s explore various scenarios to understand the functionality of to_numeric():

Example 1: Basic Conversion

import pyspark.pandas as ps
# Define a list of strings
data = ['10', '20', '30', '40']
# Convert the strings to a numeric type
numeric_data = ps.to_numeric(data)
print(numeric_data)
# Output: [10 20 30 40]
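
For list input, the result is a plain NumPy array rather than a distributed Series, so ordinary NumPy operations apply directly. A small follow-up using the numeric_data variable from Example 1 (the exact dtype is platform dependent, typically int64):

# numeric_data is a NumPy array
print(type(numeric_data))   # <class 'numpy.ndarray'>
print(numeric_data.sum())   # 100
print(numeric_data.dtype)   # int64 on most platforms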

Example 2: Handling Errors

import pyspark.pandas as ps
# Define a list of strings with an invalid value
data = ['10', '20', '30', 'invalid']
# Convert the strings to a numeric type, coercing invalid values to NaN
numeric_data = ps.to_numeric(data, errors='coerce')
print(numeric_data)
# Output: [10. 20. 30. nan]
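
The same coercion works on a distributed pandas-on-Spark Series, where unparsable entries become nulls (displayed as NaN). A minimal sketch, with the exact result dtype (for example float32) depending on the Spark version:

import pyspark.pandas as ps

# The same values as Example 2, but as a pandas-on-Spark Series
psser = ps.Series(['10', '20', '30', 'invalid'])
# 'invalid' becomes NaN instead of raising an error
print(ps.to_numeric(psser, errors='coerce'))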

Example 3: Using with a Spark DataFrame

to_numeric() operates on pandas-on-Spark objects, so to apply it to a column of a regular Spark DataFrame, first convert the DataFrame with pandas_api() and then convert the column of interest:

import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("to_numeric Example : Learning @ Freshers.in ") \
    .getOrCreate()

# Sample data: the 'value' column holds numbers stored as strings
data = [(1, "15"), (2, "25"), (3, "35"), (4, "45")]
columns = ["id", "value"]

# Create a Spark DataFrame
df = spark.createDataFrame(data, columns)

# Convert the Spark DataFrame to a pandas-on-Spark DataFrame
psdf = df.pandas_api()

# Convert the string 'value' column to a numeric type
numeric_values = ps.to_numeric(psdf["value"])

# Show the converted column
print(numeric_values)

Output

0    15.0
1    25.0
2    35.0
3    45.0
Name: value, dtype: float32
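
If the converted column is needed back on the Spark SQL side, one pattern (a sketch building on the psdf DataFrame from Example 3) is to assign the result back and return to a regular Spark DataFrame with to_spark():

# Replace the string column with its numeric version
psdf["value"] = ps.to_numeric(psdf["value"], errors='coerce')

# Convert the pandas-on-Spark DataFrame back to a Spark DataFrame
spark_df = psdf.to_spark()
spark_df.printSchema()  # 'value' is now a float column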
