Converting numerical strings from one base to another within DataFrames: conv

PySpark @ Freshers.in

The conv function in PySpark simplifies the process of converting numerical strings from one base to another within DataFrames. With this function, converting between common bases like binary, decimal, and hexadecimal is straightforward and integrates easily into data transformation pipelines in Spark.

Handling numbers in different bases is common in data manipulation, especially when dealing with binary, decimal, and hexadecimal values. PySpark SQL offers the conv function, which converts a number from one base to another. This article provides an example of using the conv function with hardcoded values in a PySpark DataFrame.

Start a PySpark Session:

All PySpark applications need a SparkSession to start. Initialize a SparkSession first:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Base Conversion using conv : Learning @ Freshers.in") \
    .getOrCreate()

Create a DataFrame with sample data:

Construct a PySpark DataFrame with a single column of binary numbers stored as strings:

from pyspark.sql import Row
data = [Row(binary_string="10"), Row(binary_string="101"), Row(binary_string="1111"), Row(binary_string="10010")]
df = spark.createDataFrame(data)
df.show()

This code creates a DataFrame with binary numbers represented as strings.

Use the ‘conv’ function for base conversion:

The conv function is part of pyspark.sql.functions and takes three arguments: the column holding the number as a string, the base the number is currently in, and the base you want to convert to. Here’s how to use it:

from pyspark.sql.functions import conv
df_converted = df.select(conv(df["binary_string"], 2, 16).alias("hex_string"))
df_converted.show()

This command converts the binary numbers (base 2) in the “binary_string” column to hexadecimal (base 16) and shows the DataFrame.
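To make the arithmetic behind this conversion concrete, here is a minimal pure-Python sketch of the same idea: parse the digits in the source base, then emit digits in the target base by repeated division. The helper name convert_base is hypothetical and not part of PySpark. It handles only non-negative numbers and does not try to reproduce how Spark treats negative or invalid input.

```python
import string

# Digit alphabet "0-9A-Z", enough for bases 2 through 36.
DIGITS = string.digits + string.ascii_uppercase


def convert_base(number: str, from_base: int, to_base: int) -> str:
    """Hypothetical helper: convert a non-negative numeral string between bases."""
    if not (2 <= from_base <= 36 and 2 <= to_base <= 36):
        raise ValueError("bases must be between 2 and 36")

    # Step 1: parse the string in the source base into an integer.
    value = int(number, from_base)
    if value < 0:
        raise ValueError("this sketch handles non-negative numbers only")
    if value == 0:
        return "0"

    # Step 2: peel off target-base digits, least significant first.
    digits = []
    while value > 0:
        value, remainder = divmod(value, to_base)
        digits.append(DIGITS[remainder])

    # The digits were collected in reverse order.
    return "".join(reversed(digits))


# Binary "1111" is 15 in decimal, which is "F" in hexadecimal.
print(convert_base("1111", 2, 16))
```

The sketch is useful for spot-checking a few values locally. Inside a Spark job, the built-in conv function remains the better choice because it runs on the executors without a Python UDF.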

Terminate the PySpark session:

Don’t forget to stop the SparkSession at the end to release resources:

spark.stop()

Output

+-------------+
|binary_string|
+-------------+
|           10|
|          101|
|         1111|
|        10010|
+-------------+

+----------+
|hex_string|
+----------+
|         2|
|         5|
|         F|
|        12|
+----------+
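
The hex_string values above can be cross-checked without Spark. The short sketch below uses only Python built-ins (int with a base argument and format with the "X" specifier) and also prints the intermediate decimal value for each row. The variable names are illustrative.

```python
# The same binary strings used in the DataFrame above.
binary_strings = ["10", "101", "1111", "10010"]

rows = []
for s in binary_strings:
    decimal_value = int(s, 2)                # parse as base 2
    hex_string = format(decimal_value, "X")  # render as uppercase base 16
    rows.append((s, decimal_value, hex_string))

for binary, decimal_value, hex_string in rows:
    print(f"{binary:>6} -> {decimal_value:>2} -> {hex_string}")
```

For example, 10010 in binary is 16 + 2 = 18 in decimal, and 18 = 1 × 16 + 2, which gives the hexadecimal string 12 shown in the last row.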

Author: user