PySpark : Introduction to BASE64_ENCODE and its Applications in PySpark

Base64 is a group of similar binary-to-text encoding schemes that represent binary data as an ASCII string by translating it into a radix-64 representation. It is designed to carry binary data across channels that only handle text reliably, so the data arrives unmodified. The trade-off is size: the encoded output is about 33% larger than the input.

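BASE64_ENCODE, as used in this article, refers to the operation of encoding data into this Base64 format. In PySpark the built-in function for it is pyspark.sql.functions.base64.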
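
To make the radix-64 idea concrete, here is a minimal sketch using Python's standard base64 module. It shows how each group of 3 input bytes becomes 4 output characters, and how '=' padding fills an incomplete final group. The helper name encoded_length is hypothetical and exists only for this illustration.

```python
import base64

# Every 3 input bytes (24 bits) become 4 output characters (4 x 6 bits).
full_group = base64.b64encode(b"Man")   # 3 bytes: no padding needed
two_bytes = base64.b64encode(b"Ma")     # 2 bytes: one '=' pads the group
one_byte = base64.b64encode(b"M")       # 1 byte: two '=' pad the group

print(full_group, two_bytes, one_byte)  # b'TWFu' b'TWE=' b'TQ=='

# Hypothetical helper: padded Base64 output is 4 * ceil(n / 3) characters
# long for n input bytes.
def encoded_length(n):
    return 4 * ((n + 2) // 3)

print(encoded_length(28))  # 40
```

This is also why the longest email in the sample output of this article, which is 28 bytes long, encodes to exactly 40 characters.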

Where is BASE64_ENCODE used?

Base64 encoding is commonly used when binary data must be stored in or sent over media designed for text. Such media may reinterpret or drop bytes that are not printable characters. Base64 output avoids this because it uses only a 64-character ASCII alphabet plus '=' padding.

Base64 is commonly used in a number of applications, including email attachments via MIME and embedding binary data in XML or JSON documents.
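
As a rough illustration of the JSON use case, the sketch below embeds a few raw bytes in a JSON document and recovers them on the other side. The payload and the field names "name" and "data" are made up for this example.

```python
import base64
import json

# Hypothetical binary payload: these bytes are not valid UTF-8 text.
raw = bytes([0x00, 0xFF, 0x10, 0x80])

# JSON strings cannot hold arbitrary bytes, so store the Base64 text instead.
document = json.dumps({
    "name": "blob.bin",
    "data": base64.b64encode(raw).decode("ascii"),
})

# The receiver parses the JSON and decodes the field back to the original bytes.
restored = base64.b64decode(json.loads(document)["data"])
print(restored == raw)  # True
```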

Advantages of BASE64_ENCODE

  1. Data Integrity: Because the output contains only printable ASCII characters, text-oriented channels that would alter or drop raw bytes pass it through unchanged.
  2. Usability: It can be used to send binary data, such as images or files, over channels designed to transmit text-based data.
  3. Obfuscation, not security: Base64 is not encryption. Anyone can decode it without a key, so it only hides a value from casual reading and must not be used to protect sensitive data.
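
The point about security is easy to check. The following sketch takes one of the encoded values from the sample output later in this article and reverses it with a single standard-library call.

```python
import base64

# Value copied from the "Encoded Email" column in this article's sample output.
encoded = "bW9oYW4ubGFsQGZyZXNoZXJzLmlu"

# No key or secret is involved: decoding is a single standard-library call.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # mohan.lal@freshers.in
```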

How to Encode the Input Using Base64 Encoding in PySpark

PySpark, the Python API for Spark, has shipped built-in functions for this since Spark 1.5: pyspark.sql.functions.base64 encodes a binary column as a Base64 string, and unbase64 decodes it. When you need custom handling, such as choosing the character encoding or deciding what happens to bad values, you can also wrap Python's built-in base64 module in a User Defined Function (UDF). Below is a sample of the UDF approach.

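from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import base64

def base64_encode(value):
    try:
        return base64.b64encode(value.encode('utf-8')).decode('utf-8')
    except AttributeError:
        # Null or non-string input has no .encode(); return null for that row
        return None

# Wrap the Python function as a UDF that returns a string column
base64_encode_udf = udf(base64_encode, StringType())

# df is assumed to be an existing DataFrame with a string column 'column_to_encode'
df_encoded = df.withColumn('encoded_column', base64_encode_udf(df['column_to_encode']))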

Example with Data

Base64 encoding is a handy way to keep data intact when it must be stored in or transferred over systems designed to handle text. The script below applies the UDF approach to a small sample DataFrame.

# Import the required libraries
import base64
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Start a Spark session
spark = SparkSession.builder.appName("freshers.in Learning for BASE64_ENCODE").getOrCreate()
# Create a sample DataFrame
data = [('Sachin', 'Tendulkar', 'sachin.tendulkar@freshers.in'),
        ('Mahesh', 'Babu', 'mahesh.babu@freshers.in'),
        ('Mohan', 'Lal', 'mohan.lal@freshers.in')]
df = spark.createDataFrame(data, ["First Name", "Last Name", "Email"])
# Display the original DataFrame without truncating long values
df.show(20, truncate=False)
# Define the Base64 encode function
def base64_encode(value):
    try:
        return base64.b64encode(value.encode('utf-8')).decode('utf-8')
    except AttributeError:
        # Null or non-string input has no .encode(); return null for that row
        return None
# Create a UDF for the Base64 encode function
base64_encode_udf = udf(base64_encode, StringType())
# Add a new column to the DataFrame with the encoded email
df_encoded = df.withColumn('Encoded Email', base64_encode_udf(df['Email']))
# Display the DataFrame with the encoded column
df_encoded.show(20, truncate=False)

Output

+----------+---------+----------------------------+
|First Name|Last Name|Email                       |
+----------+---------+----------------------------+
|Sachin    |Tendulkar|sachin.tendulkar@freshers.in|
|Mahesh    |Babu     |mahesh.babu@freshers.in     |
|Mohan     |Lal      |mohan.lal@freshers.in       |
+----------+---------+----------------------------+

+----------+---------+----------------------------+----------------------------------------+
|First Name|Last Name|Email                       |Encoded Email                           |
+----------+---------+----------------------------+----------------------------------------+
|Sachin    |Tendulkar|sachin.tendulkar@freshers.in|c2FjaGluLnRlbmR1bGthckBmcmVzaGVycy5pbg==|
|Mahesh    |Babu     |mahesh.babu@freshers.in     |bWFoZXNoLmJhYnVAZnJlc2hlcnMuaW4=        |
|Mohan     |Lal      |mohan.lal@freshers.in       |bW9oYW4ubGFsQGZyZXNoZXJzLmlu            |
+----------+---------+----------------------------+----------------------------------------+

In this script, we first create a SparkSession, which is the entry point to any functionality in Spark. We then create a DataFrame with some sample data.

The base64_encode function takes an input string, encodes it to UTF-8 bytes, and returns the Base64 text. If the input cannot be encoded (for example, a null value), it returns None. We then wrap it as a user-defined function (UDF) so that it can be applied to a DataFrame column.
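
Because the helper is plain Python, its logic can be checked without starting Spark. The sketch below is a standalone copy of that logic, exercised with one email from the sample data and with a null value.

```python
import base64

# Standalone copy of the helper's logic, runnable without a Spark session.
def base64_encode(value):
    try:
        return base64.b64encode(value.encode('utf-8')).decode('utf-8')
    except AttributeError:
        # None (a null cell) has no .encode(); return None for that row.
        return None

print(base64_encode('mahesh.babu@freshers.in'))  # bWFoZXNoLmJhYnVAZnJlc2hlcnMuaW4=
print(base64_encode(None))                       # None
```

Testing the function this way before wrapping it in a UDF tends to be quicker than debugging it through Spark, where Python errors surface inside executor stack traces.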

Finally, we create a new DataFrame, df_encoded, which includes a new column ‘Encoded Email’. This column is the result of applying our UDF to the ‘Email’ column of the original DataFrame.

Calling df.show() and df_encoded.show() displays the original DataFrame and the DataFrame with the Base64-encoded column, respectively.

Author: user