PySpark: Create an MD5 hash of a string column


Introduction to MD5 Hash

MD5 (Message Digest Algorithm 5) is a widely used hash function that produces a 128-bit (16-byte) hash value. It is commonly used to check the integrity of files against accidental corruption. However, MD5 is not collision-resistant: practical collision attacks have been known since 2004, so different inputs can be crafted to hash to the same output. This makes it unsuitable for security-sensitive uses such as digital signatures, SSL/TLS certificates, or password storage.

An MD5 hash is typically expressed as a 32-digit hexadecimal number.
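To make the size relationship concrete, here is a minimal sketch using Python's standard hashlib module. The input string "hello" is an arbitrary example chosen for illustration; it does not come from the data used later in this post.

```python
import hashlib

# Hash an arbitrary example string (encoded to bytes first)
digest = hashlib.md5("hello".encode("utf-8"))

raw = digest.digest()          # raw digest: 16 bytes, i.e. 128 bits
hex_text = digest.hexdigest()  # same digest written as hexadecimal text

# Each byte is written as two hex digits, so 16 bytes become 32 characters
print(len(raw) * 8, "bits")
print(len(hex_text), "hex characters")
```

The hex form is what the PySpark code in this post stores in the new column.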

Use of MD5 Hash in PySpark

PySpark can generate a 32-character hex-encoded string containing the 128-bit MD5 message digest. The pyspark.sql.functions module provides a built-in md5 function for this, and it is usually the faster option because the hashing runs inside Spark rather than in Python. You can also wrap Python’s built-in hashlib library in a User Defined Function (UDF), which is useful when you need custom preprocessing before hashing.

Here is how you can create an MD5 hash of a string column in PySpark with a UDF.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import hashlib

#Create a Spark session
spark = SparkSession.builder.appName("freshers.in Learning MD5 hash").getOrCreate()

#Creating sample data
data = [("John",), ("Jane",), ("Mike",)]
df = spark.createDataFrame(data, ["Name"])

#Function for generating MD5 hash (null values are passed through as None)
def md5_hash(value):
    if value is None:
        return None
    return hashlib.md5(value.encode('utf-8')).hexdigest()

#UDF for the MD5 function
md5_udf = udf(md5_hash, StringType())

#Apply the above UDF to the DataFrame
df_hashed = df.withColumn("Name_hashed", md5_udf(df['Name']))

df_hashed.show(20, truncate=False)

In this example, we first create a Spark session and a DataFrame df with a single column “Name”. We then define the Python function md5_hash, which encodes a string as UTF-8 and returns its MD5 digest as a hex string, and wrap it in a user-defined function (UDF) md5_udf with a StringType return type. Finally, we apply this UDF to the “Name” column with withColumn, producing a new DataFrame df_hashed that keeps the original names and adds their MD5 hashes in “Name_hashed”.

Output

+----+--------------------------------+
|Name|Name_hashed                     |
+----+--------------------------------+
|John|61409aa1fd47d4a5332de23cbf59a36f|
|Jane|2b95993380f8be6bd4bd46bf44f98db9|
|Mike|1b83d5da74032b6a750ef12210642eea|
+----+--------------------------------+

Author: user
