Introduction to MD5 Hash
MD5 (Message Digest Algorithm 5) is a widely used hash function that produces a 128-bit (16-byte) hash value. It is commonly used to check the integrity of files. However, MD5 is not collision-resistant: practical attacks that find different inputs hashing to the same output have been demonstrated since 2004, which makes it unsuitable for SSL certificates and other uses that require a high degree of security.
An MD5 hash is typically expressed as a 32-digit hexadecimal number.
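To make the size relationship concrete, here is a minimal sketch using Python's standard hashlib module. The input string "hello" is just an arbitrary example, and the variable names are chosen for illustration only.

```python
import hashlib

# Hash a sample byte string (the input is an arbitrary example)
digest = hashlib.md5(b"hello")

raw_bytes = digest.digest()    # the digest as 16 raw bytes (128 bits)
hex_text = digest.hexdigest()  # the same value written as hexadecimal text

print(len(raw_bytes) * 8)  # number of bits in the digest
print(len(hex_text))       # number of hexadecimal characters
print(hex_text)
```

Each byte is written as two hexadecimal characters, which is why a 16-byte digest becomes a 32-character string.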
Use of MD5 Hash in PySpark
You can use PySpark to generate a 32-character hex-encoded string containing the 128-bit MD5 message digest. PySpark provides a built-in md5 function in pyspark.sql.functions, but you can also use Python’s hashlib library to create a User Defined Function (UDF) for this purpose, which is the approach shown here.
Here is how you can create an MD5 hash of a string column in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import hashlib
# Create a Spark session
spark = SparkSession.builder.appName("freshers.in Learning MD5 hash").getOrCreate()
# Create sample data (the names match the output shown below)
data = [("John",), ("Jane",), ("Mike",)]
df = spark.createDataFrame(data, ["Name"])
# Function for generating the MD5 hash of a string (null values are passed through as None)
def md5_hash(value):
    if value is None:
        return None
    return hashlib.md5(value.encode('utf-8')).hexdigest()
# UDF for the MD5 function
md5_udf = udf(md5_hash, StringType())
# Apply the above UDF to the DataFrame
df_hashed = df.withColumn("Name_hashed", md5_udf(df['Name']))
# Show up to 20 rows without truncating the hash column
df_hashed.show(20, False)
In this example, we first create a Spark session and a DataFrame df with a single column “Name”. Then, we define the function md5_hash to generate an MD5 hash of an input string. After that, we create a user-defined function (UDF) md5_udf using PySpark SQL functions. Finally, we apply this UDF to the column “Name” in the DataFrame df and create a new DataFrame df_hashed with the MD5 hashed values of the names.
Output
+----+--------------------------------+
|Name|Name_hashed                     |
+----+--------------------------------+
|John|61409aa1fd47d4a5332de23cbf59a36f|
|Jane|2b95993380f8be6bd4bd46bf44f98db9|
|Mike|1b83d5da74032b6a750ef12210642eea|
+----+--------------------------------+
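The function wrapped by the UDF is ordinary Python, so it can be checked on its own before running it through Spark. The sketch below is a standalone version of the row-level function with a None check, so that null values pass through instead of raising an error. The sample inputs are arbitrary and are not taken from the DataFrame above.

```python
import hashlib

def md5_hash(value):
    """Return the 32-character MD5 hex digest of a string, or None for a null."""
    if value is None:
        return None
    return hashlib.md5(value.encode("utf-8")).hexdigest()

# Sample inputs, including a null as it might appear in a DataFrame column
samples = ["hello", "", None]
results = [md5_hash(s) for s in samples]

for sample, result in zip(samples, results):
    print(repr(sample), "->", result)
```

Testing the plain function this way is quicker than starting a Spark session, and it shows how the UDF would behave on empty strings and nulls.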