PySpark: Formatting the arguments in printf-style and returning the result as a string column

PySpark @ Freshers.in

pyspark.sql.functions.format_string

‘format_string’ is a function in the pyspark.sql.functions module. It takes a printf-style format string followed by one or more columns, formats the column values accordingly, and returns the result as a string column. It is typically called inside the select method of a DataFrame.

Here is a full code example that demonstrates the use of the ‘format_string’ function in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import format_string
# Create a SparkSession
spark = SparkSession.builder.appName("format_string_example").getOrCreate()
# Create a sample DataFrame
data = [(1, "foo", 3.14), (2, "bar", 2.71), (3, "baz", 1.41)]
columns = ["id", "name", "value"]
df = spark.createDataFrame(data, columns)
df.show()

Input DataFrame

+---+----+-----+
| id|name|value|
+---+----+-----+
|  1| foo| 3.14|
|  2| bar| 2.71|
|  3| baz| 1.41|
+---+----+-----+

Use the ‘format_string’ function to specify the output format

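# Each format_string call returns a new string column
df2 = df.select(
    format_string("%05d", "id").alias("ID"),        # zero-padded, at least 5 digits
    format_string("%10s", "name").alias("NAME"),    # right-aligned, at least 10 characters
    format_string("%.2f", "value").alias("VALUE")   # 2 decimal places
)
df2.show()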

Result

+-----+----------+-----+
|   ID|      NAME|VALUE|
+-----+----------+-----+
|00001|       foo| 3.14|
|00002|       bar| 2.71|
|00003|       baz| 1.41|
+-----+----------+-----+

In this example, we create a sample DataFrame, df, with columns “id”, “name”, and “value”. We then call the ‘format_string’ function inside the select method to specify that column “id” should be formatted as an integer zero-padded to a minimum width of 5 digits, column “name” as a string right-aligned and space-padded to a minimum width of 10 characters (longer values are not truncated), and column “value” as a floating point number with 2 decimal places. Every column in the resulting DataFrame is a string column holding the formatted text.

Additional notes:

As another example, suppose you have a DataFrame called df with columns “A”, “B”, and “C”. You can use the ‘format_string’ function to specify that column “A” should be formatted as a string padded to a minimum width of 10 characters, column “B” as a floating point number with 2 decimal places, and column “C” as an integer.

df.select("A", "B", "C").select(
    format_string("%10s", "A").alias("A"),
    format_string("%.2f", "B").alias("B"),
    format_string("%d", "C").alias("C")
).show()

The string passed to ‘format_string’ follows printf-style conventions, similar to Python’s % string formatting. For example, in format_string("%10s", "A") above, %10s is the format string and A is the column name.

The above example outputs a DataFrame with columns “A”, “B”, and “C”, each returned as a string in the specified format.
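Because the specifiers resemble Python’s % formatting, the basic ones can be tried out in plain Python before running a Spark job. This is only a rough sketch: Spark evaluates the format on the JVM rather than in Python, so less common specifiers may behave differently, and the variable names below are illustrative.

```python
# Plain Python, no Spark needed: the % operator accepts the same
# specifiers for these simple cases.
padded_id = "%05d" % 1                    # zero-padded to a minimum width of 5
padded_name = "%10s" % "foo"              # right-aligned, space-padded to width 10
long_name = "%10s" % "a_very_long_name"   # longer than 10 characters: not truncated
rounded = "%.2f" % 3.14159                # two digits after the decimal point

print(repr(padded_id), repr(padded_name), repr(long_name), repr(rounded))
```

Note that the width in %10s is a minimum, not a maximum: values longer than the width are passed through unchanged.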

Author: user
