Apache Spark is an open-source, distributed computing system that can process large amounts of data…
Tag: Spark_Interview
In PySpark, what is the difference between spark.table() and spark.read.table()?
In PySpark, spark.table() is used to read a table from the Spark catalog, whereas spark.read.table() is used to read a…
PySpark: How to read the date datatype from CSV?
We specify inferSchema = true when a CSV file is being read. Spark determines the data type of a column…
PySpark: How to accept a date in a DataFrame: DateType can not accept object ‘YYYY-MM-DD’ in type
Accepting a date in a DataFrame: When you define data in a list of tuples and try to read…
How to transform columns into list of objects [arrays] on top of group by in PySpark – collect_list and collect_set
In this article we will see how to return a set of objects in an array, with or without duplicate…
Convert data from the PySpark DataFrame columns to Row format or get elements in columns in row
pyspark.sql.functions.collect_list(col): This is an aggregate function that returns a list of objects with duplicates. To retrieve the data from the PySpark…
PySpark: How to add months to a date column in a Spark DataFrame (add_months)
I have a use case where I want to add months to a date column in a Spark DataFrame. Function:…
PySpark: How to return the first column that is not null
pyspark.sql.functions.coalesce: If you want to return the first non-null value from a list of columns, you can use the coalesce function in…
How can you convert a PySpark DataFrame to JSON?
pyspark.sql.DataFrame.toJSON: There may be situations where you need to send your DataFrame to a file, to a server or…
How can I see the full column values in a Spark DataFrame?
When we call dataframe.show(), we can see that some of the column values are truncated. Here we…
What is the difference between repartition() and coalesce()?
The repartition algorithm performs a full shuffle and creates new partitions with data that’s distributed evenly. The repartition algorithm makes…