Installing Spark on a Linux machine can be done in a few steps. The following…
Tag: big_data_interview
How do you break a lineage in Apache Spark? Why do we need to break a lineage in Apache Spark?
In Apache Spark, a lineage refers to the series of RDD (Resilient Distributed Dataset) operations that are performed on a…
When should you not use Apache Spark? Explain with reasons.
There are a few situations where it may not be appropriate to use Apache Spark, which is a powerful open-source…
PySpark : How to create a map from a column of structs : map_from_entries
pyspark.sql.functions.map_from_entries map_from_entries(col) is a function in PySpark that creates a map from a column of structs, where the structs have…
PySpark : Converting Unix timestamp to a string representing the timestamp in a specific format
pyspark.sql.functions.from_unixtime The “from_unixtime()” function is a PySpark function that allows you to convert a Unix timestamp (a long integer representing…
PySpark : Check if two or more arrays in a DataFrame column have any common elements
pyspark.sql.functions.arrays_overlap The arrays_overlap function is a PySpark function that allows you to check if two or more arrays in a…
PySpark : Combine the elements of two or more arrays in a DataFrame column
pyspark.sql.functions.array_union The array_union function is a PySpark function that allows you to combine the elements of two or more arrays…
PySpark : Sort an array of elements in a DataFrame column
pyspark.sql.functions.array_sort The array_sort function is a PySpark function that allows you to sort an array of elements in a DataFrame…
PySpark : How to sort a DataFrame column in ascending order while putting the null values first?
pyspark.sql.Column.asc_nulls_first In PySpark, the asc_nulls_first() function is used to sort a column in ascending order while putting the null values…
PySpark : How to round a number up to the nearest integer
pyspark.sql.functions.ceil In PySpark, the ceil() function is used to round a number up to the nearest integer. This function is…
Learn about PySpark's broadcast variables with an example
In PySpark, the broadcast variable is used to cache a read-only variable on all the worker nodes, which can be…