Category: spark

Spark User

PySpark

PySpark – How to get rows having nulls in a column, rows without nulls, or a count of non-null values

pyspark.sql.Column.isNotNull: isNotNull() returns True if the current expression is NOT null; isNull() returns True if the current expression is null. With…

PySpark

PySpark – groupby with aggregation (count, sum, mean, min, max)

pyspark.sql.DataFrame.groupBy: The PySpark groupBy function groups the DataFrame using the specified columns so aggregations (count, sum, mean, min, max) can be run on them….

PySpark

PySpark filter: How to filter data in PySpark – multiple options explained.

pyspark.sql.DataFrame.filter: The PySpark filter function is used to filter the data in a Spark DataFrame, in short, to cleanse…

PySpark

How to concatenate multiple columns in a Spark dataframe

concat_ws: With the concat_ws() function you can concatenate multiple input string columns into a single string column, using…

PySpark

PySpark – How to create an RDD from a List and from AWS S3

In this article you will learn what an RDD is and how to create an RDD from a…

PySpark

How to run a DataFrame as Spark SQL – PySpark

If you are in a situation where you can easily get the result using SQL, or the SQL already exists, then you…

PySpark

How to get all combinations of columns using PySpark? What is cube in Spark?

A cube is a multi-dimensional generalization of a two- or three-dimensional spreadsheet. "Cube" is shorthand for a multidimensional dataset, given…

How to remove csv header using Spark (PySpark)

A common use case when dealing with CSV files is to remove the header from the source to do data…