Recent Posts

Good to read @ Freshers.in

MapReduce vs. Spark – A Comprehensive Guide with example

MapReduce and Spark are two widely used big data processing frameworks. MapReduce was introduced by Google in 2004, while Spark was…

PySpark @ Freshers.in

PySpark : Dropping duplicate rows in PySpark – A Comprehensive Guide with example

PySpark provides several methods to remove duplicate rows from a DataFrame. In this article, we will go over the steps…

PySpark @ Freshers.in

PySpark : Replacing null values in a PySpark DataFrame column with 0 or any value you wish

To replace null values in a PySpark DataFrame column with a numeric value (e.g., 0), you can…

AWS Redshift @ Freshers.in

Redshift : Learn how to link Amazon Redshift to an S3 bucket

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It allows you to run complex analytical queries…

PySpark @ Freshers.in

PySpark : unix_timestamp function – A comprehensive guide

One of the key functionalities of PySpark is the ability to transform data into the desired format. In some cases,…

PySpark @ Freshers.in

PySpark : Reading a Parquet file stored on Amazon S3 using PySpark

To read a Parquet file stored on Amazon S3 using PySpark, you can use the following code: from pyspark.sql import…

AWS Redshift @ Freshers.in

Redshift : Role of VACUUM and ANALYZE in Redshift

Amazon Redshift is a popular data warehousing solution that is widely used by businesses to manage and analyze large volumes…

Google DataFlow @ Freshers.in

Google Dataflow : Handling Late Data in Google Dataflow

Handling late-arriving data is a common challenge when working with streaming data processing systems like Google Dataflow. Late data refers…