Apache Spark interview questions

user March 7, 2021

22. Which file systems does Spark support?
Hadoop Distributed File System (HDFS)
Local File system
Amazon S3
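
In practice, the storage system is usually selected by the URI scheme of the path handed to Spark. The following is a minimal sketch in plain Python, not Spark code: the helper `describe_storage`, the host name, the bucket name, and the file paths are all hypothetical, and `s3a` is assumed as the scheme because it is the one used by the Hadoop S3A connector.

```python
from urllib.parse import urlparse

# Mapping from URI scheme to the storage system it usually denotes.
# "s3a" is the scheme of the Hadoop S3A connector (an assumption about
# how the cluster is set up; other deployments may differ).
SCHEME_TO_STORE = {
    "hdfs": "Hadoop Distributed File System (HDFS)",
    "file": "local file system",
    "s3a": "Amazon S3",
    "": "default file system configured for the cluster",
}


def describe_storage(path):
    """Return a human-readable name for the storage a path points at."""
    scheme = urlparse(path).scheme
    return SCHEME_TO_STORE.get(scheme, "unrecognized scheme: " + scheme)


# Hypothetical example paths, one per storage system.
hdfs_store = describe_storage("hdfs://namenode:8020/data/events.csv")
local_store = describe_storage("file:///tmp/events.csv")
s3_store = describe_storage("s3a://example-bucket/data/events.csv")
```

Paths of these shapes are what would typically be passed to a Spark read call; the helper only illustrates how the scheme distinguishes them.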

23. What is ‘YARN’?
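‘YARN’ (Yet Another Resource Negotiator) is the resource management and job scheduling layer of Hadoop, sometimes described as a large-scale, distributed operating system for big data applications. It is not a feature of Spark itself; rather, Spark can run on YARN, which provides a central resource management platform to deliver scalable operations across the cluster.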

24. List the benefits of Spark over MapReduce.
Thanks to in-memory processing, Spark can run some workloads roughly 10 to 100 times faster than Hadoop MapReduce.
Unlike MapReduce, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries.
MapReduce is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
Spark is well suited to iterative computation because intermediate results can stay in memory between iterations, while MapReduce writes them to disk after every job.
Additionally, Hadoop uses replication to achieve fault tolerance, while Spark uses a different data model, resilient distributed datasets (RDDs). An RDD records the lineage of transformations that produced it, so a lost partition can be recomputed instead of being restored from a replica, which minimizes network input and output.
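
The fault-tolerance idea behind RDDs, rebuilding lost data from the recorded chain of transformations rather than from a replica, can be illustrated with a toy class. This is only a sketch in plain Python: `ToyRDD` and its methods are hypothetical stand-ins and not Spark's API, and real RDDs are partitioned across a cluster, which this example ignores.

```python
class ToyRDD:
    """Toy dataset that remembers its parent and the function applied to it."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source          # raw data, only set on the root dataset
        self.parent = parent          # lineage: the dataset this one derives from
        self.fn = fn                  # lineage: the transformation to re-apply
        self.keep_in_memory = False   # set by cache()
        self._cached = None           # in-memory copy, may be lost

    def map(self, fn):
        # Nothing is computed here; only the lineage is recorded.
        return ToyRDD(parent=self, fn=fn)

    def cache(self):
        self.keep_in_memory = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)
        if self.parent is None:
            data = list(self.source)
        else:
            # Recompute from the parent by replaying the recorded function.
            data = [self.fn(x) for x in self.parent.collect()]
        if self.keep_in_memory:
            self._cached = data
        return list(data)

    def evict(self):
        # Simulates losing the in-memory copy, e.g. when a node fails.
        self._cached = None


base = ToyRDD(source=[1, 2, 3, 4])
squares = base.map(lambda x: x * x).cache()

before_loss = squares.collect()   # computed once, then kept in memory
squares.evict()                   # the cached copy disappears
after_loss = squares.collect()    # rebuilt from lineage, no replica needed
```

After the simulated loss, the second `collect()` returns the same values as the first because the transformation is replayed from the source data.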

25. What is a ‘Spark Executor’?
When ‘SparkContext’ connects to a cluster manager, it acquires ‘Executors’ on the nodes of the cluster. ‘Executors’ are Spark processes that run computations and store data on the worker nodes. ‘SparkContext’ then sends the application's tasks to the executors to run.
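
As a rough analogy for how a driver hands tasks to executors, the sketch below splits a dataset into partitions and runs one task per partition on a worker pool. It uses only Python's standard library; the threads merely stand in for executors (real executors are separate processes on worker nodes), and `run_job` is a hypothetical helper, not part of Spark.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(data, num_partitions, task):
    """Split data into partitions and run `task` on each one in parallel."""
    # Round-robin partitioning: element i goes to partition i % num_partitions.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # Each worker thread plays the role of an executor running one task.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(task, partitions))


# The "driver" submits the job and combines the partial results.
partial_sums = run_job(list(range(1, 11)), num_partitions=3, task=sum)
total = sum(partial_sums)
```

The combining step at the end mirrors how results computed on executors are sent back to the driver program.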

26. List the various types of ‘Cluster Managers’ in Spark.
The Spark framework supports three major types of Cluster Managers:
a. Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster
b. Apache Mesos: a general-purpose cluster manager that can also run Hadoop MapReduce and other applications
c. YARN: the resource manager in Hadoop
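
The cluster manager is normally chosen through the master URL given to the application. The sketch below maps the master URL forms for the three managers listed above to their names. It is plain Python for illustration only: `cluster_manager_for` is a hypothetical helper, and the host names and ports in the examples are placeholders.

```python
def cluster_manager_for(master):
    """Name the cluster manager implied by a Spark master URL."""
    if master == "yarn":
        return "YARN"
    if master.startswith("spark://"):
        return "Standalone"
    if master.startswith("mesos://"):
        return "Apache Mesos"
    # Other master values exist; this sketch covers only the three above.
    raise ValueError("master URL not covered by this sketch: " + master)


# Placeholder hosts and ports, one example per cluster manager.
standalone = cluster_manager_for("spark://master-host:7077")
mesos = cluster_manager_for("mesos://mesos-host:5050")
yarn = cluster_manager_for("yarn")
```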

27. What is a ‘worker node’?
‘Worker node’ refers to any node that can run the application code in a cluster.

28. Define ‘PageRank’.
‘PageRank’ is an algorithm that measures the importance of each vertex in a graph, and Spark's GraphX library includes an implementation of it. An edge from u to v represents an endorsement of v’s importance by u. In simple terms, an Instagram user who is followed by a large number of people will rank high on that platform.
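
To make the idea concrete, here is a minimal sketch of the textbook PageRank iteration in plain Python. It is not the GraphX implementation; the damping factor of 0.85, the iteration count, and the user names in the example graph are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute PageRank scores for a directed graph given as (src, dst) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start with the same score for every vertex.
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every vertex receives a small baseline share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A vertex with no outgoing edges spreads its score evenly.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical "follows" graph: an edge (u, v) means u follows v.
follows = [
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "carol"),
    ("carol", "alice"),
]
ranks = pagerank(follows)
top_user = max(ranks, key=ranks.get)
```

In this graph `carol` is followed by three users and ends up with the highest score, while `bob` and `dave`, who have no followers, receive only the baseline share.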

Copyright © 2024 Freshers.in