Apache Spark interview questions


15. Name a few commonly used Spark Ecosystems.
Spark SQL (formerly Shark), Spark Streaming, GraphX, MLlib, SparkR.

16. What is ‘Spark Streaming’?
Spark Streaming is an extension of the core Spark API that enables processing of live data streams. Data is ingested from sources such as Flume and HDFS, processed, and then pushed out to file systems, databases and live dashboards. It resembles batch processing in that the incoming stream is divided into small batches (micro-batches) which are then processed by the Spark engine.
Business use cases for Spark Streaming: each Spark component has its own use case. Spark Streaming fits near-real-time analysis, where results are needed with a latency of roughly two to fifteen minutes rather than in true real time or in overnight batches.
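A minimal word-count sketch of the micro-batch model, assuming Spark's Scala DStream API and a text source on localhost port 9999 (for example, started with nc -lk 9999); the host, port and batch interval here are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))          // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)        // live input stream
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
counts.print()                                             // emit each micro-batch's counts
ssc.start()
ssc.awaitTermination()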

17. What is ‘GraphX’ in Spark?
‘GraphX’ is the Spark component for graphs and graph-parallel computation. It helps to build, transform and query graphs and to run graph algorithms on them.
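A small sketch of building and transforming a graph, assuming the spark-shell's SparkContext sc and illustrative vertex and edge data:

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph    = Graph(vertices, edges)                      // build the property graph
println(graph.inDegrees.collect().mkString(", "))          // query: in-degree of each vertex
val ranks = graph.pageRank(0.001).vertices                 // transform: run PageRank
ranks.collect().foreach(println)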

18. What is the function of ‘MLlib’?
‘MLlib’ is Spark’s machine learning library. It aims to make machine learning easy and scalable, providing common learning algorithms for real-life use cases including classification, regression, clustering, collaborative filtering, and dimensionality reduction, among others.
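As a minimal sketch of one of those algorithms, here is a k-means clustering run over a handful of illustrative 2-D points, assuming the spark-shell's SparkContext sc:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(points, 2, 20)                    // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)                      // one centre per cluster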

19. What is ‘Spark SQL’?
Spark SQL is the Spark interface for working with structured as well as semi-structured data. It can load data from multiple structured sources such as text files, JSON files and Parquet files, among others. Spark SQL provides a special type of RDD called SchemaRDD (renamed DataFrame in later releases): an RDD of Row objects, where each Row represents a record, together with a schema describing its columns.
Here’s how you can create a SQLContext and a HiveContext in the Spark shell:
SQLContext: scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
HiveContext: scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
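A minimal sketch of loading and querying JSON with the sqlContext created above; the file people.json and its name/age fields are hypothetical:

val people = sqlContext.read.json("people.json")           // schema is inferred from the records
people.registerTempTable("people")                         // expose the data to SQL
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()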

20. What is a ‘Parquet’ in Spark?
‘Parquet’ is a columnar file format supported by many data processing systems. Spark SQL can perform both read and write operations on Parquet files. Its advantages:
Organizing data by column allows for better compression, as the values in a column are more homogeneous.
I/O is reduced because only the subset of columns a query needs has to be scanned.
Better compression also reduces the bandwidth required to read the input.
Since each column stores data of a single type, encodings better suited to modern processors can be used.
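A minimal read/write sketch, reusing the hypothetical people DataFrame and sqlContext from the Spark SQL answer; the output path is illustrative:

people.write.parquet("people.parquet")                     // write columnar Parquet files
val parquetPeople = sqlContext.read.parquet("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people").show()   // scans only the 'name' column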

21. What is an ‘Accumulator’?
‘Accumulators’ are a debugging aid in Spark. Similar to ‘Hadoop Counters’, they count the number of ‘events’ that occur while a program runs.
Accumulators are variables that can only be added to through an associative operation. Spark natively supports accumulators of numeric value types and standard mutable collections. When Spark ships code to the executors, any ordinary variable captured in that code becomes a local copy on each executor, and updates to that copy are never relayed back to the driver. To avoid this problem we make the variable an accumulator, so that the additions made on every executor are relayed back to the driver, which can read the aggregated value.
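A minimal sketch counting blank lines with a numeric accumulator, assuming the spark-shell's SparkContext sc and a hypothetical input.txt:

val blankLines = sc.accumulator(0, "Blank Lines")          // named accumulators also appear in the web UI
val lines = sc.textFile("input.txt")
lines.foreach { line =>
  if (line.isEmpty) blankLines += 1                        // executors can only add to it
}
println("Blank lines: " + blankLines.value)                // only the driver reads the final value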
