Apache Spark interview questions

PySpark @ Freshers.in

29. Can we do real-time processing using Spark SQL?
Not directly but we can register an existing RDD as a SQL table and trigger SQL queries on top of that.

30. What is the biggest shortcoming of Spark?
Spark utilizes more storage space compared to Hadoop and MapReduce.
Also, Spark streaming is not actually streaming, in the sense that some of the window functions cannot properly work on top of micro batching.
No File Management System
Expensive – In-memory capability

31. What is Spark?
Spark is a parallel data processing framework. It allows to develop fast, unified big data application combine batch, streaming and interactive analytics.

32. Why Spark?
Spark is third generation distributed data processing platform. It’s unified bigdata solution for all bigdata processing problems such as batch , interacting, streaming processing.So it can ease many bigdata problems.

33. What is RDD?
Spark’s primary core abstraction is called Resilient Distributed Datasets. RDD is a collection of partitioned data that satisfies these properties. Immutable, distributed, lazily evaluated, catchable are common RDD properties.

34. What is Immutable?
Once created and assign a value, it’s not possible to change, this property is called Immutability. Spark is by default immutable, it’s not allows updates and modifications. Please note data collection is not immutable, but data value is immutable.

35. What is Distributed?
RDD can automatically the data is distributed across different parallel computing nodes.

Author: user

Leave a Reply