Apache Spark interview questions


71. What is the difference between persist() and cache()?
Answer: persist() allows the user to specify the storage level, whereas cache() always uses the default storage level, MEMORY_ONLY. In other words, cache() is the same as calling persist() with the default storage level. Spark offers several persistence levels to choose from depending on your goals: the default persist() stores the data in the JVM heap as unserialized objects, while data written out to disk is always serialized.
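A quick illustration in the PySpark shell (sc is the shell's SparkContext; the RDD contents are illustrative):

>>> from pyspark import StorageLevel
>>> r1 = sc.parallelize(range(100)).cache()          # default level: MEMORY_ONLY
>>> r2 = sc.parallelize(range(100)).persist(StorageLevel.MEMORY_AND_DISK)
>>> r2.getStorageLevel()
StorageLevel(True, True, False, False, 1)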

72. How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function, which removes elements whose key is present in the other RDD.
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtractByKey(y).collect())
[('b', 4), ('b', 5)]

73. What is Spark Core?
Spark Core is the base engine that provides all of Spark's basic functionality: memory management, fault recovery, interacting with storage systems, scheduling and distributing tasks, and so on. All higher-level Spark libraries are built on top of it.
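A minimal sketch of Spark Core in action (assuming a local installation; in the PySpark shell the SparkContext sc is already created for you, and the app name here is illustrative):

>>> from pyspark import SparkContext
>>> sc = SparkContext("local", "core-demo")  # entry point to Spark Core
>>> sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).reduce(lambda a, b: a + b)
20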

74. Is Apache Spark a good fit for reinforcement learning?
No. Spark's machine learning support works well for data-parallel algorithms such as clustering, regression, and classification; reinforcement learning, which requires fine-grained, stateful interaction between an agent and its environment, does not map well onto Spark's batch-oriented model.

75. Explain the popular use cases of Apache Spark
Apache Spark is mainly used for:
Iterative machine learning
Interactive data analytics and processing
Stream processing
Sensor data processing

76. Explain the different types of transformations on DStreams.
Stateless transformations: processing of each batch does not depend on the data of previous batches. Examples: map(), reduceByKey(), filter().
Stateful transformations: processing of a batch depends on intermediate results from previous batches. Examples: transformations that depend on sliding windows, such as reduceByKeyAndWindow(). A sketch of both kinds follows below.
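A sketch in PySpark Streaming (the socket host/port and batch interval are illustrative assumptions):

>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 5)                    # 5-second batches
>>> lines = ssc.socketTextStream("localhost", 9999)  # hypothetical text source
>>> pairs = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1))
>>> pairs.reduceByKey(lambda a, b: a + b).pprint()   # stateless: counts per batch
>>> pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 30, 10).pprint()  # stateful: 30s window, sliding every 10s
>>> ssc.start(); ssc.awaitTermination()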

77. What do you understand by Pair RDD?
RDDs containing key/value pairs are referred to as pair RDDs. Spark provides special operations on them: pair RDDs expose operations that act on each key in parallel or regroup data across the network, which makes them a useful building block in many programs. For example, reduceByKey() aggregates data separately for each key, and join() combines two RDDs by grouping elements that have the same key.
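For example, in the PySpark shell (the data is illustrative):

>>> pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
>>> sorted(pairs.reduceByKey(lambda x, y: x + y).collect())  # aggregate per key
[('a', 4), ('b', 2)]
>>> other = sc.parallelize([("a", "x"), ("b", "y")])
>>> sorted(pairs.join(other).collect())                      # combine on matching keys
[('a', (1, 'x')), ('a', (3, 'x')), ('b', (2, 'y'))]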
