Apache Spark interview questions

PySpark @ Freshers.in

127. What is Apache Mesos?
Apache Mesos is a general-purpose cluster manager that can run both analytics workloads and long-running services (e.g., web applications or key/value stores) on a cluster. To use Spark on Mesos, pass a mesos:// URI to spark-submit:
spark-submit --master mesos://masternode:5050 yourapp

128. What is a stage in Apache Spark?
A stage is a set of independent tasks, all computing the same function, that run as part of a Spark job and share the same shuffle dependencies. Each DAG of tasks run by the scheduler is split into stages at the boundaries where a shuffle occurs, and the DAGScheduler then runs these stages in topological order.
Each stage is either a shuffle map stage, whose tasks' results are input for another stage, or a result stage, whose tasks directly compute the action that initiated the job (e.g. count(), save()). For shuffle map stages, Spark also tracks the node on which each output partition resides.
Each stage also has a jobId identifying the job that first submitted it. With FIFO scheduling, this allows stages from earlier jobs to be computed first or recovered faster on failure.
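
As a minimal PySpark sketch of the idea (the local master URL and app name below are placeholders): reduceByKey() introduces a shuffle boundary, so the job submitted by collect() is split into a shuffle map stage and a result stage.

from pyspark import SparkContext
sc = SparkContext("local[*]", "stage-demo")

# Stage 1 (shuffle map stage): these tasks write shuffle output for reduceByKey().
pairs = sc.parallelize(["a", "b", "a", "c"]).map(lambda w: (w, 1))

# Stage 2 (result stage): the action collect() submits the job, and the
# DAGScheduler runs the two stages in topological order.
print(pairs.reduceByKey(lambda x, y: x + y).collect())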

129. What are coalesce() and repartition() in Apache Spark?
Sometimes we want to change the partitioning of an RDD outside the context of grouping and aggregation operations. For those cases, Spark provides the repartition() function, which shuffles the data across the network to create a new set of partitions. Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that avoids data movement, but only if you are decreasing the number of RDD partitions. To know whether you can safely call coalesce(), check the current partition count with rdd.partitions.size() in Java/Scala or rdd.getNumPartitions() in Python, and make sure you are coalescing to fewer partitions than the RDD currently has.
coalesce() avoids a full shuffle. If the number of partitions is known to be decreasing, the executors can safely keep data on the minimum number of partitions, only moving the data from the extra nodes onto the nodes that are kept.
So, it would go something like this:
Node 1 = 1,2,3
Node 2 = 4,5,6
Node 3 = 7,8,9
Node 4 = 10,11,12
Then coalesce down to 2 partitions:
Node 1 = 1,2,3 + (10,11,12)
Node 3 = 7,8,9 + (4,5,6)
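
To make this concrete, here is a small sketch (assuming a local SparkContext; the data and partition counts are illustrative) that checks the current partition count and then compares repartition() with coalesce():

from pyspark import SparkContext
sc = SparkContext("local[*]", "coalesce-demo")

rdd = sc.parallelize(range(1, 13), numSlices=4)
print(rdd.getNumPartitions())                   # 4

print(rdd.repartition(8).getNumPartitions())    # 8 -- full shuffle, can increase partitions
print(rdd.coalesce(2).getNumPartitions())       # 2 -- merges partitions, avoids a full shuffle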

130. What is sortByKey() in Apache Spark?
The sortByKey() function takes a parameter called ascending indicating whether we want the result in ascending order (it defaults to True). For example, in PySpark:
rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))
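
A short worked example (assuming a local SparkContext; the key/value data is made up):

from pyspark import SparkContext
sc = SparkContext("local[*]", "sortbykey-demo")

rdd = sc.parallelize([("banana", 2), ("Apple", 1), ("cherry", 3)])

# Ascending order on the raw keys; 'Apple' sorts first because of its capital letter.
print(rdd.sortByKey().collect())

# Descending order, case-insensitive, using keyfunc.
print(rdd.sortByKey(ascending=False, keyfunc=lambda k: k.lower()).collect())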

131. What are the actions available on pair RDDs in Apache Spark?
countByKey()
collectAsMap()
lookup(key)
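
A short example of all three on a toy pair RDD (local SparkContext assumed; the data is made up):

from pyspark import SparkContext
sc = SparkContext("local[*]", "pair-actions-demo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

print(pairs.countByKey())      # defaultdict {'a': 2, 'b': 1} -- number of elements per key
print(pairs.collectAsMap())    # dict with one value kept per key when keys repeat
print(pairs.lookup("a"))       # [1, 3] -- every value for the key 'a'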

132. What are the operations that benefit from partitioning in Apache Spark?
Many of Spark's operations involve shuffling data by key across the network, and all of these benefit from partitioning. As of Spark 1.0, the operations that benefit from partitioning are cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), and lookup().
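
A sketch of the idea (local SparkContext assumed; the table contents and partition count are made up): hash-partitioning a large pair RDD once with partitionBy() and caching it lets a later join() reuse that partitioning instead of reshuffling the large side.

from pyspark import SparkContext
sc = SparkContext("local[*]", "partitioning-demo")

# Large lookup table: hash-partitioned once up front and cached.
users = sc.parallelize([(i, "user%d" % i) for i in range(1000)]).partitionBy(8).cache()

# Small RDD of events; join() only needs to shuffle this side to match users' partitioning.
events = sc.parallelize([(1, "click"), (42, "view"), (7, "click")])

print(users.join(events).take(3))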
