Apache Spark interview questions

PySpark @ Freshers.in

36. What is lazy evaluation?
Spark does not execute a program's transformations immediately. Transformations are only recorded as a lineage, and nothing is computed until an action (such as collect() or count()) is called. This laziness lets Spark optimize the whole execution plan before running it.
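Spark's lazy transformations behave much like Python generators: building the pipeline does no work, and evaluation happens only when a result is demanded. A minimal pure-Python analogy (not actual Spark code):

```python
# Pure-Python analogy for Spark's lazy transformations:
# the generator expression builds a pipeline but executes nothing yet.
evaluated = []

def trace(x):
    evaluated.append(x)   # records when an element is actually processed
    return x * 2

pipeline = (trace(x) for x in range(5))   # "transformation": nothing runs yet
assert evaluated == []                    # still lazy

result = list(pipeline)                   # "action": forces evaluation
assert evaluated == [0, 1, 2, 3, 4]
assert result == [0, 2, 4, 6, 8]
```

In PySpark the same shape appears as `rdd.map(f)` (lazy) followed by `rdd.collect()` (eager).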

37. What is caching?
Caching keeps data in memory for computation rather than going back to disk. By caching intermediate results, Spark can process data up to 100 times faster than disk-based Hadoop MapReduce.
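The idea can be sketched in pure Python with memoization: compute a value once, then serve repeats from memory instead of recomputing (analogous to calling `cache()` or `persist()` on an RDD; this is an analogy, not the Spark API):

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def expensive(x):
    calls["count"] += 1   # simulates an expensive recomputation / disk read
    return x * x

first = [expensive(i) for i in range(3)]   # computed and stored in memory
second = [expensive(i) for i in range(3)]  # served from the in-memory cache
assert first == second == [0, 1, 4]
assert calls["count"] == 3                 # each value was computed only once
```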

38. What is the Spark engine's responsibility?
Spark is responsible for scheduling, distributing, and monitoring applications across the cluster.

39. What are the common Spark ecosystem components?
Spark SQL (Shark) for SQL developers,
Spark Streaming for streaming data,
MLlib for machine learning algorithms,
GraphX for graph computation,
SparkR to run R on the Spark engine, and
BlinkDB for interactive queries over massive data are the common Spark ecosystem components. GraphX, SparkR, and BlinkDB are in the incubation stage.

40. What are partitions?
A partition is a logical division of the data, an idea derived from MapReduce splits. Data is divided into small chunks so it can be processed in parallel, which supports scalability and speeds up processing. Input data, intermediate data, and output data are all partitioned RDDs.
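Partitioning can be sketched in pure Python as slicing a dataset into chunks that are processed independently and then combined (the `partition` helper and chunk sizes here are illustrative, not Spark's actual partitioner):

```python
def partition(data, num_partitions):
    """Split data into roughly equal chunks, like RDD partitions."""
    size = -(-len(data) // num_partitions)   # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(10))
parts = partition(data, 3)
assert parts == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Each partition can be processed independently (in parallel on a cluster),
# then the partial results are combined.
partial_sums = [sum(p) for p in parts]
assert sum(partial_sums) == sum(data)
```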

41. How does Spark partition the data?
Spark uses the MapReduce API to partition the data. The number of partitions can be set in the input format. By default the HDFS block size is the partition size (for best performance), but the partition size can be changed, much like a split.

42. How does Spark store data?
Spark is a processing engine; it has no storage engine of its own. It can retrieve data from any storage system, such as HDFS, S3, or other data sources.
