Apache Storm interview questions

1. What is Apache Storm?
Apache Storm is a free and open-source distributed realtime computation system. Apache Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Apache Storm is simple and can be used with any programming language.

2. What are the different types of nodes on a Storm cluster?
There are two kinds of nodes on a Storm cluster: the master node and the worker nodes. The master node runs a daemon called “Nimbus” that is similar to Hadoop’s “JobTracker”. Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures. Each worker node runs a daemon called the “Supervisor”. The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.

3. What are topologies in Apache Storm?
A topology is a graph of computation. Each node in a topology contains processing logic, and links between nodes indicate how data should be passed around between nodes. To do realtime computation on Storm, we need to create “topologies”. Since topology definitions are just Thrift structs, and Nimbus is a Thrift service, you can create and submit topologies using any programming language.
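Below is a minimal sketch of building and submitting a topology in local mode. The WordSpout and PrintBolt classes are hypothetical placeholders, and the package names assume Storm 1.x or later (org.apache.storm):

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class DemoTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout());                          // hypothetical spout
        builder.setBolt("print", new PrintBolt()).shuffleGrouping("words"); // hypothetical bolt
        LocalCluster cluster = new LocalCluster(); // simulates a Storm cluster in process
        cluster.submitTopology("demo", new Config(), builder.createTopology());
        Thread.sleep(10000);                       // let the topology run briefly
        cluster.shutdown();
    }
}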

4. What are streams in Apache Storm?
The core abstraction in Storm is the “stream”. A stream is an unbounded sequence of tuples. Storm provides the primitives for transforming a stream into a new stream in a distributed and reliable way.

5. What are spouts in Apache Storm?
A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology.
Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.
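As a concrete illustration, here is a minimal sketch of a spout that declares and emits to two named streams; the stream ids and emitted values are hypothetical:

import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class MultiStreamSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Declare two named streams instead of the single default stream.
        declarer.declareStream("words", new Fields("word"));
        declarer.declareStream("signals", new Fields("signal"));
    }

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        Utils.sleep(100);
        // The first argument to emit selects the target stream.
        collector.emit("words", new Values("hello"));
        collector.emit("signals", new Values("tick"));
    }
}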

6. What are bolts in Apache Storm?
All processing in topologies is done in bolts. Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.
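For example, here is a minimal sketch of a simple transformation bolt; it extends BaseBasicBolt, which handles acking automatically, and the field names are hypothetical:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ExclamationBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // A simple stream transformation: append "!!!" to the incoming word.
        collector.emit(new Values(tuple.getString(0) + "!!!"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}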

7. What are the advantages of using Apache Storm?
Distributed realtime processing
Stateless design; data is streamed rather than stored
Stream abstraction as the core primitive
Micro-batch processing (via Trident)

8. Explain the components of the Apache Storm system.
Nimbus – the master daemon, similar to Hadoop’s JobTracker; it distributes code across the cluster, assigns tasks to machines, and monitors for failures.
Zookeeper – the coordination service through which the nodes of a Storm cluster communicate and where cluster state is kept.
Supervisor – runs on each worker node; it interacts with Nimbus through Zookeeper and starts and stops worker processes as per Nimbus’s instructions.

9. How many categories are there to define stream grouping in Apache Storm?
There are seven built-in stream groupings (a wiring sketch follows the list):
Shuffle grouping
Fields grouping
None grouping
Local or shuffle grouping
Global grouping
All grouping
Direct grouping
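The following sketch shows a few of these groupings being wired into a topology; the spout and bolt classes are hypothetical placeholders:

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new SentenceSpout(), 2);
// Shuffle grouping: tuples are distributed randomly and evenly across the bolt's tasks.
builder.setBolt("split", new SplitBolt(), 4).shuffleGrouping("sentences");
// Fields grouping: tuples with the same "word" value always go to the same task.
builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("split", new Fields("word"));
// Global grouping: the entire stream goes to a single task.
builder.setBolt("report", new ReportBolt(), 1).globalGrouping("count");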

10. Explain the role of Zookeeper in Apache Storm.
Zookeeper coordinates the Storm cluster: Nimbus and the Supervisors communicate through it, and cluster state is stored in it. Zookeeper is not involved in message passing, so the workload it carries is very low.

11. When is the cleanup method called in Apache Storm?
The cleanup method is called when a bolt is being shut down and should clean up any resources that were opened. There’s no guarantee that this method will be called on the cluster: for example, if the machine the task is running on blows up, there’s no way to invoke the method. The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated in process), and you want to be able to run and kill many topologies without suffering any resource leaks.
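A minimal sketch of a bolt that overrides cleanup to release a resource; the JDBC URL is a hypothetical placeholder:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class DbWriterBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Connection conn;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            conn = DriverManager.getConnection("jdbc:h2:mem:demo"); // hypothetical URL
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        collector.ack(tuple); // the actual database write is omitted for brevity
    }

    @Override
    public void cleanup() {
        // Called on orderly shutdown; only guaranteed in local mode, so treat it as best-effort.
        try {
            if (conn != null) conn.close();
        } catch (SQLException ignored) {
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt emits nothing.
    }
}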

12. How do you set up SSL for the Apache Storm UI?
To enable HTTPS for the UI, users need to set the following configs in storm.yaml. Generating keystores with the proper keys and certificates should be taken care of by the user before this step.
1. ui.https.port
2. ui.https.keystore.type (example “jks”)
3. ui.https.keystore.path (example “/etc/ssl/storm_keystore.jks”)
4. ui.https.keystore.password (keystore password)
5. ui.https.key.password (private key password)
Optional configs:
6. ui.https.truststore.path (example “/etc/ssl/storm_truststore.jks”)
7. ui.https.truststore.password (truststore password)
8. ui.https.truststore.type (example “jks”)
If users want to set up 2-way authentication:
9. ui.https.want.client.auth (if set to true, the server requests client certificate authentication, but keeps the connection if no authentication is provided)
10. ui.https.need.client.auth (if set to true, the server requires the client to provide authentication)
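Putting it together, a storm.yaml fragment might look like the following; the port, paths, and passwords are placeholder values:

ui.https.port: 8443
ui.https.keystore.type: "jks"
ui.https.keystore.path: "/etc/ssl/storm_keystore.jks"
ui.https.keystore.password: "changeit"
ui.https.key.password: "changeit"
# Optional truststore settings:
ui.https.truststore.path: "/etc/ssl/storm_truststore.jks"
ui.https.truststore.password: "changeit"
ui.https.truststore.type: "jks"
# Optional two-way (mutual) authentication:
ui.https.want.client.auth: true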

13. Apache Kafka vs Apache Storm
a. Data Security
i. Apache Kafka
Basically, Kafka does not guarantee zero data loss; it provides a very strong but not absolute guarantee. For example, at about 7 million message transactions per day, Netflix reported roughly 0.01% data loss.
ii. Apache Storm
In comparison with Kafka, Storm guarantees that every tuple will be fully processed, via its reliability (acking) mechanism.
b. Data Storage
i. Apache Kafka
Apache Kafka stores its data on the local filesystem, such as EXT4 or XFS.
ii. Apache Storm
Storm, on the other hand, is just a data processing framework: it doesn’t store data, it just passes it from input streams to output streams.
c. Real-time messaging system
i. Apache Kafka
Kafka stores incoming messages so that they can be processed later; it is a message broker rather than a processing system.
ii. Apache Storm
Storm, however, is a real-time system: it processes messages as they arrive.
d. Processing/ Transforming
i. Apache Kafka
We use Apache Kafka for moving and buffering real-time data.
ii. Apache Storm
Whereas we use Storm for transforming and processing that data.
e. Data Source
i. Apache Kafka
Basically, Kafka pulls its data from the actual source of the data.
ii. Apache Storm
Storm, on the other hand, often gets its data from Kafka itself for further processing.
f. Basic Task
i. Apache Kafka
When it comes to transferring real-time application data from one source application to another, we use Kafka.
ii. Apache Storm
We use Storm for aggregation and computation.
g. Zookeeper Dependency
i. Apache Kafka
Setting up Kafka requires Apache Zookeeper; it is mandatory.
ii. Apache Storm
Storm also depends on Zookeeper: Nimbus and the Supervisors coordinate through it (see question 10).
h. Fault-Tolerant
i. Apache Kafka
Kafka is fault-tolerant through partition replication across brokers, with Zookeeper tracking cluster state.
ii. Apache Storm
Storm’s daemons are fail-fast and stateless, and they can be restarted automatically under supervision (see question 17).
i. Inventor
i. Apache Kafka
Kafka was created at LinkedIn.
ii. Apache Storm
Whereas Storm originated at BackType and was open-sourced by Twitter after it acquired the company.
j. Language Support
i. Apache Kafka
Basically, Kafka has clients for many languages, but it works best with Java, which has the primary and most complete client.
ii. Apache Storm
Storm supports virtually any language, since topology components can be written in other languages via Storm’s multi-lang protocol.

14. Does the Apache Storm UI support a REST API?
The Storm UI daemon provides a REST API that allows you to interact with a Storm cluster, which includes retrieving metrics data and configuration information as well as management operations such as starting or stopping topologies.
The API base URL would thus be:
http://<ui-host>:<ui-port>/api/v1/…
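As an illustration, the cluster summary endpoint can be queried with plain JDK HTTP; the host, port, and endpoint path below assume a default local UI and the /api/v1/cluster/summary resource:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class StormRestExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/api/v1/cluster/summary");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON summary of the cluster
            }
        }
    }
}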

15. What happens when a worker dies in Apache Storm?
When a worker dies, the supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reschedule the worker.

16. What happens when a node dies in Apache Storm?
The tasks assigned to that machine will time-out and Nimbus will reassign those tasks to other machines.

17. What happens when Nimbus or Supervisor daemons die in Apache Storm?
The Nimbus and Supervisor daemons are designed to be fail-fast (the process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in Zookeeper or on disk). The Nimbus and Supervisor daemons must be run under supervision using a tool like daemontools or monit. So if the Nimbus or Supervisor daemons die, they restart like nothing happened. Most notably, no worker processes are affected by the death of Nimbus or the Supervisors. This is in contrast to Hadoop, where if the JobTracker dies, all the running jobs are lost.

18. Is Nimbus a single point of failure in Apache Storm?
If you lose the Nimbus node, the workers will still continue to function. Additionally, supervisors will continue to restart workers if they die. However, without Nimbus, workers won’t be reassigned to other machines when necessary (like if you lose a worker machine).

19. What is Apache Storm’s reliability API?
There are two things you have to do as a user to benefit from Storm’s reliability capabilities. First, you need to tell Storm whenever you’re creating a new link in the tree of tuples. Second, you need to tell Storm when you have finished processing an individual tuple. By doing both these things, Storm can detect when the tree of tuples is fully processed and can ack or fail the spout tuple appropriately. Storm’s API provides a concise way of doing both of these tasks.
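A minimal sketch of both steps in a bolt: anchoring each emitted tuple to the input (creating a new link in the tuple tree) and then acking the input. The field layout is hypothetical:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        for (String word : tuple.getString(0).split(" ")) {
            // Anchoring: passing the input tuple links each new tuple into the tree.
            collector.emit(tuple, new Values(word));
        }
        // Acking: tells Storm this bolt has finished processing the input tuple.
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}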

20. What happens if a message is fully processed or fails to be fully processed in Apache Storm?
To understand this question, let’s take a look at the lifecycle of a tuple coming off of a spout. For reference, here is the interface that spouts implement (see the Javadoc for more information):

public interface ISpout extends Serializable {
    // Called once when the spout task is initialized within a worker.
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    // Called when the spout is being shut down.
    void close();
    // Requests that the spout emit the next tuple, if one is available.
    void nextTuple();
    // Called when a tuple emitted with the given message id was fully processed.
    void ack(Object msgId);
    // Called when a tuple emitted with the given message id failed or timed out.
    void fail(Object msgId);
}
First, Storm requests a tuple from the Spout by calling the nextTuple method on the Spout. The Spout uses the SpoutOutputCollector provided in the open method to emit a tuple to one of its output streams. When emitting a tuple, the Spout provides a “message id” that will be used to identify the tuple later. For example, the KestrelSpout reads a message off of the kestrel queue and emits as the “message id” the id provided by Kestrel for the message. Emitting a message to the SpoutOutputCollector looks like this:
_collector.emit(new Values("field1", "field2", 3), msgId);

Next, the tuple gets sent to consuming bolts and Storm takes care of tracking the tree of messages that is created. If Storm detects that a tuple is fully processed, Storm will call the ack method on the originating Spout task with the message id that the Spout provided to Storm. Likewise, if the tuple times out, Storm will call the fail method on the Spout. Note that a tuple will be acked or failed by the exact same Spout task that created it: even if a Spout is executing many tasks across the cluster, a tuple won’t be acked or failed by a different task than the one that created it.

Let’s use KestrelSpout again to see what a Spout needs to do to guarantee message processing. When KestrelSpout takes a message off the Kestrel queue, it “opens” the message. This means the message is not actually taken off the queue yet, but instead placed in a “pending” state waiting for acknowledgement that the message is completed. While in the pending state, a message will not be sent to other consumers of the queue. Additionally, if a client disconnects all pending messages for that client are put back on the queue. When a message is opened, Kestrel provides the client with the data for the message as well as a unique id for the message. The KestrelSpout uses that exact id as the “message id” for the tuple when emitting the tuple to the SpoutOutputCollector. Sometime later on, when ack or fail are called on the KestrelSpout, the KestrelSpout sends an ack or fail message to Kestrel with the message id to take the message off the queue or have it put back on.

21. How many workers should I use in Apache Storm?
The total number of available workers is set by the supervisors: each supervisor superintends some number of JVM slots. What you set on the topology is how many worker slots it will try to claim.
There’s no great reason to use more than one worker per topology per machine.
With one topology running on three 8-core nodes and a parallelism hint of 24, each bolt gets 8 executors per machine, i.e. one for each core. There are three big benefits to running three workers (with 8 assigned executors each) compared to running, say, 24 workers (with one assigned executor each).
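In code, the worker count is a topology config while executor counts are parallelism hints; here is a sketch for the three-node scenario above, with hypothetical component classes:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class WorkerSizingExample {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SentenceSpout(), 3);                    // hypothetical spout
        builder.setBolt("work", new WorkBolt(), 24).shuffleGrouping("spout"); // parallelism hint 24
        Config conf = new Config();
        conf.setNumWorkers(3); // one worker JVM per machine on a 3-node cluster
        StormSubmitter.submitTopology("sized-topology", conf, builder.createTopology());
    }
}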

22. How do you set the batch size in Apache Storm?
Trident doesn’t place its own limit on the batch size. In the case of the Kafka spout, the maximum fetch size in bytes divided by the average record size gives the effective number of records per partition sub-batch.

23. How does Apache Storm implement reliability in an efficient way?
A Storm topology has a set of special “acker” tasks that track the DAG of tuples for every spout tuple. When an acker sees that a DAG is complete, it sends a message to the spout task that created the spout tuple to ack the message. You can set the number of acker tasks for a topology in the topology configuration using Config.TOPOLOGY_ACKERS. Storm defaults TOPOLOGY_ACKERS to one task per worker.
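A small sketch of setting the acker count via the topology configuration; the value 4 is arbitrary:

import org.apache.storm.Config;

Config conf = new Config();
// Equivalent to setting Config.TOPOLOGY_ACKERS; these tasks track each spout tuple's DAG.
conf.setNumAckers(4);
// Setting it to 0 disables tracking: spout tuples are acked immediately when they are emitted.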
