Apache PIG interview questions

26. What is BloomMapFile used for?
The BloomMapFile is a class that extends MapFile. So its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide quick membership test for the keys. It is used in Hbase table format.

27. What is the difference between logical and physical plans?
Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the physical operators that are needed to execute the script.

28. Does ‘ILLUSTRATE’ run MR job?
No, illustrate will not pull any MR, it will pull the internal data. On the console, illustrate will not do any job. It just shows output of each stage and not the final output.ILLUSTRATE operator is used to review how data is transformed through a sequence of Pig Latin statements. ILLUSTRATE command is your best friend when it comes to debugging a script. This command alone might be a good reason for choosing Pig over something else.Illustrate takes a sample of the data and whenever it comes across operators like join or filter that remove data, it ensures that only some records pass through and some do not, by making modifications to the records such that they meet the condition.

29. Can we say co-group is a group of more than 1 data set?
Co-group is a group of one data set. But in the case of more than one data sets, co-group will group all the data sets and join them based on the common field. Hence, we can say that co-group is a group of more than one data set and join of that data set as well.

30. How Pig differs from MapReduce
1. Complex branching logic which has a lot of nested if .. else .. structures is easier and quicker to implement in Standard MapReduce. Standard MapReduce gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance.
2. Apache PIG is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key), it is simpler to implement things like:
Get top N elements for each group;
Calculate total per each group and than put that total against each row in the group;
Use Bloom filters for JOIN optimisations;
3. Multiquery support (it is when PIG tries to minimise the number on MapReduce Jobs by doing more stuff in a single Job)
4. In MapReduce, the data processing inside the map and reduce phases is opaque to the system.
5. MapReduce does not have a type system. This is intentional, and it gives users the
flexibility to use their own data types and serialization frameworks.

Author: user

Leave a Reply