Apache PIG interview questions

31. What Is Pig Useful For?
Pig Latin use cases tend to fall into three separate categories: traditional extract transform load (ETL) data pipelines, research on raw data, and iterative processing.
For research on raw data, some users prefer Pig Latin. Because Pig can operate in situations where the schema is unknown, incomplete, or inconsistent, and because it can easily manage nested data, researchers who want to work on data before it has been cleaned.

32. What are the thing pig need to know to run on a cluster
The only thing Pig needs to know to run on your cluster is the location of your cluster’s
NameNode and JobTracker. The NameNode is the manager of HDFS, and the Job- Tracker coordinates MapReduce jobs.

33. Where can I find Hadoop location ?
Hadoop location are found in hadoop-site.xml file. In Hadoop 0.20 and later, they are in three separate files: core-site.xml, hdfs-site.xml, and mapred-site.xml.
Hadoop-site.xml will look like the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>namenode_hostname:port</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>jobtrack_hostname:port</value>
</property>
</configuration>
PIG_CLASSPATH=hadoop_conf_dir pig_path/bin/pig -e fs -mkdir /user/username
pig -e fs -ls will list your home directory of HDFS

34. What are the Complex Types in PIG
Pig has three complex data types: maps, tuples, and bags.
Map
A map in Pig is a chararray to data element mapping, where that element can be any
Pig type, including a complex type. The chararray is called a key and is used as an index
to find the element, referred to as the value. Pig does not know the type of the value, it will assume it is a bytearray.If you do not cast the value, Pig will make a best guess based on how you use the value in your script.If the value is of a type other than bytearray, Pig will figure that out at runtime and handle it. Map constants are formed using brackets to delimit the map, a hash between keys and values, and a comma between key-value pairs. For example, [‘name’#’bob’,’age’#55] will create a map with two keys, “name” and “age”.
Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields.
Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to
reference tuples in a bag by position.
Bag is the one type in Pig that is not required to fit into memory. As you will see later,
because bags are used to store collections when grouping, bags can become quite large.

35. What makes easier to program in Apache Pig than Hadoop MapReduce?
The initial step of a PigLatin program is to load the data from HDFS. Run the data through a series of business transformations (these transformations are internally converted to MapReduce task, so the developers don’t have to write the Java code for the business logic). Store the results in a file or present them on the interface.

Author: user

Leave a Reply