Apache Pig interview questions

1. What is Pig?
Pig is an Apache open-source project that runs on top of Hadoop and provides an engine for parallel data flow. It includes a language called Pig Latin for expressing these data flows, which supports operations such as join, sort, and filter, as well as the ability to write user-defined functions (UDFs) for processing, reading, and writing data. Pig uses both HDFS and MapReduce, i.e., for storage and processing respectively. Pig is a platform for analyzing large data sets, either structured or unstructured, using Pig Latin scripting, and it was intentionally designed for processing streaming and unstructured data in parallel.
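As an illustration of this data-flow model, here is a minimal Pig Latin sketch; the path and field names are hypothetical:

```
-- load raw data from HDFS (hypothetical path and schema)
logs = LOAD '/data/input/logs.txt' USING PigStorage('\t')
       AS (user:chararray, action:chararray, bytes:long);
-- filter and sort, two of the built-in data-flow operations
valid = FILTER logs BY bytes > 0;
ordered = ORDER valid BY bytes DESC;
-- write the results back to HDFS
STORE ordered INTO '/data/output/ordered_logs';
```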

2. What is the difference between Pig and SQL?
Pig Latin is a procedural version of SQL. Pig has certain similarities to SQL, but more differences. SQL is a declarative query language: the user asks a question in query form, and SQL specifies what answer to produce but not how to compute it. If a user wants to perform multiple operations on tables, they must write multiple queries and use temporary tables to store intermediate results. SQL does support subqueries, but many users find them confusing and difficult to form properly; using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query. Pig, by contrast, is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
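To illustrate, a multi-step pipeline reads top to bottom in Pig Latin, with no temporary tables or nested subqueries; the relation and field names below are hypothetical:

```
-- each step names its result, so the pipeline reads in order
orders  = LOAD 'orders' AS (cust_id:int, amount:double);
by_cust = GROUP orders BY cust_id;
totals  = FOREACH by_cust GENERATE group AS cust_id, SUM(orders.amount) AS total;
big     = FILTER totals BY total > 1000.0;
DUMP big;
```

In SQL the same pipeline would typically be written inside-out, with the first step (the grouping) buried in the innermost subquery.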

3. Key differences between Pig and MapReduce?
Pig is a data flow language; its key focus is managing the flow of data from an input source to an output store. As part of managing this flow, it moves data between steps, feeding the output of one process into the next. Its core features include preventing execution of subsequent stages if a previous stage fails, managing temporary storage of data, and, most importantly, compressing and rearranging processing steps for faster execution.
MapReduce, on the other hand, is a data-processing paradigm: a framework in which application developers write code so that it scales easily to petabytes of data, creating a separation between the developer who writes the application and the developer who scales it. The MapReduce development cycle is long, and joining multiple data sets is difficult.
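For example, a join that would require a hand-written reduce-side join in MapReduce is a single operator in Pig Latin; the input names and schemas here are hypothetical:

```
users  = LOAD 'users'  AS (id:int, name:chararray);
orders = LOAD 'orders' AS (uid:int, amount:double);
-- one line replaces the custom reduce-side join code of MapReduce
joined = JOIN users BY id, orders BY uid;
DUMP joined;
```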

4. What is Pig useful for?
Pig is commonly used in three categories:
1) ETL data pipelines
2) Research on raw data
3) Iterative processing
The most common use case for Pig is the data pipeline. For example, web-based companies collect weblogs, and before storing the data in a warehouse they perform operations such as cleaning and aggregation, i.e., transformations on the data. http://help.mortardata.com/data_apps/redshift_data_warehouse/the_example_etl_pipeline
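A sketch of such a weblog ETL step in Pig Latin; the paths and field names are hypothetical:

```
raw = LOAD '/logs/access_log' USING PigStorage(' ')
      AS (ip:chararray, ts:chararray, url:chararray, status:int);
-- cleaning: drop malformed records
clean = FILTER raw BY ip IS NOT NULL AND status IS NOT NULL;
-- aggregation: hits per URL
by_url = GROUP clean BY url;
hits = FOREACH by_url GENERATE group AS url, COUNT(clean) AS n;
-- store the transformed data for the warehouse load
STORE hits INTO '/warehouse/staging/url_hits';
```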

5. What are the scalar datatypes in Pig?
int - 4 bytes
long - 8 bytes
float - 4 bytes
double - 8 bytes
chararray
bytearray
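These types most often appear in a LOAD schema; a small sketch with a hypothetical file and fields:

```
-- hypothetical input file and fields, for illustration only
records = LOAD 'data.txt' USING PigStorage(',')
          AS (id:int, ts:long, score:float, ratio:double,
              name:chararray, raw:bytearray);
DESCRIBE records;  -- prints the declared schema
```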
