Apache Pig interview questions

1. What is Pig?
Pig is an Apache open-source project that runs on top of Hadoop and provides an engine for parallel data flow. It includes a language called Pig Latin for expressing these data flows, which supports operations such as join, sort, and filter, as well as the ability to write user-defined functions (UDFs) for processing, reading, and writing data. Pig uses both HDFS and MapReduce, i.e., for storage and processing respectively. Pig is a platform for analyzing large data sets, either structured or unstructured, using Pig Latin scripting, and it was intentionally designed for processing streaming and unstructured data in parallel.
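As an illustration of this data-flow model, here is a minimal Pig Latin sketch; the path and field names are hypothetical:

```
-- load raw data from HDFS (hypothetical path and schema)
logs = LOAD '/data/input/logs.txt' USING PigStorage('\t')
       AS (user:chararray, action:chararray, bytes:long);
-- filter and sort, two of the built-in data-flow operations
valid = FILTER logs BY bytes > 0;
ordered = ORDER valid BY bytes DESC;
-- write the results back to HDFS
STORE ordered INTO '/data/output/ordered_logs';
```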

2. What is the difference between Pig and SQL?
Pig Latin is a procedural version of SQL. Pig has certain similarities to SQL, but more differences. SQL is a declarative query language: the user asks a question in query form, and SQL specifies what answer to produce but not how to compute it. If a user wants to perform multiple operations on tables, they must write multiple queries and use temporary tables to store intermediate results. SQL does support subqueries, but many users find them confusing and difficult to form properly; using subqueries creates an inside-out design where the first step in the data pipeline is the innermost query. Pig, by contrast, is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
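To illustrate, a multi-step pipeline reads top to bottom in Pig Latin, with no temporary tables or nested subqueries; the relation and field names below are hypothetical:

```
-- each step names its result, so the pipeline reads in order
orders  = LOAD 'orders' AS (cust_id:int, amount:double);
by_cust = GROUP orders BY cust_id;
totals  = FOREACH by_cust GENERATE group AS cust_id, SUM(orders.amount) AS total;
big     = FILTER totals BY total > 1000.0;
DUMP big;
```

In SQL the same pipeline would typically be written inside-out, with the first step (the grouping) buried in the innermost subquery.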

3. Key differences between Pig and MapReduce?
Pig is a data flow language; its key focus is managing the flow of data from an input source to an output store. As part of managing this flow, it moves data between steps, feeding the output of one process into the next. Its core features include preventing execution of subsequent stages if a previous stage fails, managing temporary storage of data, and, most importantly, compressing and rearranging processing steps for faster execution.
MapReduce, on the other hand, is a data-processing paradigm: a framework in which application developers write code so that it scales easily to petabytes of data, creating a separation between the developer who writes the application and the developer who scales it. The MapReduce development cycle is long, and joining multiple data sets is difficult.
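For example, a join that would require a hand-written reduce-side join in MapReduce is a single operator in Pig Latin; the input names and schemas here are hypothetical:

```
users  = LOAD 'users'  AS (id:int, name:chararray);
orders = LOAD 'orders' AS (uid:int, amount:double);
-- one line replaces the custom reduce-side join code of MapReduce
joined = JOIN users BY id, orders BY uid;
DUMP joined;
```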

4. What is Pig useful for?
Pig is commonly used in three categories:
1) ETL data pipelines
2) Research on raw data
3) Iterative processing
The most common use case for Pig is the data pipeline. For example, web-based companies collect weblogs, and before storing the data in a warehouse they perform operations such as cleaning and aggregation, i.e., transformations on the data. http://help.mortardata.com/data_apps/redshift_data_warehouse/the_example_etl_pipeline
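A sketch of such a weblog ETL step in Pig Latin; the paths and field names are hypothetical:

```
raw = LOAD '/logs/access_log' USING PigStorage(' ')
      AS (ip:chararray, ts:chararray, url:chararray, status:int);
-- cleaning: drop malformed records
clean = FILTER raw BY ip IS NOT NULL AND status IS NOT NULL;
-- aggregation: hits per URL
by_url = GROUP clean BY url;
hits = FOREACH by_url GENERATE group AS url, COUNT(clean) AS n;
-- store the transformed data for the warehouse load
STORE hits INTO '/warehouse/staging/url_hits';
```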

5. What are the scalar datatypes in Pig?
int - 4 bytes
long - 8 bytes
float - 4 bytes
double - 8 bytes
chararray
bytearray
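These types most often appear in a LOAD schema; a small sketch with a hypothetical file and fields:

```
-- hypothetical input file and fields, for illustration only
records = LOAD 'data.txt' USING PigStorage(',')
          AS (id:int, ts:long, score:float, ratio:double,
              name:chararray, raw:bytearray);
DESCRIBE records;  -- prints the declared schema
```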
