Hive : Optimizing queries using Materialized Views using REWRITE option

Hive @ Freshers.in

Apache Hive is a popular data warehousing tool built on top of Hadoop for managing and querying large datasets. Among various features and options, one that stands out is the REWRITE option, especially useful in optimizing queries using Materialized Views (MVs).

Materialized Views provide a way to precompute and store aggregated or processed data for more efficient querying. With the REWRITE option, Hive can automatically utilize these Materialized Views when executing queries, reducing processing time and resources.

In this article, we will explore the REWRITE option in Hive, focusing on how to create tables using the freshers_in_ prefix and leverage Materialized Views.

1. Creating Tables with freshers_in_ Prefix

Let’s start by creating a table named “freshers_in_employees” with necessary columns:

CREATE TABLE freshers_in_employees (
    id INT,
    name STRING,
    department STRING,
    salary FLOAT
)
STORED AS ORC;

2. Materialized Views in Hive

Materialized Views are stored queries that contain the result set of a query. They can be leveraged by Hive’s query optimizer to rewrite queries, automatically substituting parts of a user’s query with the precomputed results in the Materialized View.

For instance, let’s create a Materialized View that contains the average salary per department from the “freshers_in_employees” table:

CREATE MATERIALIZED VIEW freshers_in_avg_salary_by_department
AS SELECT department, AVG(salary) AS avg_salary
FROM freshers_in_employees
GROUP BY department;

3. Query Optimization with REWRITE

Once the Materialized View is created, Hive’s query optimizer can use it to rewrite suitable queries, automatically substituting them with the Materialized View. The REWRITE option enables this behavior.

Consider the following query, which retrieves the average salary by department:

SELECT department, AVG(salary) AS avg_salary
FROM freshers_in_employees
GROUP BY department;

With the REWRITE option enabled, Hive can automatically rewrite this query to:

SELECT department, avg_salary
FROM freshers_in_avg_salary_by_department;

This rewriting reduces the computation required to answer the query, as the aggregation is already computed and stored in the Materialized View.

4. Enabling and Disabling Automatic Query Rewrite

The automatic query rewrite functionality can be controlled using the hive.materializedview.rewriting configuration parameter. It can be set to true or false to enable or disable the feature, respectively.

Enable:

SET hive.materializedview.rewriting=true;

Disable:

SET hive.materializedview.rewriting=false;

5. Manual Control of REWRITE

In some cases, developers may wish to manually control the rewriting process. This can be done using the EXPLAIN REWRITE command to see how the optimizer would rewrite a given query using the Materialized Views.

EXPLAIN REWRITE
SELECT department, AVG(salary) AS avg_salary
FROM freshers_in_employees
GROUP BY department;

The REWRITE option in Hive represents an advanced optimization technique, enhancing query performance through the utilization of Materialized Views.

Hive important pages to refer

  1. Hive
  2. Hive Interview Questions
  3. Hive Official Page
  4. Spark Examples
  5. PySpark Blogs
  6. Bigdata Blogs
  7. Spark Interview Questions
  8. Spark Official Page
Author: user

Leave a Reply