Hive : Demystifying ‘ISOLATION’ Levels in Apache Hive

Hive @ Freshers.in

What is ISOLATION in Hive?

In the context of databases, ‘ISOLATION’ is a property that defines how/when the changes made by one operation become visible to other concurrent operations. It is one of the four key properties defined by the ACID (Atomicity, Consistency, Isolation, Durability) principle for reliable processing of database transactions.

Apache Hive, by design, does not support full ACID properties as traditional RDBMS do. It does, however, support a concept of ‘ISOLATION’ in a limited manner. The term ‘ISOLATION’ in Hive particularly refers to the ‘READ’ isolation level provided through the implementation of the ‘Hive Transactions’ feature.

There are two types of isolation levels available in Hive:

  1. Read Committed: This is the default isolation level. An operation may see the effects of other completed operations but won’t see the changes of the operations still in progress.
  2. Repeatable Read: At this isolation level, each operation reads the entire snapshot of the data as of the beginning of the operation. Hence, it will not see any changes made by other transactions that are committed after its start time.

When to Use ISOLATION?

The choice of using a specific isolation level in Hive depends largely on your use case.

  • You’d use the Read Committed level when you want to see the latest committed data. This level provides a guarantee that any data read has been committed at the moment it was read, ensuring no ‘dirty reads’.
  • The Repeatable Read level is used when you need consistent results for multiple reads within the same transaction. This level provides a guarantee that any data read once can be re-read with the expectation of retrieving the same data, ensuring no ‘phantom reads’.

Conversions Available

There’s no direct conversion available for the isolation levels as they are the property of the Hive transaction management system. However, you can set different isolation levels based on your requirements by using the SET command:

-- Set Read Committed isolation level
SET hive.txn.read_committed=true;
-- Set Repeatable Read isolation level
SET hive.txn.read_committed=false;

Example DDL + INSERT with ISOLATION Use

Below is an example of a Hive table creation, data insertion, and a transaction using different isolation levels.

-- Create table
CREATE TABLE IF NOT EXISTS employees (
  id INT,
  name STRING,
  department STRING
) CLUSTERED BY(id) INTO 3 BUCKETS STORED AS ORC TBLPROPERTIES('transactional'='true');

-- Insert data
INSERT INTO TABLE employees (id, name, department)
VALUES 
(1, 'Sachin', 'Engineering'),
(2, 'Rajesh', 'Marketing'),
(3, 'Dilu', 'Sales'),
(4, 'Prakash', 'Marketing'),
(5, 'Indu', 'Sales');

-- Set Repeatable Read isolation level and start transaction
SET hive.txn.read_committed=false;
START TRANSACTION;

-- Query data within the transaction
SELECT * FROM employees WHERE department = 'Sales';

-- (Assume some other transaction modifies the 'Sales' department data here)

-- Query again
SELECT * FROM employees WHERE department = 'Sales';

-- Commit transaction
COMMIT;

Regardless of any changes made by other transactions during the active transaction, the data retrieved by the two SELECT statements within the transaction will be the same.

Author: user

Leave a Reply