Hive : How to drop duplicate rows from Hive table.

Hive @ Freshers.in

This is a work around to show how can we drop duplicate rows from Hive table.

Here is how to create a table named freshers_students and insert 10 records into it:

CREATE TABLE freshers_students(
    student_id INT, 
    first_name STRING,
    last_name STRING,
    age INT,
    major STRING,
    year_enrolled STRING,
    email STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

The above will create a Managed hive table.

Insert data in to the above table 

INSERT INTO TABLE freshers_students VALUES (1, "Sachin", "P", 22, "Computer Science", "2023", "Sachin.P@freshers.in");
INSERT INTO TABLE freshers_students VALUES (2, "Prabhath", "Roy", 21, "Physics", "2022", "Prabhath.Roy@freshers.in");
INSERT INTO TABLE freshers_students VALUES (3, "Alice", "Peter", 20, "Mathematics", "2023", "alice.Peter@freshers.in");
INSERT INTO TABLE freshers_students VALUES (4, "Aami", "Nair", 22, "Biology", "2023", "Aami.Nair@freshers.in");
INSERT INTO TABLE freshers_students VALUES (5, "Abhisarikaa", "Davis", 21, "Chemistry", "2022", "Abhisarikaa.davis@freshers.in");
INSERT INTO TABLE freshers_students VALUES (6, "Dhanya", "James", 20, "Psychology", "2023", "Dhanya.James@freshers.in");
INSERT INTO TABLE freshers_students VALUES (7, "Emily", "Jones", 22, "Political Science", "2023", "emily.jones@freshers.in");
INSERT INTO TABLE freshers_students VALUES (8, "Dheeman", "Jones", 21, "Economics", "2022", "Dheeman.Jones@freshers.in");
INSERT INTO TABLE freshers_students VALUES (9, "Iha", "Thomas", 20, "English", "2023", "Iha.Thomas@freshers.in");
INSERT INTO TABLE freshers_students VALUES (10, "Idhant", "Vedu", 22, "History", "2023", "Idhant.Vedu@freshers.in");

We will insert the same data in another way so that duplicate will be created 

INSERT INTO TABLE freshers_students
SELECT 1, "Sachin", "P", 22, "Computer Science", "2023", "Sachin.P@freshers.in" UNION ALL
SELECT 2, "Prabhath", "Roy", 21, "Physics", "2022", "Prabhath.Roy@freshers.in" UNION ALL
SELECT 3, "Alice", "Peter", 20, "Mathematics", "2023", "alice.Peter@freshers.in" UNION ALL
SELECT 4, "Aami", "Nair", 22, "Biology", "2023", "Aami.Nair@freshers.in" UNION ALL
SELECT 5, "Abhisarikaa", "Davis", 21, "Chemistry", "2022", "Abhisarikaa.davis@freshers.in" UNION ALL
SELECT 6, "Dhanya", "James", 20, "Psychology", "2023", "Dhanya.James@freshers.in" UNION ALL
SELECT 7, "Emily", "Jones", 22, "Political Science", "2023", "emily.jones@freshers.in" UNION ALL
SELECT 8, "Dheeman", "Jones", 21, "Economics", "2022", "Dheeman.Jones@freshers.in" UNION ALL
SELECT 9, "Iha", "Thomas", 20, "English", "2023", "Iha.Thomas@freshers.in" UNION ALL
SELECT 10, "Idhant", "Vedu", 22, "History", "2023", "Idhant.Vedu@freshers.in";

Now we have the duplicate data .

Hive is not designed for handling row-level updates or deletions, which includes removing duplicates. However, there are a few workarounds to achieve this. The most common way to remove duplicates from a Hive table is to create a new table that selects distinct records from the original table. Now we need to create another table with a similar structure from the freshers_students table by removing the duplicate . The below SQL will help for the same.

CREATE TABLE freshers_students_dedup AS
SELECT student_id, first_name, last_name, age, major, year_enrolled, email
FROM (
  SELECT student_id, first_name, last_name, age, major, year_enrolled, email,
         row_number() over (partition by student_id order by student_id) as row_num
  FROM freshers_students
) t
WHERE t.row_num = 1;

In this code, a new table named freshers_students_dedup is created. The ROW_NUMBER() function is used to assign a unique number to each row in the freshers_students table for each student_id. The PARTITION BY clause is used to reset the row number for each student_id. The ORDER BY clause is used to determine the order of the rows in each partition.

Finally, the WHERE clause is used to select only the first row for each student_id, effectively removing duplicates. This will work as long as student_id is the unique identifier for each student. If it’s not, you’ll need to adjust the PARTITION BY clause to use the actual unique identifier(s).

Now if you check the freshers_students_dedup table , you will find no duplicate. 

Now you need to rename the master table (freshers_students)  to an freshers_students_old table name.

ALTER TABLE freshers_students RENAME TO freshers_students_old;

Now you need to rename the freshers_students_dedup  to freshers_students table .

ALTER TABLE freshers_students_dedup RENAME TO freshers_students;

Make sure to back you the table , so if incase you need to revert.

Hive important pages to refer

  1. Hive
  2. Hive Interview Questions
  3. Hive Official Page
  4. Spark Examples
  5. PySpark Blogs
  6. Bigdata Blogs
  7. Spark Interview Questions
  8. Spark Official Page
Author: user

Leave a Reply