Adsense

Friday, April 26, 2013

DB2 Query to remove duplicate records using single field or multiple fields


Initially due to poor database design , if there is no primary key then database may allow duplicate records either by application or directly. Now you may need to add a primary key . But data base won't allow to add a primary key . Because it has duplicate values. Let us consider the Employee table. It has empcode , empname , designation, salary, email fields. If empcode is entered through manual , there may be a chance that the empcode to be duplicate. For removing the duplicate empcode or displaying duplicate records ,the following methods may be used .

Method I . If the table has already unique value field like Row_id , you can use that field as row-id , otherwise you can create unique value field (Row_id) using the following ways
a) Add a character filed (ROW_ID) with width 13 by the following commad
ALTER TABLE EMPLOYEE ADD COLUMN ROWID CHARACTER (13) ;
Now update the ROW_ID field with unique values with the following command
UPDATE EMPLOYEE SET ROW_ID=GENERATE_UNIQUE();
Now the ROW_ID field is filled with unique timestamp values.

b) Create a new table with identity column (ROW_ID) . Then insert the values from the original table to new table . Now the new tables ROW_ID column is filled with unique values starting 1 or mentioned during table creation. Commands are as follows
CREATE TABLE EMPLOYEE1 ( ROW_ID INT NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1, NO CACHE), empcode varchar(10) , empname varchar(50) , designation varchar(25) , salary decimal(10) ,email varchar(50) );
insert into EMPLOYEE1 (empcode, empname, designation , salary, email) select empcode, empname, designation, salary, email from Employee
You can use either a) or b)
Now You can display the duplicate records except unique values with following queries
select * FROM emp4 a WHERE ROW_ID> (SELECT min(ROW_ID) FROM emp4 b WHERE a.empcode = b.empcode );
To view only unique records , run the following query ,
-select * FROM emp4 a WHERE ROW_ID = (SELECT min(ROW_ID) FROM emp4 b WHERE a.empcode = b.empcode );
Either you can use MIN or MAX function , if you use MAX , then you have to use less than (<) symbol. In where condition of the subquery , you can combine more than one condition.
To delete duplicate records , run the following query
delete FROM emp4 a WHERE ROW_ID > (SELECT min(ROW_ID) FROM emp4 b WHERE a.empcode = b.empcode );
How does it works ? . ROW_ID column is unique because the values are generated by system. There is no chance of duplication. Suppose empcode 455 has five records but those 5 records should have 5 unique ROW_IDs, among those 5 ROW_IDs , one value must be minimum and one value must be maximum . So we can either keep MIN or MAX Row_id . Other rows we can delete.

Method I I :
Create a temporary table from the original table , the copy the distinct values from the original table to temp table. Now drop the orignal table and rename the temporary table to Original table
Generate ddl from the original table . and create a temp table
-insert into temp (empcode, empname, designation, salary, email) select distinct empcode, empname, designation, salary, email from employee
Drop table emploee
Rename table temp to employee

Note in the above select query distinct keyword is used for multiple columns.
-For DB2 UDB for iSeries SQL , you can use the DDL " create table Temp as (select distinct empcode , empname, designation, salary , email from employee )"

0 comments:

Post a Comment