Home > Software design >  How can I delete duplicates from a table based on a location id and name
How can I delete duplicates from a table based on a location id and name

Time:01-15

columns are: Name, Location_Name, Location_ID

I want to check Names and Location_ID, and if there are two that are the same I want to delete/remove that row.

For example: If Name John Fox at location id 4 shows up two or more times I want to just keep one.

After that, I want to count how many people per location. Location_Name1: 45 Location_Name2: 66 Etc... The location name and Location Id are related. Sample data

Code I tried

CodePudding user response:

Deleting duplicates is a common pattern. You apply a sequence number to all of the duplicates, then delete any that aren't first. In this case I order arbitrarily but you can choose to keep the one with the lowest PK value or that was modified last or whatever - just update the ORDER BY to sort the one you want to keep first.

;WITH cte AS 
(
  SELECT *, rn = ROW_NUMBER() OVER 
    (PARTITION BY Name, Location_ID ORDER BY @@SPID)
  FROM dbo.TableName
)
DELETE cte WHERE rn > 1;

Then to count, assuming there can't be two different Location_IDs for a given Location_Name (this is why schema sample data is so helpful):

SELECT Location_Name, People = COUNT(Name)
  FROM dbo.TableName
  GROUP BY Location_Name;

If Location_Name and Location_ID are not tightly coupled (e.g. there could be Location_ID = 4, Location_Name = Place 1 and Location_ID = 4, Location_Name = Place 2 then you're going to have to define how to determine which place to display if you group by Location_ID, or admit that perhaps one of those columns is meaningless.

If Location_Name and Location_ID are tightly coupled, they shouldn't both be stored in this table. You should have a lookup/dimension table that stores both of those columns (once!) and you use the smallest data type as the key you record in the fact table (where it is repeated over and over again). This has several benefits:

  • Scanning the bigger table is faster, because it's not as wide
  • Storage is reduced because you're not repeating long strings over and over and over again
  • Aggregation is clearer and you can join to get names after aggregation, which will be faster
  • If you need to change a location's name, you only need to change it in exactly one place

CodePudding user response:

Sample code

CREATE TABLE People_Location
(
Name VARCHAR(30) NOT NULL,
Location_Name VARCHAR(30) NOT NULL,
Location_ID INT NOT NULL,
)

INSERT INTO People_Location
VALUES
('John Fox', 'Moon', 4),
('John Bear', 'Moon', 4),
('Peter', 'Saturn', 5),
('John Fox', 'Moon', 4),
('Micheal', 'Sun', 1),
('Jackie', 'Sun', 1),
('Tito', 'Sun', 1),
('Peter', 'Saturn', 5)

Get location and count

select Location_Name, count(1)
from 
(select Name, Location_Name,
   rn = ROW_NUMBER() OVER (PARTITION BY Name, Location_ID ORDER BY Name)
   from People_Location
 ) t
 where rn = 1
group by  Location_Name

Result

Moon  2
Saturn 1
Sun  3
  •  Tags:  
  • Related