How to update rows and then delete all but one. Data deduplication-CodePudding

Given a table like this:

ID	A	B	C	D
01	3	2	1	0
01	5	2	1	0
01	0	2	1	0
00	4	8	1	1
00	4	8	1	1
00	4	8	1	1
03	6	4	0	0
03	0	2	0	0
03	6	4	0	0

How could I use SQL (T-SQL/SQL Server Mgt Studio), R, or Python to do the following in this order:

For each unique ID, update col A with the maximum value of A from all rows and then delete all rows except the one with the highest value in col B. I need to end up with only one row for a given ID.

Written another way; I have duplicate data and need to delete duplicate rows but only after ensuring I have the maximum value of col A from any duplicates applied to the row with the maximum value of col B. I want to only keep one row for a given ID but the row I want to keep is the one with the max col B. If there are multiple identical ID & col B rows, delete all but one.

CodePudding user response：

In R:

library(dplyr)
your_data %>%
  group_by(ID) %>%
  mutate(A = max(A)) %>%
  slice(which.max(B)) %>%
  ungroup()

Or with data.table:

library(data.table)
setDT(your_data)
your_data[, A := max(A), by = ID][, .SD[which.max(B)], by = ID]

CodePudding user response：

In Python:

(df.assign(A=df.groupby('ID')['A'].transform(max))
   .sort_values(by=['B'], ascending=False)
   .drop_duplicates(subset=['ID'])
)

CodePudding user response：

In SQL using CTE

with cte as (
select id, max(a) a from your_table_name group by id),
cteb as (
select row_number() over (partition by id order by b desc) sn , 
id, b, c, d from your_table_name)

select id, a, b, c, d from cte  a inner join cteb b using(id) 
where sn =1 order by cast(id as integer)