Given a table like this:
| ID | A | B | C | D |
|---|---|---|---|---|
| 01 | 3 | 2 | 1 | 0 |
| 01 | 5 | 2 | 1 | 0 |
| 01 | 0 | 2 | 1 | 0 |
| 00 | 4 | 8 | 1 | 1 |
| 00 | 4 | 8 | 1 | 1 |
| 00 | 4 | 8 | 1 | 1 |
| 03 | 6 | 4 | 0 | 0 |
| 03 | 0 | 2 | 0 | 0 |
| 03 | 6 | 4 | 0 | 0 |
How could I use SQL (T-SQL/SQL Server Mgt Studio), R, or Python to do the following in this order:
For each unique ID, update col A with the maximum value of A from all rows and then delete all rows except the one with the highest value in col B. I need to end up with only one row for a given ID.
Written another way; I have duplicate data and need to delete duplicate rows but only after ensuring I have the maximum value of col A from any duplicates applied to the row with the maximum value of col B. I want to only keep one row for a given ID but the row I want to keep is the one with the max col B. If there are multiple identical ID & col B rows, delete all but one.
CodePudding user response:
In R:
library(dplyr)
your_data %>%
group_by(ID) %>%
mutate(A = max(A)) %>%
slice(which.max(B)) %>%
ungroup()
Or with data.table:
library(data.table)
setDT(your_data)
your_data[, A := max(A), by = ID][, .SD[which.max(B)], by = ID]
CodePudding user response:
In Python:
(df.assign(A=df.groupby('ID')['A'].transform(max))
.sort_values(by=['B'], ascending=False)
.drop_duplicates(subset=['ID'])
)
CodePudding user response:
In SQL using CTE
with cte as (
select id, max(a) a from your_table_name group by id),
cteb as (
select row_number() over (partition by id order by b desc) sn ,
id, b, c, d from your_table_name)
select id, a, b, c, d from cte a inner join cteb b using(id)
where sn =1 order by cast(id as integer)
