Home > Back-end >  Identify groups of observations with identical values for sets of variables
Identify groups of observations with identical values for sets of variables

Time:01-25

I have a data.frame like this:

example.df <- data.frame(grp = rep(1:3, each = 2),
                         attr.a = c(1, 4, 1, 7, 4, 1),
                         attr.b = c(2, 5, 2, 8, 5, 2),
                         attr.c = c(3, 6, 3, 9, 6, 3))

> example.df
  grp attr.a attr.b attr.c
1   1      1      2      3
2   1      4      5      6
3   2      1      2      3
4   2      7      8      9
5   3      4      5      6
6   3      1      2      3

I want to identify if any groups (grp) contain identical sets of observations for attr.a, attr.b and attr.c. In the example, grp 1 & 3 contain identical sets of observations (but, the sets of observations are not ordered in the same way). Also in the example, grp 2 contains only one set of observations that are identical to grp 1 & 3. It should only be identified if all sets of observations are identical.

It isn't important how the identical grps are identified, as long as it is clear, but could look something like this:

  grp attr.a attr.b attr.c ident
1   1      1      2      3     3
2   1      4      5      6     3
3   2      1      2      3     0
4   2      7      8      9     0
5   3      4      5      6     1
6   3      1      2      3     1

Thanks!

CodePudding user response:

We may unite the 'attr' columns to a single column, grouped by 'grp', sort the pasted elements to create a single string, use match on the attr and grouped by the index, reverse the 'grp' if there are more than one element, do a join with the original dataset

library(dplyr)
library(stringr)
library(tidyr)
example.df %>%
   unite(attr, starts_with('attr'), sep="") %>% 
   group_by(grp) %>% 
   summarise(attr = str_c(sort(attr), collapse = "")) %>%
   mutate(new = match(attr, unique(attr))) %>% 
   group_by(new) %>%
   mutate(ident = if(n() == 2) rev(grp) else 0) %>% 
   ungroup %>%
   select(-new, -attr) %>%
   left_join(example.df, .)

-output

   grp attr.a attr.b attr.c ident
1   1      1      2      3     3
2   1      4      5      6     3
3   2      1      2      3     0
4   2      7      8      9     0
5   3      4      5      6     1
6   3      1      2      3     1
  •  Tags:  
  • Related