New column which counts the number of times a value in a specific row of one column appears in anoth-CodePudding

I have tried searching for an answer to this question but it continues to elude me! I am working with crime data where each row refers to a specific crime incident. There is a variable for suspect ID, and a variable for victim ID. These ID numbers are consistent across the two columns (in other words, if a row contains the ID 424 in the victim ID column, and a separate row contains the ID 424 in the suspect column, I know that the same person was listed as a victim in the first crime and as a suspect in the second crime).

I want to create two new variables: one which counts the number of times the victim (in a particular crime incident) has been recorded as a suspect (in the dataset as a whole), and one which counts the number of times the suspect (in a particular crime incident) has been recorded as a victim (in the dataset as a whole).

Here's a simplified version of my data:

	s.uid	v.uid
1	1	9
2	2	8
3	3	2
4	4	2
5	5	2
6	NA	7
7	5	6
8	9	5

And here is what I want to create:

	s.uid	v.uid	s.in.v	v.in.s
1	1	9	0	1
2	2	8	3	0
3	3	2	0	1
4	4	2	0	1
5	5	2	1	1
6	NA	7	NA	0
7	5	6	1	0
8	9	5	1	2

Note that, where there is an NA, I would like the NA to be preserved. I'm currently trying to work in tidyverse and piping where possible, so I would prefer answers in that kind of format, but I'm open to any solution!

CodePudding user response：

Using dplyr:

dat %>% 
    group_by(s.uid) %>% 
    mutate(s.in.v = sum(dat$v.uid %in% s.uid)) %>% 
    group_by(v.uid) %>% 
    mutate(v.in.s = sum(dat$s.uid %in% v.uid))

# A tibble: 8 × 4
# Groups:   v.uid [6]
  s.uid v.uid s.in.v v.in.s
  <int> <int>  <int>  <int>
1     1     9      0      1
2     2     8      3      0
3     3     2      0      1
4     4     2      0      1
5     5     2      1      1
6    NA     7      0      0
7     5     6      1      0
8     9     5      1      2

CodePudding user response：

First, a reprex of your data:

library(tidyverse)

# Replica of your data:
s.uid <- c(1:5, NA, 5, 9)
v.uid <- c(9, 8, 2, 2, 2, 7, 6, 5)

DF <- tibble(s.uid, v.uid)

Custom function to use:

# function to check how many times "a" (a length 1 atomic vector) occurs in "b":
f <- function(a, b) {
  a <- as.character(a)
  
  # make a lookup table a.k.a dictionary of values in b:
  b_freq <- table(b, useNA = "always")
  
  # if a is in b, return it's frequency:
  if (a %in% names(b_freq)) {
    return(b_freq[a])
  }
  
  # else (ie. a is not in b) return 0:
  return(0)
}

# vectorise that, enabling intake of any length of "a":
ff <- function(a, b) {
  purrr::map_dbl(.x = a, .f = f, b = b)
}

Finally:

DF |> 
  mutate(
    s_in_v = ff(s.uid, v.uid), 
    v_in_s = ff(v.uid, s.uid)
  )

Results in:

#> # A tibble: 8 × 4
#>   s.uid v.uid s_in_v v_in_s
#>   <dbl> <dbl>  <dbl>  <dbl>
#> 1     1     9      0      1
#> 2     2     8      3      0
#> 3     3     2      0      1
#> 4     4     2      0      1
#> 5     5     2      1      1
#> 6    NA     7     NA      0
#> 7     5     6      1      0
#> 8     9     5      1      2