I have tried searching for an answer to this question but it continues to elude me! I am working with crime data where each row refers to a specific crime incident. There is a variable for suspect ID, and a variable for victim ID. These ID numbers are consistent across the two columns (in other words, if a row contains the ID 424 in the victim ID column, and a separate row contains the ID 424 in the suspect column, I know that the same person was listed as a victim in the first crime and as a suspect in the second crime).
I want to create two new variables: one which counts the number of times the victim (in a particular crime incident) has been recorded as a suspect (in the dataset as a whole), and one which counts the number of times the suspect (in a particular crime incident) has been recorded as a victim (in the dataset as a whole).
Here's a simplified version of my data:
| s.uid | v.uid | |
|---|---|---|
| 1 | 1 | 9 |
| 2 | 2 | 8 |
| 3 | 3 | 2 |
| 4 | 4 | 2 |
| 5 | 5 | 2 |
| 6 | NA | 7 |
| 7 | 5 | 6 |
| 8 | 9 | 5 |
And here is what I want to create:
| s.uid | v.uid | s.in.v | v.in.s | |
|---|---|---|---|---|
| 1 | 1 | 9 | 0 | 1 |
| 2 | 2 | 8 | 3 | 0 |
| 3 | 3 | 2 | 0 | 1 |
| 4 | 4 | 2 | 0 | 1 |
| 5 | 5 | 2 | 1 | 1 |
| 6 | NA | 7 | NA | 0 |
| 7 | 5 | 6 | 1 | 0 |
| 8 | 9 | 5 | 1 | 2 |
Note that, where there is an NA, I would like the NA to be preserved. I'm currently trying to work in tidyverse and piping where possible, so I would prefer answers in that kind of format, but I'm open to any solution!
CodePudding user response:
Using dplyr:
dat %>%
group_by(s.uid) %>%
mutate(s.in.v = sum(dat$v.uid %in% s.uid)) %>%
group_by(v.uid) %>%
mutate(v.in.s = sum(dat$s.uid %in% v.uid))
# A tibble: 8 × 4
# Groups: v.uid [6]
s.uid v.uid s.in.v v.in.s
<int> <int> <int> <int>
1 1 9 0 1
2 2 8 3 0
3 3 2 0 1
4 4 2 0 1
5 5 2 1 1
6 NA 7 0 0
7 5 6 1 0
8 9 5 1 2
CodePudding user response:
First, a reprex of your data:
library(tidyverse)
# Replica of your data:
s.uid <- c(1:5, NA, 5, 9)
v.uid <- c(9, 8, 2, 2, 2, 7, 6, 5)
DF <- tibble(s.uid, v.uid)
Custom function to use:
# function to check how many times "a" (a length 1 atomic vector) occurs in "b":
f <- function(a, b) {
a <- as.character(a)
# make a lookup table a.k.a dictionary of values in b:
b_freq <- table(b, useNA = "always")
# if a is in b, return it's frequency:
if (a %in% names(b_freq)) {
return(b_freq[a])
}
# else (ie. a is not in b) return 0:
return(0)
}
# vectorise that, enabling intake of any length of "a":
ff <- function(a, b) {
purrr::map_dbl(.x = a, .f = f, b = b)
}
Finally:
DF |>
mutate(
s_in_v = ff(s.uid, v.uid),
v_in_s = ff(v.uid, s.uid)
)
Results in:
#> # A tibble: 8 × 4
#> s.uid v.uid s_in_v v_in_s
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 9 0 1
#> 2 2 8 3 0
#> 3 3 2 0 1
#> 4 4 2 0 1
#> 5 5 2 1 1
#> 6 NA 7 NA 0
#> 7 5 6 1 0
#> 8 9 5 1 2
