In this data:
df <- structure(list(Utterance = c("how old's your mom¿",
"how old's your mom¿",
"how old's your mom¿",
"how old's your mom¿",
"how old's your mom¿",
"how old's your mom¿",
"(0.855)", "(0.855)", "(0.855)", "eh six:ty:::-one=", "eh six:ty:::-one=",
"eh six:ty:::-one=", "[when was] that¿=", "[when was] that¿=",
"[when was] that¿=", "[when was] that¿=", "[yes] (0.163) =!this! was on °Wednesday°",
"[yes] (0.163) =!this! was on °Wednesday°", "[yes] (0.163) =!this! was on °Wednesday°",
"[yes] (0.163) =!this! was on °Wednesday°"),
G_by = c("A","A", "A", "C", "C", "C", "B", "B", "B", "A", "B", "C", "A", "A",
"B", "C", "A", "A", "A", "B"),
G_to = c("B", "*", "C", "A", "A", "B", "C", "A", "C", "C", "A", "B", "*", "C", "A", "A", "C", "*",
"C", "A")), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
I need to count the members of G_to within each group defined by Utterance and G_by, subject to these conditions:
- discount *
- discount a character if it is the same as the character immediately before it, e.g., CC
- discount a character if it is the same as the last character before an intervening *, e.g., C*C
So far I have only managed to discount *:
df %>%
  # group:
  group_by(Utterance, G_by) %>%
  # create new column:
  mutate(
    N_G = sum(G_to %in% c("A", "B", "C")))
The result I am after is this:
# A tibble: 20 × 4
# Groups:   Utterance, G_by [11]
   Utterance                                G_by  G_to    N_G
   <chr>                                    <chr> <chr> <int>
 1 how old's your mom¿                      A     B         2
 2 how old's your mom¿                      A     *         2
 3 how old's your mom¿                      A     C         2
 4 how old's your mom¿                      C     A         2
 5 how old's your mom¿                      C     A         2
 6 how old's your mom¿                      C     B         2
 7 (0.855)                                  B     C         3
 8 (0.855)                                  B     A         3
 9 (0.855)                                  B     C         3
10 eh six:ty:::-one=                        A     C         1
11 eh six:ty:::-one=                        B     A         1
12 eh six:ty:::-one=                        C     B         1
13 [when was] that¿=                        A     *         1
14 [when was] that¿=                        A     C         1
15 [when was] that¿=                        B     A         1
16 [when was] that¿=                        C     A         1
17 [yes] (0.163) =!this! was on °Wednesday° A     C         1
18 [yes] (0.163) =!this! was on °Wednesday° A     *         1
19 [yes] (0.163) =!this! was on °Wednesday° A     C         1
20 [yes] (0.163) =!this! was on °Wednesday° B     A         1
How can that be obtained?
CodePudding user response:
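Replace the * values with NA, fill each NA from its neighbouring value, then use rleid (from data.table) to number the runs of identical values and n_distinct to count those runs within each group:
library(dplyr)
library(data.table)  # rleid()
library(tidyr)       # fill()

df %>%
  group_by(Utterance, G_by) %>%
  # treat "*" as missing
  mutate(N_G = na_if(G_to, "*")) %>%
  # fill each NA from the previous value, or from the next one if it comes first
  fill(N_G, .direction = 'downup') %>%
  # number the runs of identical values and count them per group
  mutate(N_G = n_distinct(rleid(N_G))) %>%
  ungroup()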
-output
# A tibble: 20 × 4
   Utterance                                G_by  G_to    N_G
   <chr>                                    <chr> <chr> <int>
 1 how old's your mom¿                      A     B         2
 2 how old's your mom¿                      A     *         2
 3 how old's your mom¿                      A     C         2
 4 how old's your mom¿                      C     A         2
 5 how old's your mom¿                      C     A         2
 6 how old's your mom¿                      C     B         2
 7 (0.855)                                  B     C         3
 8 (0.855)                                  B     A         3
 9 (0.855)                                  B     C         3
10 eh six:ty:::-one=                        A     C         1
11 eh six:ty:::-one=                        B     A         1
12 eh six:ty:::-one=                        C     B         1
13 [when was] that¿=                        A     *         1
14 [when was] that¿=                        A     C         1
15 [when was] that¿=                        B     A         1
16 [when was] that¿=                        C     A         1
17 [yes] (0.163) =!this! was on °Wednesday° A     C         1
18 [yes] (0.163) =!this! was on °Wednesday° A     *         1
19 [yes] (0.163) =!this! was on °Wednesday° A     C         1
20 [yes] (0.163) =!this! was on °Wednesday° B     A         1
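As a cross-check of the counting rule, here is a minimal base R sketch that should be equivalent for this data: drop the * values first, then count the runs of identical values with rle. The helper name count_runs is hypothetical and not part of the answer above. It is tried here on G_to values taken from four of the groups in the question.

```r
# Hypothetical helper (an assumption, not from the original answer):
# count runs of identical values after dropping the "*" placeholders.
count_runs <- function(x) {
  kept <- x[x != "*"]        # discount "*"
  length(rle(kept)$lengths)  # one entry per run of identical values
}

count_runs(c("B", "*", "C"))  # "how old's your mom¿", G_by = A
count_runs(c("A", "A", "B"))  # "how old's your mom¿", G_by = C
count_runs(c("C", "A", "C"))  # "(0.855)", G_by = B
count_runs(c("C", "*", "C"))  # "[yes] ...", G_by = A
```

These calls return 2, 2, 3 and 1, matching N_G for those groups. Assuming df as defined in the question, the helper could be used inside the grouped pipeline as mutate(N_G = count_runs(G_to)). One difference to be aware of: a group consisting only of * would give 0 here, whereas the fill/rleid approach would give 1.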
