I have this type of data, with frequency data and position data grouped by rowid:
df
rowid word f position
1 2 i 700 1
2 2 'm 600 2
3 2 fine 1 3
4 3 how 400 1
5 3 's 500 2
6 3 the 700 3
7 3 weather 20 4
8 4 it 390 1
9 4 's 500 2
10 4 really 177 3
11 4 very 200 4
12 4 cold 35 5
13 5 i 700 1
14 5 love 199 2
15 5 you 400 3
The task I'm facing seems simple: in those rowids where there are more than 3 positions, I need to replace the frequencies of all middle positions with their average. The following approach works but seems over-convoluted, so I'm almost certain there will be a more straightforward dplyrway to get the desired output:
df %>%
group_by(rowid) %>%
# filter for 'middle' positions:
filter(position != first(position) & position != last(position)) %>%
# summarise:
summarize(across(position),
# create average frequency:
f_middle_position = mean(f, na.rm = TRUE),
# concatenate words:
word = str_c(word, collapse = " ")
) %>%
filter(!duplicated(f_middle_position)) %>%
# join with df:
left_join(df, ., by = c("rowid", "position")) %>%
# remove rows other than #1,#2, and last:
group_by(rowid) %>%
# create row count:
mutate(rn = row_number()) %>%
# filter first, second, and last row per group:
filter(rn %in% c(1, 2, last(rn))) %>%
# transfer frequencies for middle positions:
mutate(f = ifelse(is.na(f_middle_position), f, f_middle_position)) %>%
# make more changes:
mutate(
# change position labels:
position = ifelse(position == first(position), 1,
ifelse(position == last(position), 2, 1.5)),
# update word:
word = ifelse(is.na(word.y), word.x, word.y)
) %>%
# remove obsolete variables:
select(-c(f_middle_position, word.y, word.x,rn))
A tibble: 12 × 4
# Groups: rowid [4]
rowid f position word
<dbl> <dbl> <dbl> <chr>
1 2 700 1 i
2 2 600 1.5 'm
3 2 1 2 fine
4 3 400 1 how
5 3 600 1.5 's the
6 3 20 2 weather
7 4 390 1 it
8 4 292. 1.5 's really very
9 4 35 2 cold
10 5 700 1 i
11 5 199 1.5 love
12 5 400 2 you
How can this result be obtained in a more concise way in dplyr and, preferably without the left_join, which causes problems with my actual data?
Data:
df <- data.frame(
rowid = c(2,2,2,3,3,3,3,4,4,4,4,4,5,5,5),
word = c("i","'m","fine",
"how","'s","the","weather",
"it","'s","really", "very","cold",
"i","love","you"),
f = c(700,600,1,
400,500,700,20,
390,500,177,200,35,
700,199,400),
position = c(1,2,3,
1,2,3,4,
1,2,3,4,5,
1,2,3)
)
CodePudding user response:
You can create a group variable pos that marks the first row with 1, the middle with 1.5, and the last with 2. Then group the data by rowid and pos and apply mean() and paste() on f and word respectively.
library(dplyr)
df %>%
group_by(rowid) %>%
mutate(pos = case_when(position == 1 ~ 1, position == n() ~ 2, TRUE ~ 1.5)) %>%
group_by(rowid, pos) %>%
summarise(f = mean(f), word = paste(word, collapse = ' '), .groups = 'drop')
# # A tibble: 12 × 4
# rowid pos f word
# <dbl> <dbl> <dbl> <chr>
# 1 2 1 700 i
# 2 2 1.5 600 'm
# 3 2 2 1 fine
# 4 3 1 400 how
# 5 3 1.5 600 's the
# 6 3 2 20 weather
# 7 4 1 390 it
# 8 4 1.5 292. 's really very
# 9 4 2 35 cold
# 10 5 1 700 i
# 11 5 1.5 199 love
# 12 5 2 400 you
