Assuming this data:
library(tibble)
df <- tibble(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i = seq_along(word),
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)
I want to calculate the mean of column i per year for all words in column word except A and banana, and give those pooled rows the name out. The output would be:
  word       i  year
  <chr>  <dbl> <dbl>
1 out      2.5  2000
2 out      3.5  2001
3 banana   3    2000
4 A        6    2000
5 A        7    2001
6 A        8    2002
CodePudding user response:
We can group by 'year' and by 'word' after replacing the elements that are not 'A' or 'banana' with 'out', take the mean of 'i', and, if necessary, order the rows with arrange() on a factor with custom levels.
library(dplyr)
df %>%
  group_by(year, word = replace(word, !(word %in% c('A', 'banana')), 'out')) %>%
  summarise(i = mean(i), .groups = 'drop') %>%
  arrange(factor(word, levels = c('out', 'banana', 'A'))) %>%
  select(names(df))
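To see what the replace() step does on its own, before any grouping, it can be run on the word vector alone. This is a minimal base R sketch; the names keep and pooled are only illustrative and do not appear in the answer above.

```r
word <- c("apple", "apple", "banana", "pear", "pear", "A", "A", "A")
keep <- c("A", "banana")  # words that keep their own label

# every element that is not in `keep` is overwritten with "out"
pooled <- replace(word, !(word %in% keep), "out")
pooled
# [1] "out"    "out"    "banana" "out"    "out"    "A"      "A"      "A"
```

Grouping by this relabelled vector is what puts apple and pear into the same 'out' group per year.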
CodePudding user response:
First we replace every word that is not banana or A with out using ifelse(), then group and summarise. Finally we arrange:
library(dplyr)
df %>%
  mutate(word = ifelse(word != "banana" & word != "A", "out", word)) %>%
  group_by(year, word) %>%
  summarise(mean_i = mean(i)) %>%
  arrange(desc(word))
year word mean_i
<dbl> <chr> <dbl>
1 2000 out 2.5
2 2001 out 3.5
3 2000 banana 3
4 2000 A 6
5 2001 A 7
6 2002 A 8
CodePudding user response:
Using data.table:
library(data.table)
dt <- as.data.table(df)
# assign name "mean" to the new column with average (mean) values
> dt[!(word %in% c('banana', 'A') ), .(mean = mean(i)), by = .(year, word)]
year word mean
1: 2000 apple 1
2: 2001 apple 2
3: 2000 pear 4
4: 2001 pear 5
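The call above drops 'A' and 'banana' and averages the remaining words separately. If the pooled 'out' rows from the question are wanted instead, one possible data.table sketch is to relabel inside by. The data are rebuilt here so the block runs on its own; keep and res are illustrative names, and fifelse() is data.table's typed ifelse.

```r
library(data.table)

dt <- data.table(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i    = 1:8,
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002)
)

keep <- c("A", "banana")  # words that keep their own label

# relabel on the fly inside `by`, then average i within each (word, year) group
res <- dt[, .(i = mean(i)),
          by = .(word = fifelse(word %in% keep, word, "out"), year)]
res
```

Groups come back in order of first appearance, so the two 'out' rows are listed first for this data.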
CodePudding user response:
Base R solution
We first split the data by word and, within word, by year, then take the mean of i in each piece. The means for all words except 'A' and 'apple' are stored in out. Finally, we build a data.frame with columns for word, year, and the means, using regular expressions on the names of out to recover the word and the year.
means <- sapply(split(df, f = ~ word + year), \(x) mean(x$i))
# drop every group whose name starts with "A" or "a" (i.e. 'A' and 'apple')
out <- means[!grepl('^[Aa]', names(means))]
df_new <- data.frame(
  word = gsub('\\.[0-9]*', '', names(out)),
  year = gsub('[^0-9]', '', names(out)),
  mean_i = out
)
rownames(df_new) <- NULL
Output
> df_new
word year mean_i
1 banana 2000 3
2 pear 2000 4
3 banana 2001 NaN
4 pear 2001 5
5 banana 2002 NaN
6 pear 2002 NaN
If you want to omit rows containing missing values of mean_i, then you could use something like df_new[rowSums(apply(df_new, 2, is.na)) < 1, ]
> df_new[rowSums(apply(df_new, 2, is.na)) < 1, ]
word year mean_i
1 banana 2000 3
2 pear 2000 4
4 pear 2001 5
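Note: if you use a version of R < 4.1.0, use function(x) instead of \(x). The formula form of split() also requires R 4.1.0, so on older versions use split(df, list(df$word, df$year)) instead.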
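The approach above removes groups rather than pooling them. If the pooled 'out' rows from the question are the goal, a shorter base R sketch is to relabel first and then use aggregate(). The data are rebuilt as a plain data.frame so the block is self-contained; keep and res are illustrative names.

```r
df <- data.frame(
  word = c("apple", "apple", "banana", "pear", "pear", "A", "A", "A"),
  i    = 1:8,
  year = c(2000, 2001, 2000, 2000, 2001, 2000, 2001, 2002),
  stringsAsFactors = FALSE
)

keep <- c("A", "banana")  # words that keep their own label

# pool everything else under "out", then average i per (word, year)
df$word <- ifelse(df$word %in% keep, df$word, "out")
res <- aggregate(i ~ word + year, data = df, FUN = mean)
res
```

Unlike the split() version, aggregate() only returns combinations that occur in the data, so no NaN rows need to be filtered afterwards.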
CodePudding user response:
Do you mean something like:
df %>%
  filter(!(word %in% c('A', 'banana'))) %>%
  group_by(year, word) %>%
  summarize(Mean = mean(i))
Output:
# A tibble: 4 × 3
# Groups: year [2]
year word Mean
<dbl> <chr> <dbl>
1 2000 apple 1
2 2000 pear 4
3 2001 apple 2
4 2001 pear 5
Update: Inspired by akrun I slightly improved the filter.
