I have an R dataframe such as:
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"),4),
Variable = c(rep("X", 4), rep("Y", 4)),
Value = c(3, 5, 6, 6, 3, 8, 3, 6))
ID Condition Variable Value
1 1 A X 3
2 1 B X 5
3 2 A X 6
4 2 B X 6
5 1 A Y 3
6 1 B Y 8
7 2 A Y 3
8 2 B Y 6
I want to obtain the difference between each value of Condition (A - B) for each Variable and ID while keeping the long format. That would mean the value must appear every two rows, like this:
ID Condition Variable Value diff_value
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
So far, I managed to do something relatively similar using the dplyr package, but it does not work if I want to maintain the long format:
df_long_example %>%
group_by(Variable, ID) %>%
mutate(diff_value = lag(Value, default = Value[1]) -Value)
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 0
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 0
6 1 B Y 8 -5
7 2 A Y 3 0
8 2 B Y 6 -3
CodePudding user response:
You dont need to create lag variable just use Value[Condition == "A"] - Value[Condition == "B"] as below
df %>%
group_by(ID, Variable) %>%
mutate(Value, diff_value = Value[Condition == "A"] - Value[Condition == "B"])
# A tibble: 8 x 5
# Groups: ID, Variable [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
CodePudding user response:
You don't have to use lag, but use diff:
df %>%
group_by(Variable,ID) %>%
mutate(diff = -diff(Value))
Output:
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
CodePudding user response:
This should work:
# Step one: create a new column of df, where we store the "Value" we need
# to add/subtract, as you required (same "ID", same "Variable", different
# "Condtion").
temp.fun = function(x, dta)
{
# Given a row x of dta, this function selects the value corresponding to the row
# with same "ID", same "Variable" and different "Condition".
# Notice that if "Condition" is not binary, we need to generalize this function.
# Notice also that this function is super specific to your case, and that it has
# been thought to be used within apply().
# INPUTS:
# - x, a row of a data frame.
# - dta, the data frame (df, in your case).
# OUTPUT:
# - temp.corresponding, "Value" you want for each row.
# Saving information.
temp.id = as.numeric(x["ID"])
temp.condition = as.character(x["Condition"])
temp.variable = as.character(x["Variable"])
# Index for selecting row.
temp.row = dta$ID == temp.id & dta$Condition != temp.condition & dta$Variable == temp.variable
# Selecting "Value".
temp.corresponding = dta$Value[temp.row]
return(temp.corresponding)
}
df$corr_value = apply(df, MARGIN = 1, FUN = temp.fun, dta = df)
# Step two: add/subtract to create the column "diff_value".
# Key: if "Condition" equals "A", we subtract, otherwise we add.
df$diff_value = NA
df$diff_value[df$Condition == "A"] = df$Value[df$Condition == "A"] - df$corr_value[df$Condition == "A"]
df$diff_value[df$Condition == "B"] = df$corr_value[df$Condition == "B"] - df$Value[df$Condition == "B"]
Notice that this solution just fits the specifics of your problem, and may be neither elegant nor efficient.
I wrote comments in the code to explain how this solution works. Anyway, the idea is to first write the function temp.fun(), which operates on single rows: for each row we pass, it finds df$Value of the row satisfying the criteria you asked (same ID, same Variable, different Condition). Then, we use apply() to pass all rows in temp.fun(), thus creating a new column in df storing the Value mentioned above.
We are now ready to compute df$diff_value. First, we initialize space, creating a column on NA. Then, we perform the operations. Be careful: because of the specifics of the problem, if Condition equals A, we want to subtract values, whether when Condition equals B we are going to add values. That is, in the former case we compute df$Value - df$corr_value, and in the latter we compute df$corr_value- df$Value.
Final warning: if Condition is not binary, this solution must be generalized in order to work.
