Home > Back-end >  Difference between rows in long format for R based on other column variables
Difference between rows in long format for R based on other column variables

Time:01-11

I have an R dataframe such as:

df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"),4), 
                        Variable = c(rep("X", 4), rep("Y", 4)),
                        Value = c(3, 5, 6, 6, 3, 8, 3, 6))

  ID Condition Variable Value
1  1         A        X     3
2  1         B        X     5
3  2         A        X     6
4  2         B        X     6
5  1         A        Y     3
6  1         B        Y     8
7  2         A        Y     3
8  2         B        Y     6

I want to obtain the difference between each value of Condition (A - B) for each Variable and ID while keeping the long format. That would mean the value must appear every two rows, like this:

 ID Condition Variable Value diff_value
1  1         A        X     3         -2
2  1         B        X     5         -2
3  2         A        X     6          0
4  2         B        X     6          0
5  1         A        Y     3         -5
6  1         B        Y     8         -5
7  2         A        Y     3         -3
8  2         B        Y     6         -3

So far, I managed to do something relatively similar using the dplyr package, but it does not work if I want to maintain the long format:

df_long_example %>%
  group_by(Variable, ID) %>%
  mutate(diff_value = lag(Value, default = Value[1]) -Value)

# A tibble: 8 x 5
# Groups:   Variable, ID [4]
     ID Condition Variable Value diff_value
  <dbl> <chr>     <chr>    <dbl>      <dbl>
1     1 A         X            3          0
2     1 B         X            5         -2
3     2 A         X            6          0
4     2 B         X            6          0
5     1 A         Y            3          0
6     1 B         Y            8         -5
7     2 A         Y            3          0
8     2 B         Y            6         -3

CodePudding user response:

You dont need to create lag variable just use Value[Condition == "A"] - Value[Condition == "B"] as below

df %>% 
  group_by(ID, Variable) %>%
  mutate(Value, diff_value = Value[Condition == "A"] - Value[Condition == "B"])

# A tibble: 8 x 5
# Groups:   ID, Variable [4]
     ID Condition Variable Value diff_value
  <dbl> <chr>     <chr>    <dbl>      <dbl>
1     1 A         X            3         -2
2     1 B         X            5         -2
3     2 A         X            6          0
4     2 B         X            6          0
5     1 A         Y            3         -5
6     1 B         Y            8         -5
7     2 A         Y            3         -3
8     2 B         Y            6         -3

CodePudding user response:

You don't have to use lag, but use diff:

df %>% 
  group_by(Variable,ID) %>% 
  mutate(diff = -diff(Value))

Output:

# A tibble: 8 x 5
# Groups:   Variable, ID [4]
     ID Condition Variable Value  diff
  <dbl> <chr>     <chr>    <dbl> <dbl>
1     1 A         X            3    -2
2     1 B         X            5    -2
3     2 A         X            6     0
4     2 B         X            6     0
5     1 A         Y            3    -5
6     1 B         Y            8    -5
7     2 A         Y            3    -3
8     2 B         Y            6    -3

CodePudding user response:

This should work:

# Step one: create a new column of df, where we store the "Value" we need
# to add/subtract, as you required (same "ID", same "Variable", different
# "Condtion").

temp.fun = function(x, dta)
{
  # Given a row x of dta, this function selects the value corresponding to the row
  # with same "ID", same "Variable" and different "Condition".
  
  # Notice that if "Condition" is not binary, we need to generalize this function.
  
  # Notice also that this function is super specific to your case, and that it has
  # been thought to be used within apply().
  
  # INPUTS:
  #   - x, a row of a data frame.
  #   - dta, the data frame (df, in your case).
  
  # OUTPUT:
  #   - temp.corresponding, "Value" you want for each row.
  
  
  # Saving information.
  temp.id = as.numeric(x["ID"])
  temp.condition = as.character(x["Condition"])
  temp.variable = as.character(x["Variable"])
  
  # Index for selecting row.
  temp.row = dta$ID == temp.id & dta$Condition != temp.condition & dta$Variable == temp.variable
  
  # Selecting "Value".
  temp.corresponding = dta$Value[temp.row]
  
  return(temp.corresponding)
}

df$corr_value = apply(df, MARGIN = 1, FUN = temp.fun, dta = df)

# Step two: add/subtract to create the column "diff_value".
# Key: if "Condition" equals "A", we subtract, otherwise we add.

df$diff_value = NA
df$diff_value[df$Condition == "A"] = df$Value[df$Condition == "A"] - df$corr_value[df$Condition == "A"]
df$diff_value[df$Condition == "B"] = df$corr_value[df$Condition == "B"] - df$Value[df$Condition == "B"]

Notice that this solution just fits the specifics of your problem, and may be neither elegant nor efficient.

I wrote comments in the code to explain how this solution works. Anyway, the idea is to first write the function temp.fun(), which operates on single rows: for each row we pass, it finds df$Value of the row satisfying the criteria you asked (same ID, same Variable, different Condition). Then, we use apply() to pass all rows in temp.fun(), thus creating a new column in df storing the Value mentioned above.

We are now ready to compute df$diff_value. First, we initialize space, creating a column on NA. Then, we perform the operations. Be careful: because of the specifics of the problem, if Condition equals A, we want to subtract values, whether when Condition equals B we are going to add values. That is, in the former case we compute df$Value - df$corr_value, and in the latter we compute df$corr_value- df$Value.

Final warning: if Condition is not binary, this solution must be generalized in order to work.

  •  Tags:  
  • Related