Home > OS >  Updating rows of dataframe with other dataframe column vlaue for each group in R
Updating rows of dataframe with other dataframe column vlaue for each group in R

Time:02-05

I have two dataframe(df1, df2) with different size . The df2 is subset of df1.

Date columns are formatted as date (y.m.d). (Location and location_id have a one to one mapping.)

tl;dr my goal is to use df2’s date to filter any date that comes after that for its specific location. df1

date location location_id y x
22.02.02 A 1 26 78
22.02.02 B 2 45 67
22.02.02 C 3 46 57
22.02.02 D 4 27 89
22.02.01 A 1 37 67
22.02.01 B 2 82 23
22.02.01 C 3 56 68
22.02.01 D 4 67 90
22.01.31 A 1 61 37
22.01.31 B 2 90 65
22.01.31 C 3 43 34
22.01.31 D 4 12 23
22.01.30 A 1 38 48
22.01.30 B 2 57 53
22.01.30 C 3 75 95
22.01.30 D 4 76 19

df2

date location location_id y x
22.02.01 A 1 37 67
22.02.02 B 2 45 67
22.01.30 C 3 75 95
22.01.31 D 4 12 23

In df2 each location corresponds to one and only one date. I want to use this location to date map from df2, and apply a function to df1. This function should set x and y columns to zero (or na) for any date that comes after specified date for that location on df2.

So the output df3 would be like this:

df3

date location location_id y x
22.02.02 A 1
22.02.02 B 2 45 67
22.02.02 C 3
22.02.02 D 4
22.02.01 A 1 37 67
22.02.01 B 2 82 23
22.02.01 C 3
22.02.01 D 4
22.01.31 A 1 61 37
22.01.31 B 2 90 65
22.01.31 C 3
22.01.31 D 4 12 23
22.01.30 A 1 38 48
22.01.30 B 2 57 53
22.01.30 C 3 75 95
22.01.30 D 4 76 19

I do not know how to solve this kind of questions, what is the most efficient way to solve this, is it using dplyr or for loop?! I am approaching this with using dplyr ,groupby and mutate. Something Like this (I do not get result that I want).

Here my code which does not give me the result:

df1 %>%

group_by(location, id_location) %>%

mutate(date =  df2$date, y= df2$y, x = df2$x)

I am new to R and using dplyr.

CodePudding user response:

Here's one approach. First, I make a version of df2 with the dates stored as dates, which'll make it simpler to use them for calculations, and call that date_limits. (It's not strictly necessary here since your date strings' alphabetical sorting will also be chronological, but I think it's good practice.) I don't need the x/y values since they're in df1 already.

library(tidyverse); library(lubridate)
date_limits <- df2 %>%
  mutate(max_date = ymd(date)) %>%
  select(max_date, location, location_id)

Then we can join those dates onto df1 using dplyr::left_join, sort of like vlookup in excel, or merge in base R. It will by default use all the common variables (in this case location and location_id) to bring in the max_date for that location.

Then I change y and x using mutate(across(... so that if the max_date we pulled in is later than the date, change it to NA, otherwise leave it as is.

df1 %>% 
  mutate(date = ymd(date)) %>%
  left_join(date_limits) %>%
  mutate(across(y:x, ~if_else(date > max_date, NA_integer_, .)))

Result

Joining, by = c("location", "location_id")
         date location location_id  y  x   max_date
1  2022-02-02        A           1 NA NA 2022-02-01
2  2022-02-02        B           2 45 67 2022-02-02
3  2022-02-02        C           3 NA NA 2022-01-30
4  2022-02-02        D           4 NA NA 2022-01-31
5  2022-02-01        A           1 37 67 2022-02-01
6  2022-02-01        B           2 82 23 2022-02-02
7  2022-02-01        C           3 NA NA 2022-01-30
8  2022-02-01        D           4 NA NA 2022-01-31
9  2022-01-31        A           1 61 37 2022-02-01
10 2022-01-31        B           2 90 65 2022-02-02
11 2022-01-31        C           3 NA NA 2022-01-30
12 2022-01-31        D           4 12 23 2022-01-31
13 2022-01-30        A           1 38 48 2022-02-01
14 2022-01-30        B           2 57 53 2022-02-02
15 2022-01-30        C           3 75 95 2022-01-30
16 2022-01-30        D           4 76 19 2022-01-31
  •  Tags:  
  • Related