Updating rows of dataframe with other dataframe column vlaue for each group in R-CodePudding

I have two dataframe(df1, df2) with different size . The df2 is subset of df1.

Date columns are formatted as date (y.m.d). (Location and location_id have a one to one mapping.)

tl;dr my goal is to use df2’s date to filter any date that comes after that for its specific location. df1

date	location	location_id	y	x
22.02.02	A	1	26	78
22.02.02	B	2	45	67
22.02.02	C	3	46	57
22.02.02	D	4	27	89
22.02.01	A	1	37	67
22.02.01	B	2	82	23
22.02.01	C	3	56	68
22.02.01	D	4	67	90
22.01.31	A	1	61	37
22.01.31	B	2	90	65
22.01.31	C	3	43	34
22.01.31	D	4	12	23
22.01.30	A	1	38	48
22.01.30	B	2	57	53
22.01.30	C	3	75	95
22.01.30	D	4	76	19

df2

date	location	location_id	y	x
22.02.01	A	1	37	67
22.02.02	B	2	45	67
22.01.30	C	3	75	95
22.01.31	D	4	12	23

In df2 each location corresponds to one and only one date. I want to use this location to date map from df2, and apply a function to df1. This function should set x and y columns to zero (or na) for any date that comes after specified date for that location on df2.

So the output df3 would be like this:

df3

date	location	location_id	y	x
22.02.02	A	1
22.02.02	B	2	45	67
22.02.02	C	3
22.02.02	D	4
22.02.01	A	1	37	67
22.02.01	B	2	82	23
22.02.01	C	3
22.02.01	D	4
22.01.31	A	1	61	37
22.01.31	B	2	90	65
22.01.31	C	3
22.01.31	D	4	12	23
22.01.30	A	1	38	48
22.01.30	B	2	57	53
22.01.30	C	3	75	95
22.01.30	D	4	76	19

I do not know how to solve this kind of questions, what is the most efficient way to solve this, is it using dplyr or for loop?! I am approaching this with using dplyr ,groupby and mutate. Something Like this (I do not get result that I want).

Here my code which does not give me the result:

df1 %>%

group_by(location, id_location) %>%

mutate(date =  df2$date, y= df2$y, x = df2$x)

I am new to R and using dplyr.

CodePudding user response：

Here's one approach. First, I make a version of df2 with the dates stored as dates, which'll make it simpler to use them for calculations, and call that date_limits. (It's not strictly necessary here since your date strings' alphabetical sorting will also be chronological, but I think it's good practice.) I don't need the x/y values since they're in df1 already.

library(tidyverse); library(lubridate)
date_limits <- df2 %>%
  mutate(max_date = ymd(date)) %>%
  select(max_date, location, location_id)

Then we can join those dates onto df1 using dplyr::left_join, sort of like vlookup in excel, or merge in base R. It will by default use all the common variables (in this case location and location_id) to bring in the max_date for that location.

Then I change y and x using mutate(across(... so that if the max_date we pulled in is later than the date, change it to NA, otherwise leave it as is.

df1 %>% 
  mutate(date = ymd(date)) %>%
  left_join(date_limits) %>%
  mutate(across(y:x, ~if_else(date > max_date, NA_integer_, .)))

Result

Joining, by = c("location", "location_id")
         date location location_id  y  x   max_date
1  2022-02-02        A           1 NA NA 2022-02-01
2  2022-02-02        B           2 45 67 2022-02-02
3  2022-02-02        C           3 NA NA 2022-01-30
4  2022-02-02        D           4 NA NA 2022-01-31
5  2022-02-01        A           1 37 67 2022-02-01
6  2022-02-01        B           2 82 23 2022-02-02
7  2022-02-01        C           3 NA NA 2022-01-30
8  2022-02-01        D           4 NA NA 2022-01-31
9  2022-01-31        A           1 61 37 2022-02-01
10 2022-01-31        B           2 90 65 2022-02-02
11 2022-01-31        C           3 NA NA 2022-01-30
12 2022-01-31        D           4 12 23 2022-01-31
13 2022-01-30        A           1 38 48 2022-02-01
14 2022-01-30        B           2 57 53 2022-02-02
15 2022-01-30        C           3 75 95 2022-01-30
16 2022-01-30        D           4 76 19 2022-01-31