I have two dataframe(df1, df2) with different size . The df2 is subset of df1.
Date columns are formatted as date (y.m.d). (Location and location_id have a one to one mapping.)
tl;dr my goal is to use df2’s date to filter any date that comes after that for its specific location. df1
| date | location | location_id | y | x |
|---|---|---|---|---|
| 22.02.02 | A | 1 | 26 | 78 |
| 22.02.02 | B | 2 | 45 | 67 |
| 22.02.02 | C | 3 | 46 | 57 |
| 22.02.02 | D | 4 | 27 | 89 |
| 22.02.01 | A | 1 | 37 | 67 |
| 22.02.01 | B | 2 | 82 | 23 |
| 22.02.01 | C | 3 | 56 | 68 |
| 22.02.01 | D | 4 | 67 | 90 |
| 22.01.31 | A | 1 | 61 | 37 |
| 22.01.31 | B | 2 | 90 | 65 |
| 22.01.31 | C | 3 | 43 | 34 |
| 22.01.31 | D | 4 | 12 | 23 |
| 22.01.30 | A | 1 | 38 | 48 |
| 22.01.30 | B | 2 | 57 | 53 |
| 22.01.30 | C | 3 | 75 | 95 |
| 22.01.30 | D | 4 | 76 | 19 |
df2
| date | location | location_id | y | x |
|---|---|---|---|---|
| 22.02.01 | A | 1 | 37 | 67 |
| 22.02.02 | B | 2 | 45 | 67 |
| 22.01.30 | C | 3 | 75 | 95 |
| 22.01.31 | D | 4 | 12 | 23 |
In df2 each location corresponds to one and only one date. I want to use this location to date map from df2, and apply a function to df1. This function should set x and y columns to zero (or na) for any date that comes after specified date for that location on df2.
So the output df3 would be like this:
df3
| date | location | location_id | y | x |
|---|---|---|---|---|
| 22.02.02 | A | 1 | ||
| 22.02.02 | B | 2 | 45 | 67 |
| 22.02.02 | C | 3 | ||
| 22.02.02 | D | 4 | ||
| 22.02.01 | A | 1 | 37 | 67 |
| 22.02.01 | B | 2 | 82 | 23 |
| 22.02.01 | C | 3 | ||
| 22.02.01 | D | 4 | ||
| 22.01.31 | A | 1 | 61 | 37 |
| 22.01.31 | B | 2 | 90 | 65 |
| 22.01.31 | C | 3 | ||
| 22.01.31 | D | 4 | 12 | 23 |
| 22.01.30 | A | 1 | 38 | 48 |
| 22.01.30 | B | 2 | 57 | 53 |
| 22.01.30 | C | 3 | 75 | 95 |
| 22.01.30 | D | 4 | 76 | 19 |
I do not know how to solve this kind of questions, what is the most efficient way to solve this, is it using dplyr or for loop?! I am approaching this with using dplyr ,groupby and mutate. Something Like this (I do not get result that I want).
Here my code which does not give me the result:
df1 %>%
group_by(location, id_location) %>%
mutate(date = df2$date, y= df2$y, x = df2$x)
I am new to R and using dplyr.
CodePudding user response:
Here's one approach. First, I make a version of df2 with the dates stored as dates, which'll make it simpler to use them for calculations, and call that date_limits. (It's not strictly necessary here since your date strings' alphabetical sorting will also be chronological, but I think it's good practice.) I don't need the x/y values since they're in df1 already.
library(tidyverse); library(lubridate)
date_limits <- df2 %>%
mutate(max_date = ymd(date)) %>%
select(max_date, location, location_id)
Then we can join those dates onto df1 using dplyr::left_join, sort of like vlookup in excel, or merge in base R. It will by default use all the common variables (in this case location and location_id) to bring in the max_date for that location.
Then I change y and x using mutate(across(... so that if the max_date we pulled in is later than the date, change it to NA, otherwise leave it as is.
df1 %>%
mutate(date = ymd(date)) %>%
left_join(date_limits) %>%
mutate(across(y:x, ~if_else(date > max_date, NA_integer_, .)))
Result
Joining, by = c("location", "location_id")
date location location_id y x max_date
1 2022-02-02 A 1 NA NA 2022-02-01
2 2022-02-02 B 2 45 67 2022-02-02
3 2022-02-02 C 3 NA NA 2022-01-30
4 2022-02-02 D 4 NA NA 2022-01-31
5 2022-02-01 A 1 37 67 2022-02-01
6 2022-02-01 B 2 82 23 2022-02-02
7 2022-02-01 C 3 NA NA 2022-01-30
8 2022-02-01 D 4 NA NA 2022-01-31
9 2022-01-31 A 1 61 37 2022-02-01
10 2022-01-31 B 2 90 65 2022-02-02
11 2022-01-31 C 3 NA NA 2022-01-30
12 2022-01-31 D 4 12 23 2022-01-31
13 2022-01-30 A 1 38 48 2022-02-01
14 2022-01-30 B 2 57 53 2022-02-02
15 2022-01-30 C 3 75 95 2022-01-30
16 2022-01-30 D 4 76 19 2022-01-31
