I have a data frame and I am trying to add a column with the target_date's period starting date. But I am getting null due to the leap year starting date. Appreciate your help here.
----- ---------- ---------- ------------ ------- -----------------------
| id|start_date| end_date|target_date_|period_|target_date_fiscal_year|
----- ---------- ---------- ------------ ------- -----------------------
|34667|2017-12-30|2022-12-30| 2021-11-30| 5| 2020|
|47353|2020-02-10|2023-02-10| 2021-11-30| 3| 2021|
|94773|2017-04-15|2022-04-15| 2021-11-30| 5| 2021|
|67324|2017-11-25|2022-11-25| 2021-11-30| 5| 2021|
|45688|2020-02-29|2025-02-28| 2021-11-30| 5| 2021|
----- ---------- ---------- ------------ ------- -----------------------
Expected output:
----- ---------- ---------- ------------ ------- ----------------------- --------------------
| id|start_date| end_date|target_date_|period_|target_date_fiscal_year|period_starting_date|
----- ---------- ---------- ------------ ------- ----------------------- --------------------
|34667|2017-12-30|2022-12-30| 2021-11-30| 5| 2020| 2020-12-30|
|47353|2020-02-10|2023-02-10| 2021-11-30| 3| 2021| 2021-02-10|
|94773|2017-04-15|2022-04-15| 2021-11-30| 5| 2021| 2021-04-15|
|67324|2017-11-25|2022-11-25| 2021-11-30| 5| 2021| 2021-11-25|
|45688|2020-02-29|2025-02-28| 2021-11-30| 5| 2021| 2021-02-28|
----- ---------- ---------- ------------ ------- ----------------------- --------------------
I tried the below code and did not get the correct output.
df.withColumn("period_starting_date", F.concat(F.col('target_date_fiscal_year'),
F.substring(F.col("start_date"), -6, 6)).cast('date')).show()
----- ---------- ---------- ------------ ------- ----------------------- --------------------
| id|start_date| end_date|target_date_|period_|target_date_fiscal_year|period_starting_date|
----- ---------- ---------- ------------ ------- ----------------------- --------------------
|34667|2017-12-30|2022-12-30| 2021-11-30| 5| 2020| 2020-12-30|
|47353|2020-02-10|2023-02-10| 2021-11-30| 3| 2021| 2021-02-10|
|94773|2017-04-15|2022-04-15| 2021-11-30| 5| 2021| 2021-04-15|
|67324|2017-11-25|2022-11-25| 2021-11-30| 5| 2021| 2021-11-25|
|45688|2020-02-29|2025-02-28| 2021-11-30| 5| 2021| null|
----- ---------- ---------- ------------ ------- ----------------------- --------------------
CodePudding user response:
You can calculate the difference between target_date_fiscal_year and year of start_date, then add the result to start_date to get the period_starting_date:
from pyspark.sql import functions as F
df1 = df.withColumn(
"period_starting_date",
F.to_date("start_date") F.format_string(
"interval %s year", F.col("target_date_fiscal_year") - F.year("start_date")
).cast("interval")
)
df1.show()
# ----- ---------- ---------- ------------ ------- ----------------------- --------------------
#| id|start_date| end_date|target_date_|period_|target_date_fiscal_year|period_starting_date|
# ----- ---------- ---------- ------------ ------- ----------------------- --------------------
#|34667|2017-12-30|2022-12-30| 2021-11-30| 5| 2020| 2020-12-30|
#|47353|2020-02-10|2023-02-10| 2021-11-30| 3| 2021| 2021-02-10|
#|94773|2017-04-15|2022-04-15| 2021-11-30| 5| 2021| 2021-04-15|
#|67324|2017-11-25|2022-11-25| 2021-11-30| 5| 2021| 2021-11-25|
#|45688|2020-02-29|2025-02-28| 2021-11-30| 5| 2021| 2021-02-28|
# ----- ---------- ---------- ------------ ------- ----------------------- --------------------
CodePudding user response:
In Python there's a beautiful package called dateutil, which can help you with your issue.
NOTE: You did not add the code, hence, can't check if this is 100% correct.
from dateutil.relativedelta import relativedelta
def delta_creator(df):
delta = df['target_date_fiscal_year'] - df['start_date'].dt.year
df['period_starting_date'] = df['start_date'] relativedelta(years=delta)
return df
df = df.apply(delta_creator, axis=1)
