Pyspark add new column of period starting year and handle leap years-CodePudding

I have a data frame and I am trying to add a column with the target_date's period starting date. But I am getting null due to the leap year starting date. Appreciate your help here.

 ----- ---------- ---------- ------------ ------- ----------------------- 
|   id|start_date|  end_date|target_date_|period_|target_date_fiscal_year|
 ----- ---------- ---------- ------------ ------- ----------------------- 
|34667|2017-12-30|2022-12-30|  2021-11-30|      5|                   2020|
|47353|2020-02-10|2023-02-10|  2021-11-30|      3|                   2021|
|94773|2017-04-15|2022-04-15|  2021-11-30|      5|                   2021|
|67324|2017-11-25|2022-11-25|  2021-11-30|      5|                   2021|
|45688|2020-02-29|2025-02-28|  2021-11-30|      5|                   2021|
 ----- ---------- ---------- ------------ ------- -----------------------

Expected output:

 ----- ---------- ---------- ------------ ------- ----------------------- -------------------- 
|   id|start_date|  end_date|target_date_|period_|target_date_fiscal_year|period_starting_date| 
 ----- ---------- ---------- ------------ ------- ----------------------- -------------------- 
|34667|2017-12-30|2022-12-30|  2021-11-30|      5|                   2020|          2020-12-30|
|47353|2020-02-10|2023-02-10|  2021-11-30|      3|                   2021|          2021-02-10|
|94773|2017-04-15|2022-04-15|  2021-11-30|      5|                   2021|          2021-04-15|
|67324|2017-11-25|2022-11-25|  2021-11-30|      5|                   2021|          2021-11-25|
|45688|2020-02-29|2025-02-28|  2021-11-30|      5|                   2021|          2021-02-28|
 ----- ---------- ---------- ------------ ------- ----------------------- --------------------

I tried the below code and did not get the correct output.

df.withColumn("period_starting_date", F.concat(F.col('target_date_fiscal_year'),
 F.substring(F.col("start_date"), -6, 6)).cast('date')).show()

 ----- ---------- ---------- ------------ ------- ----------------------- -------------------- 
|   id|start_date|  end_date|target_date_|period_|target_date_fiscal_year|period_starting_date| 
 ----- ---------- ---------- ------------ ------- ----------------------- -------------------- 
|34667|2017-12-30|2022-12-30|  2021-11-30|      5|                   2020|          2020-12-30|
|47353|2020-02-10|2023-02-10|  2021-11-30|      3|                   2021|          2021-02-10|
|94773|2017-04-15|2022-04-15|  2021-11-30|      5|                   2021|          2021-04-15|
|67324|2017-11-25|2022-11-25|  2021-11-30|      5|                   2021|          2021-11-25|
|45688|2020-02-29|2025-02-28|  2021-11-30|      5|                   2021|                null|
 ----- ---------- ---------- ------------ ------- ----------------------- --------------------

CodePudding user response：

You can calculate the difference between target_date_fiscal_year and year of start_date, then add the result to start_date to get the period_starting_date:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "period_starting_date",
    F.to_date("start_date")   F.format_string(
        "interval %s year", F.col("target_date_fiscal_year") - F.year("start_date")
    ).cast("interval")
)

df1.show()

# ----- ---------- ---------- ------------ ------- ----------------------- -------------------- 
#|   id|start_date|  end_date|target_date_|period_|target_date_fiscal_year|period_starting_date|
# ----- ---------- ---------- ------------ ------- ----------------------- -------------------- 
#|34667|2017-12-30|2022-12-30|  2021-11-30|      5|                   2020|          2020-12-30|
#|47353|2020-02-10|2023-02-10|  2021-11-30|      3|                   2021|          2021-02-10|
#|94773|2017-04-15|2022-04-15|  2021-11-30|      5|                   2021|          2021-04-15|
#|67324|2017-11-25|2022-11-25|  2021-11-30|      5|                   2021|          2021-11-25|
#|45688|2020-02-29|2025-02-28|  2021-11-30|      5|                   2021|          2021-02-28|
# ----- ---------- ---------- ------------ ------- ----------------------- --------------------

CodePudding user response：

In Python there's a beautiful package called dateutil, which can help you with your issue.

NOTE: You did not add the code, hence, can't check if this is 100% correct.

from dateutil.relativedelta import relativedelta

def delta_creator(df):
    delta = df['target_date_fiscal_year'] - df['start_date'].dt.year 
    df['period_starting_date'] = df['start_date']   relativedelta(years=delta)
    return df
    
df = df.apply(delta_creator, axis=1)