I have the following table
| LoanID | Event_count | StartDateTime | EndDateTime | Frequency | Amount |
|---|---|---|---|---|---|
| 12 | 3 | 2020-09-01T00:00:00Z | 2020-12-01T00:00:00Z | Monthly | 120 |
| 99 | 4 | 2021-01-01T00:00:00Z | 2021-10-01T00:00:00Z | Quarterly | 50 |
Column definitions
- Event_count is the number of times a repayment is made.
- StartDateTime is the time of the first payment.
- EndDateTime is the date of the last payment.
- Frequency is the interval of payment.
- Amount is the sum that is paid back each time.
How do I transform this to the format below? (without using loops as they are not supported by Databricks Spark SQL)
| Date | LoanID | RepaymentAmount | RepaymentNumber |
|---|---|---|---|
| 2020-09-01 | 12 | 120 | 1 |
| 2020-10-01 | 12 | 120 | 2 |
| 2020-11-01 | 12 | 120 | 3 |
| 2021-01-01 | 99 | 50 | 1 |
| 2021-04-01 | 99 | 50 | 2 |
| 2021-07-01 | 99 | 50 | 3 |
| 2021-10-01 | 99 | 50 | 4 |
CodePudding user response:
As per today , Databricks SQL is being solved as federated query engine for most of the use cases . So this is not being used for any transformation use case . Thats why Spark SQL support is for basic SQL queries only . So procedure-oriented queries (Like for, while) are not supported on Spark. This would fall under a new feature request and may be implemented in future. You should handle through pyspark or spark SQL .
CodePudding user response:
You should be able to use the range function with a CROSS JOIN, a simple example:
%sql
SELECT *
FROM tmp t
CROSS JOIN range(1, 99) r
WHERE r.id <= t.Event_count
Just a word of caution on the event count - the second argument to range should be the max number of events you can have, I've just guessed at 99 here, plus cross joins can be slow - please test with your data
