I'm trying to create an algorithm in R that can iterate through a long form data frame and for each 'value', look through the full data frame, create a subset based on the 'Date' column for the past fiscal year, and then get the % of values for that variable that are below it. And then get the % below specific to that industry as well.
My data is currently structured in this format:
| Date | Industry | variable | value |
|---|---|---|---|
| 2022-02-01 | Industry1 | variable1 | 43.2 |
| 2022-04-01 | Industry1 | variable2 | 17.8 |
| 2022-06-01 | Industry2 | variable3 | 22.5 |
| 2021-12-01 | Industry3 | variable1 | 98.3 |
My code is currently below. It works fine and gets me the desired values but it runs very slowly and would take hours to run through the total 500k rows. I'm wondering if there is maybe some sorting algorithm that could produce a quicker result?
Any help would be very appreciated. Thank you!
for (row in 1:nrow(df)) {
checkDate <- df[row, "Date"]
industry <- df[row, "Industry"]
variable <- df[row, "variable"]
val <- as.numeric(df[row, "value"])
filtered_df <- df[which(df$variable == variable), ]
fiscal_year <- filtered_df[which(filtered_df$Date <= checkDate &
filtered_df$Date >= (checkDate - 366)), ]
industry_df <- fiscal_year[which(fiscal_year$Industry == industry), ]
df[row, "Total"] <- length(which(fiscal_year$value < val))/
length(fiscal_year$value)
df[row, "Industry"] <- length(which(industry_df$value < val))/
length(industry_df$value)
}
CodePudding user response:
I don't have R handy, and it will take a lot of work, but I'll tell you the strategy.
First, sort by variable, then checkDate.
Now start with 3 variables, row, row_min and row_max. And also take something like rbst (it is the right data structure) and modify it to keep track of counts of the size of its left and right tree structure in addition to the tree structures. From this extra data, you an create a O(log(n)) lookup to count how many things in your tree are less than or equal to a particular amount.
And then in pseudocode:
set row, row_min, and row_max to the start of the data frame
create your tree, sorted by value, with only row in it
while True:
while row_max.variable == row.variable
and row_max.next() exists
and row_max.next().variable == row.variable
and row_max.next().checkDate == row.checkDate:
insert row_max into tree
row_max = row_max.next()
while row_min.variable != row.variable:
delete row_min from tree
row_min = row_min.next()
while row_min.checkDate < row.checkDate - 266:
delete row_min from tree
row_min = row_min.next()
row.total = (count of things in tree below row.value) / (position of row_max - position of row_min 1)
if row.next() exists:
row = row.next()
else
break # EXIT LOOP HERE!
You'll then need to sort by industry, variable, and checkDate, repeat the logic (except with checks on industry in addition to variable), and calculate your Industry calculation.
This is a lot of work but will take the calculation from O(n^2) to O(n log(n)).
CodePudding user response:
I don't have time to provide a total solution. But you may want to consider using the ecdf (empirical cumulative distribution function) to generate the percentile values. Here is an example using mtcars with mpg as the target metric:
cars <- mtcars
Fn <- ecdf(cars$mpg)
cars$ecdf <- Fn(cars$mpg)
head(cars)
mpg cyl disp hp drat wt qsec vs am gear carb ecdf
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.62500
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.62500
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.78125
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0.68750
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.46875
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0.43750
The ecdf column in the dataframe is the fraction of mpg values that are below the row mpg value.
Your complete solution would group_by extracted year AND variable and also group_by year AND Industry AND variable, applying the ecdf function to the value column for each cluster of groups.
