Are there any algorithms to quickly iterate through 500k rows and get the number of values smaller t-CodePudding

I'm trying to create an algorithm in R that can iterate through a long form data frame and for each 'value', look through the full data frame, create a subset based on the 'Date' column for the past fiscal year, and then get the % of values for that variable that are below it. And then get the % below specific to that industry as well.

My data is currently structured in this format:

Date	Industry	variable	value
2022-02-01	Industry1	variable1	43.2
2022-04-01	Industry1	variable2	17.8
2022-06-01	Industry2	variable3	22.5
2021-12-01	Industry3	variable1	98.3

My code is currently below. It works fine and gets me the desired values but it runs very slowly and would take hours to run through the total 500k rows. I'm wondering if there is maybe some sorting algorithm that could produce a quicker result?

Any help would be very appreciated. Thank you!

for (row in 1:nrow(df)) {
  checkDate <- df[row, "Date"]
  industry <- df[row, "Industry"]
  variable <- df[row, "variable"]
  val <- as.numeric(df[row, "value"])
  filtered_df <- df[which(df$variable == variable), ]
  fiscal_year <- filtered_df[which(filtered_df$Date <= checkDate & 
                                     filtered_df$Date >= (checkDate - 366)), ]
  industry_df <- fiscal_year[which(fiscal_year$Industry == industry), ]
  df[row, "Total"] <- length(which(fiscal_year$value < val))/
    length(fiscal_year$value)
  df[row, "Industry"] <- length(which(industry_df$value < val))/
    length(industry_df$value)
}

CodePudding user response：

I don't have R handy, and it will take a lot of work, but I'll tell you the strategy.

First, sort by variable, then checkDate.

Now start with 3 variables, row, row_min and row_max. And also take something like rbst (it is the right data structure) and modify it to keep track of counts of the size of its left and right tree structure in addition to the tree structures. From this extra data, you an create a O(log(n)) lookup to count how many things in your tree are less than or equal to a particular amount.

And then in pseudocode:

set row, row_min, and row_max to the start of the data frame
create your tree, sorted by value, with only row in it
while True:
    while row_max.variable == row.variable
      and row_max.next() exists
      and row_max.next().variable == row.variable
      and row_max.next().checkDate == row.checkDate:
        insert row_max into tree
        row_max = row_max.next()
    while row_min.variable != row.variable:
        delete row_min from tree
        row_min = row_min.next()
    while row_min.checkDate < row.checkDate - 266:
        delete row_min from tree
        row_min = row_min.next()
    row.total = (count of things in tree below row.value) / (position of row_max - position of row_min   1)
    if row.next() exists:
        row = row.next()
    else
        break # EXIT LOOP HERE!

You'll then need to sort by industry, variable, and checkDate, repeat the logic (except with checks on industry in addition to variable), and calculate your Industry calculation.

This is a lot of work but will take the calculation from O(n^2) to O(n log(n)).

CodePudding user response：

I don't have time to provide a total solution. But you may want to consider using the ecdf (empirical cumulative distribution function) to generate the percentile values. Here is an example using mtcars with mpg as the target metric:

cars <- mtcars
Fn <- ecdf(cars$mpg)
cars$ecdf <- Fn(cars$mpg)
head(cars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb    ecdf
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.62500
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.62500
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.78125
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.68750
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.46875
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.43750

The ecdf column in the dataframe is the fraction of mpg values that are below the row mpg value.

Your complete solution would group_by extracted year AND variable and also group_by year AND Industry AND variable, applying the ecdf function to the value column for each cluster of groups.