Handling 150 million rows with R and SQL-CodePudding

I am running queries against an SQL Server through R trying to get the sum of values per columns "ID" and "ProductCode" for the last 60 days by grouping. The total expected number of rows is ID * ProductCode * Dates, e.g. 33 IDs * 76.000 ProductCodes * 60 Dates, resulting in around 150 million rows. Do you have any suggestions on how to handle such amounts of data as this needs at least 1 hour to execute?

    for(date in 1:nrow(dates)) {
      mydf <- dbGetQuery(con, 
             paste("SELECT ID, ProductCode, sum(Value) AS Value
             FROM DBO.Value
             WHERE Number NOT IN (100) AND ", 
             "'", dates[date, 1], "'", "BETWEEN ValidFrom AND ISNULL(ValidTo, '2099-12-31')
             Group by ID, ProductCode", sep = ""))
      
      if(nrow(mydf) != 0) {
         mydf$Date <- dates[date, 1]
         result <- rbind(result, mydf)
      } else {
         next
      }
}

CodePudding user response：

You could create a temporary table in SQL Server (60 rows according to your post).

The date column shouldn't be called date because this would interfere with SQL syntax, let's assume it's called EventDate :

library(DBI)
con <- dbConnect(...)

# Create temporary table
dbWriteTable(con,'#DATES',dates)

mydf <- dbGetQuery(con, 
             "SELECT EventDate, ID, ProductCode, sum(Value) AS Value
             FROM DBO.Value,#DATES
             WHERE Number NOT IN (100) AND 
             EventDate BETWEEN ValidFrom AND ISNULL(ValidTo, '2099-12-31')
             Group by EventDate,ID, ProductCode")

# Remove temporary table
dbExecute(con,"DROP TABLE #DATES" )

CodePudding user response：

I think this can be done in a single query, and a bit more efficiently. My trick here is to upload (into a temporary table) the dates you want, then do a join on a date range.

Reproducible sample data

I'll start by generating some fake data. Note that I confound ID and ProductCode here, but it is only to have both columns in this sample. The query should still work as-is.

You should not need to run this portion yourself unless you want to play with "my data". Otherwise, change my mytable to DBO.Value (and perhaps other changes I missed).

set.seed(42)
n <- 1000
mytable <- transform(
  data.frame(
    ID = sample(LETTERS[1:7], size = n, replace = TRUE),
    Value = sample(1e3, size = n, replace = TRUE),
    ValidFrom = Sys.Date() - sample(365, size = n, replace = TRUE)
  ),
  ProductCode = ID,
  ValidTo = ValidFrom   sample(99, size = n, replace = TRUE))
mytable$ValidTo[runif(n) > 0.9] <- NA
head(mytable)
#   ID Value  ValidFrom ProductCode    ValidTo
# 1  A   190 2022-01-26           A 2022-04-25
# 2  E   792 2021-02-08           E       <NA>
# 3  A   293 2022-01-23           A 2022-04-26
# 4  A   314 2022-01-19           A 2022-04-13
# 5  B   339 2021-03-15           B 2021-04-09
# 6  D   596 2021-02-12           D 2021-05-01

# conss <- DBI::dbConnect(...)
DBI::dbExecute(conss, "drop table if exists mytable")
DBI::dbWriteTable(conss, "mytable", mytable, create = TRUE)
DBI::dbGetQuery(conss, "select top 3 * from mytable")

Now I have a table mytable on the database with 1000 rows.

Single query

The first step for simplicity is to put the table of 60 dates into a temporary table. I'll call it #mydates. (The # in SQL Server denotes a temporary table that will be automatically removed when this connection is closed. This is not required, but a good practice ... and it will keep your DBAs happy.)

(This part you do need to do, it is not part of my reproducible sample. That is, using your dates, not mine.)

mydates <- data.frame(date = Sys.Date() - 0:59)
head(mydates)
#         date
# 1 2022-01-31
# 2 2022-01-30
# 3 2022-01-29
# 4 2022-01-28
# 5 2022-01-27
# 6 2022-01-26
DBI::dbExecute(conss, "drop table if exists #mydates")
# [1] 0
DBI::dbWriteTable(conss, "#mydates", mydates, create = TRUE)

From here, let's run a single query:

ret <- DBI::dbGetQuery(conss,
  "select myd.date,
     myt.ID, myt.ProductCode, sum(myt.Value) as Value
   from #mydates myd
     left join mytable myt on
       (myd.date between myt.ValidFrom and myt.ValidTo
          or
        myt.ValidFrom <= myd.date and myt.ValidTo is null)
   group by myt.ID, myt.ProductCode, myd.date")

nrow(ret)
# [1] 420

head(ret)
#         date ID ProductCode Value
# 1 2021-12-03  A           A 14563
# 2 2021-12-04  A           A 14563
# 3 2021-12-05  A           A 14563
# 4 2021-12-06  A           A 14557
# 5 2021-12-07  A           A 14963
# 6 2021-12-08  A           A 14963

This looks right, as the number of rows returned is what we expect and your description of how many rows you expect out of this:

length(unique(mytable$ID)) * # currently same as $ProductCode
  nrow(mydates)
# [1] 420

Verification/Validation

We can do the aggregation here in R to verify that the results are as expected. Because this is a range-join, I will use data.table to demonstrate; dplyr fuzzyjoin or sqldf can do it, it's a bit more difficult in base R.

library(data.table)
setDT(mytable)
setDT(mydates)

ret2 <- mydates[, origdate := date   # b/c data.table overwrites 'date' with 'ValidFrom' in the join
                ][ mytable, on = .(date >= ValidFrom)
                  ][ is.na(ValidTo) | origdate <= ValidTo,
                    ][, .(Value = sum(Value)), by = .(origdate, ID, ProductCode) ]
setorder(ret2, origdate, ID)
setnames(ret2, "origdate", "date")

all.equal(ret, ret2, check.attributes = FALSE)
# [1] TRUE
cbind(head(ret), head(ret2))
#         date ID ProductCode Value       date ID ProductCode Value
# 1 2021-12-03  A           A 14563 2021-12-03  A           A 14563
# 2 2021-12-03  B           B 19062 2021-12-03  B           B 19062
# 3 2021-12-03  C           C 13993 2021-12-03  C           C 13993
# 4 2021-12-03  D           D 15836 2021-12-03  D           D 15836
# 5 2021-12-03  E           E 17181 2021-12-03  E           E 17181
# 6 2021-12-03  F           F  9857 2021-12-03  F           F  9857

PS: download speeds

FYI, downloading 150 million rows over DBI is going to be rather slow. I strongly suggest using sqlcmd or bcp to do a bulk-query, as your download times will be significantly shorter. Crafting that command-line is a bit out-of-scope for this answer, but the docs are fairly decent. See https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-migrate-bcp.