I am running queries against an SQL Server through R trying to get the sum of values per columns "ID" and "ProductCode" for the last 60 days by grouping. The total expected number of rows is ID * ProductCode * Dates, e.g. 33 IDs * 76.000 ProductCodes * 60 Dates, resulting in around 150 million rows. Do you have any suggestions on how to handle such amounts of data as this needs at least 1 hour to execute?
for(date in 1:nrow(dates)) {
mydf <- dbGetQuery(con,
paste("SELECT ID, ProductCode, sum(Value) AS Value
FROM DBO.Value
WHERE Number NOT IN (100) AND ",
"'", dates[date, 1], "'", "BETWEEN ValidFrom AND ISNULL(ValidTo, '2099-12-31')
Group by ID, ProductCode", sep = ""))
if(nrow(mydf) != 0) {
mydf$Date <- dates[date, 1]
result <- rbind(result, mydf)
} else {
next
}
}
CodePudding user response:
You could create a temporary table in SQL Server (60 rows according to your post).
The date column shouldn't be called date because this would interfere with SQL syntax, let's assume it's called EventDate :
library(DBI)
con <- dbConnect(...)
# Create temporary table
dbWriteTable(con,'#DATES',dates)
mydf <- dbGetQuery(con,
"SELECT EventDate, ID, ProductCode, sum(Value) AS Value
FROM DBO.Value,#DATES
WHERE Number NOT IN (100) AND
EventDate BETWEEN ValidFrom AND ISNULL(ValidTo, '2099-12-31')
Group by EventDate,ID, ProductCode")
# Remove temporary table
dbExecute(con,"DROP TABLE #DATES" )
CodePudding user response:
I think this can be done in a single query, and a bit more efficiently. My trick here is to upload (into a temporary table) the dates you want, then do a join on a date range.
Reproducible sample data
I'll start by generating some fake data. Note that I confound ID and ProductCode here, but it is only to have both columns in this sample. The query should still work as-is.
You should not need to run this portion yourself unless you want to play with "my data". Otherwise, change my mytable to DBO.Value (and perhaps other changes I missed).
set.seed(42)
n <- 1000
mytable <- transform(
data.frame(
ID = sample(LETTERS[1:7], size = n, replace = TRUE),
Value = sample(1e3, size = n, replace = TRUE),
ValidFrom = Sys.Date() - sample(365, size = n, replace = TRUE)
),
ProductCode = ID,
ValidTo = ValidFrom sample(99, size = n, replace = TRUE))
mytable$ValidTo[runif(n) > 0.9] <- NA
head(mytable)
# ID Value ValidFrom ProductCode ValidTo
# 1 A 190 2022-01-26 A 2022-04-25
# 2 E 792 2021-02-08 E <NA>
# 3 A 293 2022-01-23 A 2022-04-26
# 4 A 314 2022-01-19 A 2022-04-13
# 5 B 339 2021-03-15 B 2021-04-09
# 6 D 596 2021-02-12 D 2021-05-01
# conss <- DBI::dbConnect(...)
DBI::dbExecute(conss, "drop table if exists mytable")
DBI::dbWriteTable(conss, "mytable", mytable, create = TRUE)
DBI::dbGetQuery(conss, "select top 3 * from mytable")
Now I have a table mytable on the database with 1000 rows.
Single query
The first step for simplicity is to put the table of 60 dates into a temporary table. I'll call it #mydates. (The # in SQL Server denotes a temporary table that will be automatically removed when this connection is closed. This is not required, but a good practice ... and it will keep your DBAs happy.)
(This part you do need to do, it is not part of my reproducible sample. That is, using your dates, not mine.)
mydates <- data.frame(date = Sys.Date() - 0:59)
head(mydates)
# date
# 1 2022-01-31
# 2 2022-01-30
# 3 2022-01-29
# 4 2022-01-28
# 5 2022-01-27
# 6 2022-01-26
DBI::dbExecute(conss, "drop table if exists #mydates")
# [1] 0
DBI::dbWriteTable(conss, "#mydates", mydates, create = TRUE)
From here, let's run a single query:
ret <- DBI::dbGetQuery(conss,
"select myd.date,
myt.ID, myt.ProductCode, sum(myt.Value) as Value
from #mydates myd
left join mytable myt on
(myd.date between myt.ValidFrom and myt.ValidTo
or
myt.ValidFrom <= myd.date and myt.ValidTo is null)
group by myt.ID, myt.ProductCode, myd.date")
nrow(ret)
# [1] 420
head(ret)
# date ID ProductCode Value
# 1 2021-12-03 A A 14563
# 2 2021-12-04 A A 14563
# 3 2021-12-05 A A 14563
# 4 2021-12-06 A A 14557
# 5 2021-12-07 A A 14963
# 6 2021-12-08 A A 14963
This looks right, as the number of rows returned is what we expect and your description of how many rows you expect out of this:
length(unique(mytable$ID)) * # currently same as $ProductCode
nrow(mydates)
# [1] 420
Verification/Validation
We can do the aggregation here in R to verify that the results are as expected. Because this is a range-join, I will use data.table to demonstrate; dplyr fuzzyjoin or sqldf can do it, it's a bit more difficult in base R.
library(data.table)
setDT(mytable)
setDT(mydates)
ret2 <- mydates[, origdate := date # b/c data.table overwrites 'date' with 'ValidFrom' in the join
][ mytable, on = .(date >= ValidFrom)
][ is.na(ValidTo) | origdate <= ValidTo,
][, .(Value = sum(Value)), by = .(origdate, ID, ProductCode) ]
setorder(ret2, origdate, ID)
setnames(ret2, "origdate", "date")
all.equal(ret, ret2, check.attributes = FALSE)
# [1] TRUE
cbind(head(ret), head(ret2))
# date ID ProductCode Value date ID ProductCode Value
# 1 2021-12-03 A A 14563 2021-12-03 A A 14563
# 2 2021-12-03 B B 19062 2021-12-03 B B 19062
# 3 2021-12-03 C C 13993 2021-12-03 C C 13993
# 4 2021-12-03 D D 15836 2021-12-03 D D 15836
# 5 2021-12-03 E E 17181 2021-12-03 E E 17181
# 6 2021-12-03 F F 9857 2021-12-03 F F 9857
PS: download speeds
FYI, downloading 150 million rows over DBI is going to be rather slow. I strongly suggest using sqlcmd or bcp to do a bulk-query, as your download times will be significantly shorter. Crafting that command-line is a bit out-of-scope for this answer, but the docs are fairly decent. See https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-migrate-bcp.
