I have a list of 106 tibbles, which are time series, within data_sensor. Every tibble has two columns with Date and temperature respectively.
On the other hand I have a list of 106 dates in date_admin which contain the dates where I want my time series to end by tibble.
The code works properly but using the nested for-loop it takes too much time as the average number of rows is close to 10000th per tibble.
library(tidyverse)
library(dplyr)
#List nesting all the dataframes of all the xls files
files <- dir("C:/User*inals", pattern = "\\.xls$", full.names = TRUE)
data_sensor <- lapply(files, read_xls)
##List nesting all the dataframes of all the xlsx files
filesx <- dir("C:/Us******ls", pattern = "\\.xlsx$", full.names = TRUE)
data_generic <- lapply(filesx, read_xlsx)
idxend=vector()
for (i in seq_along(data_sensor)){
for (j in seq_along(data_sensor[[i]][[1]])){
if (as.Date(data_sensor[[i]][[1]][[j]]) < as.Date(date_admin[i])){
data_sensor[[i]][[1]][[j]] = data_sensor[[i]][[1]][[j]]
} else{ #Convert all the elements after condition to NA's
data_sensor[[i]][[1]][[j]] = NA
data_sensor[[i]][[2]][[j]] = NA
}
}
#Drop all NA's
for (i in seq_along(data_sensor)){
data_sensor[[i]] = drop_na(data_sensor[[i]])
}
}
In order to clarify my list of tibbles and vector:
> data_sensor[[1]][[1]][[1]]
[1] "2018-08-07 11:00:31 UTC"
> data_sensor[[1]][[2]][[1]]
[1] 6.3
> data_sensor[[2]][[1]][[1]]
[1] "2018-08-08 11:56:05 UTC"
#data_sensor[[index of list]][[column of tibble(date,Temperature)]][[row of tibble]]
> date_admin
[1] "2018-10-07 UTC" "2018-12-29 UTC" "2018-12-13 UTC" "2019-08-09 UTC" "2019-10-10 UTC"
[6] "2019-04-26 UTC" "2018-11-21 UTC" "2018-08-23 UTC" "2019-07-08 UTC" "2019-11-19 UTC"
[11] "2019-11-07 UTC" "2018-09-05 UTC" "2018-09-03 UTC" "2018-09-24 UTC" "2018-10-11 UTC"
[16] "2018-09-25 UTC" "2019-03-29 UTC" "2018-08-20 UTC" "2018-09-17 UTC" "2019-03-30 UTC"
[21] "2018-11-07 UTC" "2019-01-01 UTC" "2018-08-31 UTC" "2019-03-27 UTC" "2019-11-10 UTC"
[26] "2019-04-04 UTC" "2019-10-18 UTC" "2018-09-06 UTC" "2018-09-23 UTC" "2018-09-22 UTC"
[31] "2019-07-22 UTC" "2018-09-04 UTC" "2019-05-17 UTC" "2018-11-05 UTC" "2018-12-09 UTC"
[36] "2018-09-03 UTC" "2019-05-21 UTC" "2019-02-22 UTC" "2018-08-30 UTC" "2019-06-04 UTC"
[41] "2018-09-13 UTC" "2018-10-14 UTC" "2019-11-08 UTC" "2018-08-30 UTC" "2019-04-12 UTC"
[46] "2018-09-24 UTC" "2018-08-22 UTC" "2018-08-30 UTC" "2018-09-07 UTC" "2018-11-11 UTC"
[51] "2018-11-01 UTC" "2018-10-01 UTC" "2018-10-22 UTC" "2018-12-03 UTC" "2019-06-06 UTC"
[56] "2018-09-09 UTC" "2018-09-10 UTC" "2018-09-24 UTC" "2018-10-11 UTC" "2018-11-30 UTC"
[61] "2018-09-20 UTC" "2019-11-20 UTC" "2018-10-11 UTC" "2018-10-09 UTC" "2018-09-27 UTC"
[66] "2019-11-11 UTC" "2018-10-04 UTC" "2018-09-14 UTC" "2019-04-27 UTC" "2018-09-04 UTC"
[71] "2018-09-11 UTC" "2018-08-14 UTC" "2018-09-01 UTC" "2018-10-01 UTC" "2018-09-25 UTC"
[76] "2018-09-28 UTC" "2018-09-29 UTC" "2018-10-11 UTC" "2019-03-26 UTC" "2018-10-26 UTC"
[81] "2018-11-21 UTC" "2018-12-02 UTC" "2018-09-08 UTC" "2019-01-08 UTC" "2018-11-07 UTC"
[86] "2019-02-05 UTC" "2019-01-21 UTC" "2018-09-11 UTC" "2018-12-17 UTC" "2019-01-15 UTC"
[91] "2018-08-28 UTC" "2019-01-08 UTC" "2019-05-14 UTC" "2019-01-21 UTC" "2018-11-12 UTC"
[96] "2018-10-26 UTC" "2019-12-26 UTC" "2020-01-03 UTC" "2020-01-06 UTC" "2020-02-26 UTC"
[101] "2020-02-14 UTC" "2020-01-27 UTC" "2020-01-21 UTC" "2020-03-16 UTC" "2020-02-26 UTC"
[106] "2019-12-31 UTC"
data_sensor[[1]]
date Temperature
1 2018-08-07 11:00:31 6.3
2 2018-08-07 11:10:31 11.4
3 2018-08-07 11:20:31 12.0
4 2018-08-07 11:30:31 13.7
5 2018-08-07 11:40:31 15.6
6 2018-08-07 11:50:31 13.6
7 2018-08-07 12:00:31 12.2
8 2018-08-07 12:10:31 11.2
9 2018-08-07 12:20:31 11.6
...............................
...............................
...............................
499 2018-08-10 22:00:31 9.7
500 2018-08-10 22:10:31 9.6
[ reached 'max' / getOption("max.print") -- omitted 8592 rows ]
It takes some minutes to clean the data through the nested for loops. How can I improve the performance of my code?
CodePudding user response:
If you convert the whole thing in a vectorized sub-setting operation it should be an order of magnitude quicker e.g.:
for (i in seq_along(data_sensor)){
data_sensor[[i]]<- data_sensor[[i]][as.Date(data_sensor[[i]]$date)<as.Date(date_admin[i]),]
}
For loops are generally some what slow it is better to avoid nested loops and use vectorized operations where possible
ps by lack of data I could not try this
CodePudding user response:
Absolutely don't do it in a loop !! There are much more efficient methods for such operations. I'll show you how to do it. But first I need to generate some data. For this, I created two little functions. rndDate randomizes the start date from "1/1/2018" to "12/31/2020", while fDateSensor returns tibble with a time series every 10 minutes.
rndDate = function(start_date=ymd("20180101"), end_date=ymd("20201231")){
sample(seq(start_date, end_date, "days"), 1)}
fDateSensor = function(n) tibble(
date = rndDate() 1:n*dminutes(10),
Temperature = rnorm(n)
)
fDateSensor(5)
output
# A tibble: 5 x 2
date Temperature
<dttm> <dbl>
1 2019-09-27 00:10:00 -0.511
2 2019-09-27 00:20:00 0.531
3 2019-09-27 00:30:00 1.42
4 2019-09-27 00:40:00 0.252
5 2019-09-27 00:50:00 -0.570
Now I'm going to make a tibble with internal tibble. First, for two dates, date_admin.
nDateSensor = 10
set.seed(1234)
date_admin = c("2018-10-07", "2019-07-29")
data_sensor =
tibble(
file = paste("file",1:length(date_admin)),
date_admin = date_admin
) %>%
mutate(data_sensor = map(file, ~fDateSensor(nDateSensor)))
data_sensor
output
# A tibble: 2 x 3
file date_admin data_sensor
<chr> <chr> <list>
1 file 1 2018-10-07 <tibble [10 x 2]>
2 file 2 2019-07-29 <tibble [10 x 2]>
As you can see, I simulated reading two files. Their content is in the variable data_sensor which is tibble of size 10x2.
data_sensor$data_sensor
[[1]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2020-10-14 00:10:00 0.314
2 2020-10-14 00:20:00 0.359
3 2020-10-14 00:30:00 -0.730
4 2020-10-14 00:40:00 0.0357
5 2020-10-14 00:50:00 0.113
6 2020-10-14 01:00:00 1.43
7 2020-10-14 01:10:00 0.983
8 2020-10-14 01:20:00 -0.622
9 2020-10-14 01:30:00 -0.732
10 2020-10-14 01:40:00 -0.517
[[2]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2019-07-28 00:10:00 -0.776
2 2019-07-28 00:20:00 0.0645
3 2019-07-28 00:30:00 0.959
4 2019-07-28 00:40:00 -0.110
5 2019-07-28 00:50:00 -0.511
6 2019-07-28 01:00:00 -0.911
7 2019-07-28 01:10:00 -0.837
8 2019-07-28 01:20:00 2.42
9 2019-07-28 01:30:00 0.134
10 2019-07-28 01:40:00 -0.491
Now for the most important moment. We will build a function f to modify our internal tibble according to your expectations.
f = function(data) {
data$data_sensor[[1]] = data$data_sensor[[1]] %>% mutate(
date = ifelse(date<data$date_admin, NA, date) %>% as_datetime(),
Temperature = ifelse(date<data$date_admin, NA, Temperature)
)
data %>% mutate(nNA = sum(is.na(data$data_sensor[[1]]$date)))
}
data_sensor = data_sensor %>%
group_by(file) %>%
group_modify(~f(.x))
data_sensor$data_sensor
output
data_sensor$data_sensor
[[1]]
# A tibble: 10 x 2
date Temperature
<dttm> <dbl>
1 2020-10-14 00:10:00 0.314
2 2020-10-14 00:20:00 0.359
3 2020-10-14 00:30:00 -0.730
4 2020-10-14 00:40:00 0.0357
5 2020-10-14 00:50:00 0.113
6 2020-10-14 01:00:00 1.43
7 2020-10-14 01:10:00 0.983
8 2020-10-14 01:20:00 -0.622
9 2020-10-14 01:30:00 -0.732
10 2020-10-14 01:40:00 -0.517
[[2]]
# A tibble: 10 x 2
date Temperature
<dttm> <lgl>
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 NA NA
8 NA NA
9 NA NA
10 NA NA
As you can see, everything works great.
Additionally, our f function, apart from the data_sensor mutation, returns the number of NA observations.
# A tibble: 2 x 4
# Groups: file [2]
file date_admin data_sensor nNA
<chr> <chr> <list> <int>
1 file 1 2018-10-07 <tibble [10 x 2]> 0
2 file 2 2019-07-29 <tibble [10 x 2]> 10
So it's time to test it on a bit bigger data. Here I used your date_admin vector and drew 106 tibbles each containing 100000 observations!
date_admin = c(
"2018-10-07", "2018-12-29", "2018-12-13", "2019-08-09", "2019-10-10",
"2019-04-26", "2018-11-21", "2018-08-23", "2019-07-08", "2019-11-19",
"2019-11-07", "2018-09-05", "2018-09-03", "2018-09-24", "2018-10-11",
"2018-09-25", "2019-03-29", "2018-08-20", "2018-09-17", "2019-03-30",
"2018-11-07", "2019-01-01", "2018-08-31", "2019-03-27", "2019-11-10",
"2019-04-04", "2019-10-18", "2018-09-06", "2018-09-23", "2018-09-22",
"2019-07-22", "2018-09-04", "2019-05-17", "2018-11-05", "2018-12-09",
"2018-09-03", "2019-05-21", "2019-02-22", "2018-08-30", "2019-06-04",
"2018-09-13", "2018-10-14", "2019-11-08", "2018-08-30", "2019-04-12",
"2018-09-24", "2018-08-22", "2018-08-30", "2018-09-07", "2018-11-11",
"2018-11-01", "2018-10-01", "2018-10-22", "2018-12-03", "2019-06-06",
"2018-09-09", "2018-09-10", "2018-09-24", "2018-10-11", "2018-11-30",
"2018-09-20", "2019-11-20", "2018-10-11", "2018-10-09", "2018-09-27",
"2019-11-11", "2018-10-04", "2018-09-14", "2019-04-27", "2018-09-04",
"2018-09-11", "2018-08-14", "2018-09-01", "2018-10-01", "2018-09-25",
"2018-09-28", "2018-09-29", "2018-10-11", "2019-03-26", "2018-10-26",
"2018-11-21", "2018-12-02", "2018-09-08", "2019-01-08", "2018-11-07",
"2019-02-05", "2019-01-21", "2018-09-11", "2018-12-17", "2019-01-15",
"2018-08-28", "2019-01-08", "2019-05-14", "2019-01-21", "2018-11-12",
"2018-10-26", "2019-12-26", "2020-01-03", "2020-01-06", "2020-02-26",
"2020-02-14", "2020-01-27", "2020-01-21", "2020-03-16", "2020-02-26",
"2019-12-31")
nDateSensor = 100000
set.seed(1234)
data_sensor =
tibble(
file = paste("file",1:length(date_admin)),
date_admin = date_admin
) %>%
mutate(data_sensor = map(file, ~fDateSensor(nDateSensor)))
output
data_sensor
# A tibble: 106 x 3
file date_admin data_sensor
<chr> <chr> <list>
1 file 1 2018-10-07 <tibble [100,000 x 2]>
2 file 2 2018-12-29 <tibble [100,000 x 2]>
3 file 3 2018-12-13 <tibble [100,000 x 2]>
4 file 4 2019-08-09 <tibble [100,000 x 2]>
5 file 5 2019-10-10 <tibble [100,000 x 2]>
6 file 6 2019-04-26 <tibble [100,000 x 2]>
7 file 7 2018-11-21 <tibble [100,000 x 2]>
8 file 8 2018-08-23 <tibble [100,000 x 2]>
9 file 9 2019-07-08 <tibble [100,000 x 2]>
10 file 10 2019-11-19 <tibble [100,000 x 2]>
# ... with 96 more rows
Time for a mutation. We will immediately measure how long it will take.
start_time =Sys.time()
data_sensor = data_sensor %>%
group_by(file) %>%
group_modify(~f(.x))
Sys.time()-start_time
For me it took 2.3 seconds. I don't know if you expected such time but it seems to be quite a good result.
Let's see what our data_sensor looks like.
# A tibble: 106 x 4
# Groups: file [106]
file date_admin data_sensor nNA
<chr> <chr> <list> <int>
1 file 1 2018-10-07 <tibble [100,000 x 2]> 0
2 file 10 2019-11-19 <tibble [100,000 x 2]> 19001
3 file 100 2020-02-26 <tibble [100,000 x 2]> 95897
4 file 101 2020-02-14 <tibble [100,000 x 2]> 7769
5 file 102 2020-01-27 <tibble [100,000 x 2]> 99497
6 file 103 2020-01-21 <tibble [100,000 x 2]> 0
7 file 104 2020-03-16 <tibble [100,000 x 2]> 50969
8 file 105 2020-02-26 <tibble [100,000 x 2]> 0
9 file 106 2019-12-31 <tibble [100,000 x 2]> 13673
10 file 11 2019-11-07 <tibble [100,000 x 2]> 16697
# ... with 96 more rows
As you can see, some of the data has been changed to NA. So everything worked fine.
All you have to do is read the xls file names into data_sensor and then usinggroup_by (file)and group_modify to load the data into the variable data_sensor. Good luck!
