I am new to working with huge data sets in R, so I would appreciate any help. I have 20 years of data, with three .csv files per year, and each file is about 3 GB. I used list.files() to store the file names in a vector, and I know that, in theory, I need to loop over this vector to read the files, generate integrals one by one, and combine the files. But I don't know how to loop over the files and combine my data sets, or how to run different models on them. I constantly get errors such as "Error: cannot allocate vector of size 3.4 Gb" or "Memory is exhausted, reached the limit". I would be thankful if anybody could guide me. Best regards, Sara
CodePudding user response:
You should make sure that your computer has enough RAM and disk storage to work with this amount of data. fread() from the data.table package is a very useful function for reading large files quickly. Try it: https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/fread
If that doesn't work, then try Spark through the sparklyr package. It's fast and convenient, and you don't need to know Spark in depth to get started; just check this cheat sheet (https://ugoproto.github.io/ugo_r_doc/pdf/sparklyr.pdf). Best of luck!
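As a rough sketch of how fread() can reduce memory use, the example below reads only the columns that are needed. The demo file and the column names (Product, Year, last_bill, notes) are made up for illustration; replace them with your own file and columns.

```r
library(data.table)

# Hypothetical demo file standing in for one of the large CSVs
demo_file <- tempfile(fileext = ".csv")
fwrite(data.table(Product   = c("A", "A", "B"),
                  Year      = c(2001L, 2001L, 2001L),
                  last_bill = c(10, NA, 5),
                  notes     = c("x", "y", "z")),
       demo_file)

# Peek at the column names without loading any rows
names(fread(demo_file, nrows = 0))

# Load only the columns the analysis needs; the others are skipped
dt <- fread(demo_file, select = c("Product", "Year", "last_bill"))
```

Dropping unused columns at read time is often the cheapest way to shrink a 3 GB file to something that fits in RAM.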
CodePudding user response:
I'd suggest doing it with data.table. It is typically faster and more memory-efficient than most other packages.
Here are some links:
- data.table vs data.frame
- benchmark which compares different ways of handling large datasets
- data.table reference/description
In your case you could try something like:
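library(data.table)
# full.names = TRUE returns full paths, so fread() can find the files
files = list.files("your_path", pattern = "\\.csv$", full.names = TRUE)
results = vector("list", length(files))
for (i in seq_along(files)) {
  df_temp = fread(files[i])
  ### do all the mathematics you need, here just an example
  df_temp[, Sum:=sum(last_bill, na.rm=TRUE), by=c("Product", "Year")]
  # store each result in a list instead of growing a table with rbind()
  results[[i]] = df_temp
}
rm(df_temp)
# combine everything once at the end (this still needs RAM for all rows)
df_final = rbindlist(results)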
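The loop above still keeps every row of every file, and with 60 files of about 3 GB each that is unlikely to fit in memory. If the end result is a per-group summary, one option is to reduce each file to its summary right after reading it and combine only the small summaries. The sketch below assumes the same hypothetical columns (Product, Year, last_bill) and uses two tiny demo files in place of the real ones.

```r
library(data.table)

# Hypothetical demo: two small CSVs standing in for the large yearly files
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
fwrite(data.table(Product = c("A", "B"), Year = 2001L, last_bill = c(10, 5)),
       file.path(demo_dir, "part1.csv"))
fwrite(data.table(Product = c("A", "B"), Year = 2001L, last_bill = c(1, NA)),
       file.path(demo_dir, "part2.csv"))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Reduce one file to its per-group sums; the raw rows are no longer
# referenced once the function returns, so R can free them
summarise_file <- function(path) {
  dt <- fread(path, select = c("Product", "Year", "last_bill"))
  dt[, .(Sum = sum(last_bill, na.rm = TRUE)), by = .(Product, Year)]
}
partial <- rbindlist(lapply(files, summarise_file))

# A group can appear in several files, so add the partial sums together
df_final <- partial[, .(Sum = sum(Sum)), by = .(Product, Year)]
```

This works directly for sums and counts. For a mean, keep the sum and the row count per file and divide only after the final step.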
