Combining huge data sets in R

I am new to working with huge data sets in R, so I appreciate any help.

I have 20 years of data, with three .csv files for each year; each file is about 3 GB. I used list.files() to store the file names in a vector, and I know that in theory I need to loop over this vector to read the files, generate integrals one by one, and combine the files. But I don't know how to loop over the files and combine my data sets or run different models. I constantly get the error "Error: cannot allocate vector of size 3.4 Gb" or "Memory is exhausted, reached the limit".

I would be thankful if anybody could guide me.

Best regards,
Sara

CodePudding user response：

You should make sure that your computer has enough RAM and storage to work with this amount of data. fread() from the data.table package is a very useful function for reading large files quickly. Try it: https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/fread.
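
As a rough sketch of how fread() can help with memory: its `nrows` argument lets you peek at a file, and `select` loads only the columns you name. The small temporary file and the column names (`Product`, `Year`, `last_bill`, `unused_text`) are made up for illustration and stand in for one of the real 3 GB files.

```r
library(data.table)

# Hypothetical small file standing in for one of the 3 GB CSVs
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(Product = c("a", "b", "a"),
                  Year = c(2001L, 2001L, 2002L),
                  last_bill = c(10, 20, 30),
                  unused_text = c("x", "y", "z")), tmp)

# Peek at the header and the first rows without loading the whole file
preview <- fread(tmp, nrows = 2)
print(names(preview))

# Load only the columns the analysis needs (assumed names)
dt <- fread(tmp, select = c("Product", "Year", "last_bill"))
print(dim(dt))
```

Dropping unused columns at read time often cuts memory use considerably, but whether it is enough depends on how many columns you really need.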

If that doesn't work, try Spark. It's very fast, convenient and simple. You don't need to know it in depth; just check this sparklyr cheat sheet (https://ugoproto.github.io/ugo_r_doc/pdf/sparklyr.pdf). Best of luck!

CodePudding user response：

I'd suggest doing it with data.table. It is faster and more memory-efficient than most other packages.

In your case you could try something like:

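library(data.table)

# full.names = TRUE returns complete paths, so fread() can find the files
# even when "your_path" is not the working directory
files = list.files("your_path", pattern = "\\.csv$", full.names = TRUE)

df_final = data.table()

for (file in files) {
  df_temp = fread(file)
  ### do all the mathematics you need, here just an example
  # adds a per-group total as a new column; every row is kept
  df_temp[, Sum := sum(last_bill, na.rm = TRUE), by = c("Product", "Year")]
  df_final = rbind(df_final, df_temp)
}

rm(df_temp)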
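
Note that the loop above still keeps every row of every file in df_final, so with 60 files of about 3 GB each it may hit the same memory limit. If all you need in the end is a summary, one option is to reduce each file before combining. The following is only a sketch under that assumption: it swaps the `:=` column for an aggregation, the demo files and the helper `summarise_file()` are hypothetical, and the column names are taken from the example above.

```r
library(data.table)

# Hypothetical demo files standing in for the yearly CSVs
dir_path <- tempfile("bills_demo")
dir.create(dir_path)
for (yr in 2001:2003) {
  fwrite(data.table(Product = c("a", "b", "a"),
                    Year = yr,
                    last_bill = c(10, NA, 5)),
         file.path(dir_path, paste0("bills_", yr, ".csv")))
}

files <- list.files(dir_path, pattern = "\\.csv$", full.names = TRUE)

# Reduce one file to a single row per Product/Year before keeping it
summarise_file <- function(path) {
  dt <- fread(path, select = c("Product", "Year", "last_bill"))
  dt[, .(Sum = sum(last_bill, na.rm = TRUE)), by = .(Product, Year)]
}

# rbindlist() binds the whole list in one step instead of growing a table
partial <- rbindlist(lapply(files, summarise_file))

# Re-aggregate in case one Product/Year pair is spread over several files
result <- partial[, .(Sum = sum(Sum)), by = .(Product, Year)]
print(result)
```

Only one raw file is in memory at a time, and what accumulates is the small per-file summary. This works for sums, counts and similar quantities that can be combined across files; models that need all rows at once would need a different approach.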