Combining huge data sets in R

Time:01-18

I am new to working with huge data sets in R, so I would appreciate any help. I have 20 years of data, with three .csv files per year, and each file is about 3 GB. I used list.files() to store the file names in a vector, and I understand that in theory I need to loop over this vector to read the files, generate integrals one by one, and combine the files. But I don't know how to loop over the files and combine my data sets, or how to run different models. I constantly get errors such as "Error: cannot allocate vector of size 3.4 Gb" or "Memory is exhausted, reached the limit". I would be thankful if anybody could guide me. Best regards, Sara

CodePudding user response:

First, make sure that your computer has enough RAM and disk storage to work with this amount of data. fread() from the data.table package is a very fast function for reading large delimited files. Try it: https://www.rdocumentation.org/packages/data.table/versions/1.14.2/topics/fread
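One way fread() can help with memory, not only speed, is its select argument, which loads only the columns you name. Here is a minimal sketch of that idea. The file is a tiny temporary CSV standing in for one of the real 3 GB files, and the column names (Product, Year, last_bill, notes) are made up for illustration.

```r
library(data.table)

# Hypothetical stand-in for one of the large yearly files
csv_path <- tempfile(fileext = ".csv")
fwrite(
  data.table(
    Product   = c("A", "B", "A"),
    Year      = c(2001L, 2001L, 2002L),
    last_bill = c(10.5, 20, 7.25),
    notes     = c("x", "y", "z")   # a column the analysis does not need
  ),
  csv_path
)

# Read only the columns that are needed; unlisted columns are not loaded
dt <- fread(csv_path, select = c("Product", "Year", "last_bill"))

print(names(dt))
print(object.size(dt), units = "Kb")
```

On a real file, dropping wide text columns you never use can shrink the in-memory table considerably, but how much depends entirely on your data.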

If that doesn't work, then try Spark through the sparklyr package. It's very fast, convenient and simple. You don't need to know Spark beforehand, just check this cheat sheet (https://ugoproto.github.io/ugo_r_doc/pdf/sparklyr.pdf). Best of luck!

CodePudding user response:

I'd suggest doing it with data.table. It is faster and more memory-efficient than most other packages.

In your case you could try something like:

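library(data.table)

# full.names = TRUE returns complete paths, so fread() works from any working directory
files = list.files("your_path", pattern = "\\.csv$", full.names = TRUE)

results = vector("list", length(files))

for (i in seq_along(files)) {
  df_temp = fread(files[i])
  ### do all the mathematics you need, here just an example
  df_temp[, Sum := sum(last_bill, na.rm = TRUE), by = c("Product", "Year")]
  results[[i]] = df_temp
}

# bind once at the end; rbind() inside the loop copies the growing table on every pass
df_final = rbindlist(results)

rm(df_temp, results)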
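Keep in mind that with around 60 files of about 3 GB each, keeping every row from every file will still exhaust memory on most machines. If what you need in the end is a per-group summary, one option is to aggregate each file right after reading it and keep only the small result. The following is a rough sketch under that assumption. The three tiny CSV files are generated on the fly as placeholders, and Product, Year and last_bill are the same example column names as above, not your real ones.

```r
library(data.table)

# Hypothetical stand-ins for the yearly files: three tiny CSVs in a temp folder
data_dir <- file.path(tempdir(), "yearly_csv_demo")
dir.create(data_dir, showWarnings = FALSE)
for (yr in 2001:2003) {
  fwrite(
    data.table(Product = c("A", "B", "A"), Year = yr, last_bill = c(1, 2, NA)),
    file.path(data_dir, paste0("bills_", yr, ".csv"))
  )
}

files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)

# Reduce each file to one row per group, so the raw rows can be discarded
summarise_file <- function(path) {
  dt <- fread(path, select = c("Product", "Year", "last_bill"))
  dt[, .(Sum = sum(last_bill, na.rm = TRUE), N = .N), by = .(Product, Year)]
}

partial <- rbindlist(lapply(files, summarise_file))

# A group can span several files, so aggregate the partial results once more
final <- partial[, .(Sum = sum(Sum), N = sum(N)), by = .(Product, Year)]
print(final)
```

Sums and counts can be combined across files like this, and a mean can be recovered as Sum / N. Statistics such as medians cannot be built from per-file pieces this way, so they would need a different approach.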