Finding the set difference (A-B) between two 1.75 GB CSV files containing 50 million rows

Time:01-29

I have two files with 50 million rows each, and each is 1.75 GB in size. I cannot load them into Google Colab or onto my computer to run a Python script that finds the set difference (A-B): both my computer and the Colab notebook crash when I try to load the data.

How do I proceed further to extract the required information?

CodePudding user response:

If you cannot load the files into memory, you can iterate over file B, compute a hash of each line, and store the hashes in a Python set. Then iterate over the lines of file A, hash each one the same way, and keep only the lines whose hash is not in the set. It will run slowly, but it should run (as long as the file is not a single 3 GB line).

import hashlib

# Pass 1: remember every line of file B.
b_hashes = set()
with open('fileB', 'rb') as fb:
    for line in fb:
        b_hashes.add(line)  # if lines are short (< 32 characters)
        # b_hashes.add(hashlib.md5(line).hexdigest())  # otherwise: store the 32-character digest instead

# Pass 2: write out the lines of file A that were not seen in file B.
with open('final_file.txt', 'wb') as f:
    with open('fileA', 'rb') as fa:
        for line in fa:
            if line not in b_hashes:  # if lines are short
            # if hashlib.md5(line).hexdigest() not in b_hashes:  # otherwise
                f.write(line)
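
The two passes above can also be wrapped in a function, which makes them easy to try on small files first. This is only a sketch under a few assumptions: the names `line_key` and `write_difference` are made up for illustration, line endings are stripped so that a final row without a newline still matches, and the raw 16-byte MD5 digest is stored instead of the hex string. Memory use still grows with the number of distinct rows in B, and a hash collision could in principle drop a row.

```python
import hashlib


def line_key(line: bytes) -> bytes:
    # Strip the line ending so "x\n", "x\r\n" and a final "x" compare equal,
    # then reduce the row to a fixed-size 16-byte digest.
    return hashlib.md5(line.rstrip(b"\r\n")).digest()


def write_difference(path_a: str, path_b: str, out_path: str) -> int:
    """Write the rows of A that do not occur in B; return how many were written."""
    # Pass 1: collect one digest per row of B.
    b_keys = set()
    with open(path_b, "rb") as fb:
        for line in fb:
            b_keys.add(line_key(line))

    # Pass 2: stream A and keep the rows whose digest was never seen in B.
    written = 0
    with open(path_a, "rb") as fa, open(out_path, "wb") as out:
        for line in fa:
            if line_key(line) not in b_keys:
                out.write(line.rstrip(b"\r\n") + b"\n")
                written += 1
    return written
```

Calling `write_difference('fileA', 'fileB', 'final_file.txt')` would mirror the script above. Rows that appear several times in A and never in B are written several times; deduplicate afterwards if a strict set is needed.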

CodePudding user response:

Just use a while loop and load the lines one by one:

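file1 = open("file1.csv")
file2 = open("file2.csv")
out = open("difference.csv", "w")  # output file name is an assumption
while True:
    line1 = file1.readline()  # important: readline, not readlines!
    if not line1:  # an empty string means the end of file1
        break
    file2.seek(0)  # rewind file2 before every scan
    found = False
    while not found:
        line2 = file2.readline()
        if not line2:  # end of file2
            break
        found = line1 == line2  # compare here
    if not found:
        out.write(line1)  # line1 is in file1 but not in file2
file1.close()
file2.close()
out.close()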
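
One caveat: this approach rescans file2 for every line of file1, so the work grows with the product of the two row counts. A rough back-of-the-envelope estimate for the files in the question, assuming a hypothetical rate of 10 million line comparisons per second (an illustrative guess, not a measurement):

```python
rows_a = 50_000_000  # rows in file1, from the question
rows_b = 50_000_000  # rows in file2, from the question

# Worst case: a line of file1 that is absent from file2 forces a full scan of file2.
worst_case_comparisons = rows_a * rows_b

assumed_rate = 10_000_000  # comparisons per second; hypothetical, not measured
seconds = worst_case_comparisons / assumed_rate
years = seconds / (365 * 24 * 3600)

print(f"{worst_case_comparisons:.1e} comparisons, roughly {years:.1f} years")
```

Under that assumption the worst case is on the order of years, so the nested loop is mainly useful when memory is extremely tight or the files are much smaller.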