Finding the set difference (A-B) between two 1.75 GB CSV files containing 50 million rows

I have two files of 50 million rows each, about 1.75 GB apiece. I cannot load them into Google Colab or onto my own computer to run a Python script that finds the set difference (A-B): both my computer and the Colab notebook crash when I try to load the data.

How do I proceed further to extract the required information?

CodePudding user response：

If you cannot load the files into memory, you can iterate over file B, compute a hash of each line, and store the hashes in a Python set. Then iterate over the lines of file A, hashing each one the same way, and keep only the lines whose hash is not in the set. It will be slow, but it should run (as long as the file is not a single 3 GB line).

import hashlib

def key(line):
    # Drop the line ending so a final line without a newline still matches.
    line = line.rstrip(b'\r\n')
    return line  # fine when lines are short (under ~32 bytes)
    # return hashlib.md5(line).digest()  # otherwise: a fixed 16 bytes per line

# Pass 1: remember a key for every line of B.
b_hashes = set()
with open('fileB', 'rb') as fb:
    for line in fb:
        b_hashes.add(key(line))

# Pass 2: copy the lines of A whose key never appeared in B.
with open('final_file.txt', 'wb') as f, open('fileA', 'rb') as fa:
    for line in fa:
        if key(line) not in b_hashes:
            f.write(line)
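
If the set built from file B is itself too large for memory, one possible variation is to split both files into buckets by line hash first, so that only one bucket of B is held in a set at a time. The following is a minimal sketch under that assumption, not a tuned implementation: the names `difference_by_buckets`, `_split`, `_bucket_of` and the `n_buckets` parameter are made up for illustration, and it assumes every CSV record sits on a single line. It needs temporary disk space roughly equal to the size of both inputs, and the output is grouped by bucket rather than kept in the original order of A.

```python
import hashlib
import os
import tempfile


def _bucket_of(line, n_buckets):
    # A stable hash, so identical lines from A and B land in the same bucket.
    digest = hashlib.md5(line).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets


def _split(path, tmp_dir, prefix, n_buckets):
    # Stream the file once, appending each line to its bucket file.
    paths = [os.path.join(tmp_dir, f"{prefix}_{i}") for i in range(n_buckets)]
    outs = [open(p, "wb") for p in paths]
    try:
        with open(path, "rb") as src:
            for line in src:
                line = line.rstrip(b"\r\n")  # normalise line endings
                outs[_bucket_of(line, n_buckets)].write(line + b"\n")
    finally:
        for out in outs:
            out.close()
    return paths


def difference_by_buckets(path_a, path_b, out_path, n_buckets=64):
    # Write every line of A that does not occur in B to out_path.
    with tempfile.TemporaryDirectory() as tmp_dir:
        a_parts = _split(path_a, tmp_dir, "a", n_buckets)
        b_parts = _split(path_b, tmp_dir, "b", n_buckets)
        with open(out_path, "wb") as out:
            for a_part, b_part in zip(a_parts, b_parts):
                with open(b_part, "rb") as fb:
                    seen = set(fb)  # only this one bucket of B is in memory
                with open(a_part, "rb") as fa:
                    for line in fa:
                        if line not in seen:
                            out.write(line)
```

A call such as `difference_by_buckets('fileA', 'fileB', 'final_file.txt')` would then do the same job as the two loops above. Raising `n_buckets` lowers peak memory at the cost of more open temporary files.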

CodePudding user response：

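Just use a while loop and read the lines one at a time, rescanning the second file for every line of the first. Memory use stays constant, but with 50 million rows in each file this means on the order of 50 million x 50 million comparisons, so expect it to be extremely slow: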

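# "diff.csv" is a placeholder output name; pick whatever suits you.
with open("file1.csv") as file1, open("file2.csv") as file2, open("diff.csv", "w") as out:
    while True:
        line1 = file1.readline()  # important: readline, not readlines!
        if not line1:             # an empty string means end of file
            break

        found = False
        file2.seek(0)             # rescan file2 from the start for each line1
        while not found:
            line2 = file2.readline()
            if not line2:
                break
            found = line1 == line2  # compare here
        if not found:
            out.write(line1)      # line1 is in file1 but not in file2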