I have two files, each with 50 million rows and about 1.75 GB in size. I cannot load them in Google Colab or on my computer to run a Python script that finds the set difference (A - B): both the Colab notebook and my computer crash when I try to load the data.
How can I extract the required information?
CodePudding user response:
If you cannot load the files into memory, you can iterate over file B, compute a hash of each line, and store the hashes in a Python set. Then iterate over the lines of file A, hashing them the same way, and keep only the lines whose hash is not in the set. It will be slow, but it should work (as long as the file is not a single 3 GB line).
import hashlib

b_hashes = set()
with open('fileB', 'rb') as fb:
    for line in fb:
        b_hashes.add(line)  # if lines are short (<32 chars)
        # b_hashes.add(hashlib.md5(line).hexdigest())  # otherwise

with open('final_file.txt', 'wb') as f:
    with open('fileA', 'rb') as fa:
        for line in fa:
            if line not in b_hashes:  # if lines are short
            # if hashlib.md5(line).hexdigest() not in b_hashes:  # otherwise
                f.write(line)
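The same idea can be wrapped in a function that always stores a fixed-size digest per line, so the size of each set entry does not depend on line length. This is only a sketch: the names `line_digest` and `write_difference`, the 16-byte `blake2b` digest size, and stripping the trailing line ending before hashing are illustrative assumptions, not part of the answer above. As with any hash-based comparison, two different lines could in principle collide, which would wrongly drop a line from the output.

```python
import hashlib


def line_digest(line: bytes) -> bytes:
    # 16 raw bytes per line; digest_size=16 is an illustrative choice.
    # The line ending is stripped so a final line without "\n" still matches.
    return hashlib.blake2b(line.rstrip(b"\r\n"), digest_size=16).digest()


def write_difference(path_a: str, path_b: str, out_path: str) -> int:
    """Write lines of path_a that do not occur in path_b; return how many were written."""
    b_digests = set()
    with open(path_b, "rb") as fb:
        for line in fb:
            b_digests.add(line_digest(line))

    written = 0
    with open(path_a, "rb") as fa, open(out_path, "wb") as out:
        for line in fa:
            if line_digest(line) not in b_digests:
                out.write(line)
                written += 1
    return written
```

A call such as `write_difference('fileA', 'fileB', 'final_file.txt')` would mirror the script above. Duplicate lines in file A are written as many times as they occur.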
CodePudding user response:
Just use a while loop and load the lines one by one:
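file1 = open("file1.csv")
file2 = open("file2.csv")
while True:
    line1 = file1.readline()  # important: readline, not readlines!
    if not line1:  # an empty string means end of file1
        break
    file2.seek(0)  # rewind file2 before each scan
    found = False
    while True:
        line2 = file2.readline()
        if not line2:  # end of file2
            break
        if line1 == line2:  # compare here
            found = True
            break
    if not found:
        print(line1, end="")  # line1 is in file1 but not in file2
file1.close()
file2.close()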
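The nested loop above rereads the second file once for every line of the first, so with 50 million lines in each file it makes on the order of 50 million passes. One way to cut the number of passes while still bounding memory is to process the first file in chunks: hold a chunk of its lines in a set, scan the second file once to discard the lines that appear there, and write out what remains. The sketch below is an illustration under assumptions rather than part of the answer; the function name `chunked_difference`, the `chunk_size` default, and the output path are hypothetical.

```python
from itertools import islice


def chunked_difference(path_a, path_b, out_path, chunk_size=1_000_000):
    """Write lines of path_a absent from path_b, holding at most chunk_size lines in memory."""
    with open(path_a) as fa, open(out_path, "w") as out:
        while True:
            chunk = list(islice(fa, chunk_size))  # next chunk_size lines of A
            if not chunk:
                break
            remaining = set(chunk)
            with open(path_b) as fb:  # one full pass over B per chunk of A
                for line in fb:
                    remaining.discard(line)
                    if not remaining:  # every line in this chunk was found in B
                        break
            for line in chunk:  # preserve A's original order
                if line in remaining:
                    out.write(line)
```

With `chunk_size=1_000_000` and 50 million lines, this makes about 50 passes over the second file instead of one per line. The chunk size would need tuning to the memory actually available.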
