Hello, I am very new to coding. I am writing a small Python script but I am stuck. The goal is to compare the contents of log.txt with the contents of LargeFile.txt and to store every line of log.txt that does not match any line of LargeFile.txt in outfile.txt. With the code below, however, I only get the first line of log.txt repeated over and over in outfile.txt:
logfile = open('log1.txt', 'r')  # This file is 8 KB
keywordlist = open('LargeFile.txt', 'r')  # This file is 1.4 GB
outfile = open('outfile.txt', 'w')
loglines = [n for n in logfile]
keywords = [n for n in keywordlist]
for line in loglines:
    for word in keywords:
        if line not in word:
            outfile.write(line)
outfile.close()
CodePudding user response:
Conceptually, you're trying to check whether any line of your 1.4 GB file occurs in your 8 KB file.
This means one of the files needs to be loaded into RAM, and the smaller file is the natural choice. The other file can be read sequentially and does not need to be loaded in full.
We need:
- a list of lines from the smaller file
- an index of those lines for quick look-ups (we'll use a dict for this)
- a loop that runs through the large file and checks each line against the index, making note of every matching line it finds
- a loop that outputs the original lines and uses the index to determine whether they are unique or not.
The sample below prints the complete output to the console. Write it to a file as needed.
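# Load the small file into memory. Newlines are stripped so that a final
# line without one still compares equal to lines in the large file.
with open('log1.txt', 'r') as f:
    log_lines = [line.rstrip('\n') for line in f]

# Map each log line to the line numbers where it occurs in the large file.
index = {line: [] for line in log_lines}

# Read the large file sequentially, one line at a time.
with open('LargeFile.txt', 'r') as f:
    for line_num, line in enumerate(f, 1):
        line = line.rstrip('\n')
        if line in index:
            index[line].append(line_num)

# Report every log line in its original order.
for line in log_lines:
    if len(index[line]) == 0:
        print(f'{line} -> unique')
    else:
        print(f'{line} -> found {len(index[line])}x')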
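
If you only need the unmatched lines written to outfile.txt, as in the question, a variation along these lines might do. This is a minimal sketch, not a definitive implementation: the function name `write_unmatched_lines` and its parameters are made up for illustration, it uses a set instead of the dict because the line numbers are not needed, and it assumes a "match" means the whole line is identical (ignoring the trailing newline).

```python
def write_unmatched_lines(log_path, large_path, out_path):
    """Write every line of log_path that never appears in large_path.

    Hypothetical helper: assumes whole-line equality counts as a match.
    Returns the number of distinct unmatched lines.
    """
    # Load the small file; strip newlines so a final line without one
    # still compares equal to lines in the large file.
    with open(log_path, 'r') as f:
        log_lines = [line.rstrip('\n') for line in f]

    # Assume nothing has matched yet, then remove lines as they are seen.
    unmatched = set(log_lines)

    # Stream the large file so it is never held in memory in full.
    with open(large_path, 'r') as f:
        for line in f:
            unmatched.discard(line.rstrip('\n'))
            if not unmatched:
                break  # every log line was found; stop reading early

    # Write the remaining lines in their original order.
    with open(out_path, 'w') as out:
        for line in log_lines:
            if line in unmatched:
                out.write(line + '\n')

    return len(unmatched)
```

With the file names from the question, the call would be `write_unmatched_lines('log1.txt', 'LargeFile.txt', 'outfile.txt')`. The early `break` is optional; it just avoids reading the rest of the large file once every log line has been found.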
