I have 90 .txt files, each containing a single column of words. I want to find the words that occur in files 1-30 but not in files 31-90.
The files are named 1.txt, 2.txt, and so on.
Is there a quick way to do this with awk, python or bash?
CodePudding user response:
A one-liner using bash and the shell utilities sort and comm:
comm -2 -3 <(sort {1..30}.txt) <(sort {31..90}.txt)
CodePudding user response:
You can use Python's set arithmetic for this task, as follows:
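def file_to_set(fname):
    # Read a file and return its lines, stripped of surrounding whitespace, as a set
    with open(fname, "r") as f:
        return set(i.strip() for i in f)

words = file_to_set("1.txt")
# Keep only the words common to 1.txt through 30.txt
for i in range(2, 31):
    words = words.intersection(file_to_set(str(i) + ".txt"))
# Remove every word that appears in any of 31.txt through 90.txt
for i in range(31, 91):
    words = words.difference(file_to_set(str(i) + ".txt"))
print(words)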
Explanation: file_to_set reads a file, strips leading and trailing whitespace from each line, and converts the lines into a set. words starts as the set built from 1.txt. Then, for 2.txt through 30.txt (range includes its start and excludes its end), I intersect words with the set from the current file and store the result back in words. Next, for 31.txt through 90.txt, I remove from words every element that is present in those files. Finally, I print words.
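
Note that the two answers read the question differently: the comm one-liner keeps words found in any of files 1-30, while the set approach keeps only words found in all of them. Below is a minimal sketch that covers both readings. The names words_only_in and require_all are hypothetical, chosen for illustration, and skipping blank lines is an assumption about the input rather than something the question states.

```python
def file_to_set(path):
    """Return the set of non-empty, whitespace-stripped lines in a file."""
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def words_only_in(first_files, other_files, require_all=True):
    """Words from first_files that appear in none of other_files.

    require_all=True  -> a word must occur in every file of first_files
                         (intersection, as in the set-based answer).
    require_all=False -> a word may occur in any file of first_files
                         (union, as in the comm one-liner).
    """
    sets = [file_to_set(p) for p in first_files]
    if not sets:
        return set()
    if require_all:
        words = set.intersection(*sets)
    else:
        words = set.union(*sets)
    # Drop every word that shows up in the second group of files.
    for p in other_files:
        words -= file_to_set(p)
    return words
```

With the file names from the question, a call might look like words_only_in([f"{i}.txt" for i in range(1, 31)], [f"{i}.txt" for i in range(31, 91)]), adding require_all=False to get the same reading as the comm version.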
