Comparing two files with multiple columns

I need to compare the first column of File1 with the first column of File2. If they match, compare the second columns of the two files. If the second columns do not match, print the line from File1 and save the output to another file.

files1.txt

80002288    b17
97380002001 b18
97380002220 b17
97380002233 b18
80002333    b17
16501111    b04
16505044    b04
16505042    b04
97316505030  b05
16505043    b04
16505048    b04

files2.txt

97366630003 a01
97380002288 b17
97380002001 b17
97380002220 b17
97380002233 b17
97380002333 b17
97316501111 b04
97316505044 b04
97316505042 b04
97316505030 b04
97316505043 b04

Desired Output

97380002001 b17
97316505030 b04

CodePudding user response：

Approach 1: without any external library

The code below produces the output using only the Python standard library:

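# Write every line of files2.txt whose first column also appears in
# files1.txt but whose second column differs.
with open('files3.txt', 'w') as files3:
    with open('files1.txt') as files1:
        for line_a in files1:
            words_a = line_a.split()
            # files2.txt is re-read for every line of files1.txt
            with open('files2.txt') as files2:
                for line_b in files2:
                    words_b = line_b.split()
                    if words_a[0] == words_b[0] and words_a[1] != words_b[1]:
                        # words_b holds the file 2 values; join words_a instead for file 1
                        diff_words = ' '.join(words_b)
                        files3.write(diff_words + '\n')
                        print(diff_words)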

Output of the above code:

97380002001 b17
97380002233 b17
97316505030 b04
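
The nested loops above re-read files2.txt once per line of files1.txt. A minimal sketch of a single-pass alternative follows, assuming first-column values are unique within the reference file (if a key repeats, the last occurrence wins). The function names `mismatched_lines` and `save_mismatches` are hypothetical, introduced only for this illustration.

```python
def mismatched_lines(reference_path, target_path):
    """Return lines of target_path whose first column also appears in
    reference_path but whose second column differs."""
    # Map first column -> second column for the reference file.
    lookup = {}
    with open(reference_path) as reference:
        for line in reference:
            fields = line.split()
            if len(fields) >= 2:  # skip blank or short lines
                lookup[fields[0]] = fields[1]

    result = []
    with open(target_path) as target:
        for line in target:
            fields = line.split()
            if len(fields) < 2:
                continue
            key, value = fields[0], fields[1]
            if key in lookup and lookup[key] != value:
                result.append(line.rstrip('\n'))  # keep the original spacing
    return result


def save_mismatches(reference_path, target_path, output_path):
    """Print the mismatched lines and write them to output_path."""
    with open(output_path, 'w') as out:
        for line in mismatched_lines(reference_path, target_path):
            out.write(line + '\n')
            print(line)
```

Calling `save_mismatches('files2.txt', 'files1.txt', 'files3.txt')` would save the lines from files1.txt, as the question asks; swapping the first two arguments saves the lines from files2.txt instead.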

Approach 2: using Pandas library

You can also do this with the pandas library. First install pandas:

pip install pandas

Then run the Python code below to create the output file:

import pandas as pd

# you can replace files1.txt and files2.txt with the complete path if files aren't in the same folder
# sep=r'\s+' splits on one or more whitespace characters
df1 = pd.read_csv("files1.txt", sep=r'\s+', names=['c1', 'c2'])
df2 = pd.read_csv("files2.txt", sep=r'\s+', names=['c1', 'c2'])

df3 = pd.merge(df1, df2, on='c1')
df3 = df3[df3["c2_x"] != df3["c2_y"]]

# use below if you want to save values from file 2
print(df3[['c1', 'c2_y']].to_string(index=False, header=False))
df3[['c1', 'c2_y']].to_csv("files3.txt", sep=' ', index=False, header=False)

# use below if you want to save values from file 1
# print(df3[['c1', 'c2_x']].to_string(index=False, header=False))
# df3[['c1', 'c2_x']].to_csv("files3.txt", sep=' ', index=False, header=False)

# use below code to save values from both files
# print(df3.to_string(index=False, header=False))
# df3.to_csv("files3.txt", sep=' ', index=False, header=False)

Output of the above code:

97380002001 b17
97380002233 b17
97316505030 b04

CodePudding user response：

Either of these is probably what you want, but your posted expected output doesn't match either interpretation of your requirements. Using any awk in any shell on every Unix box:

To print the lines from file1:

$ awk 'NR==FNR{a[$1]=$2; next} ($1 in a) && (a[$1] != $2)' file2 file1
97380002001 b18
97380002233 b18
97316505030  b05

To print the lines from file2 just swap the input file names:

$ awk 'NR==FNR{a[$1]=$2; next} ($1 in a) && (a[$1] != $2)' file1 file2
97380002001 b17
97380002233 b17
97316505030 b04
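
To see where the posted expected output departs from the stated requirements, the sample data can be checked directly. The sketch below is only an illustration: it inlines the two sample files as strings (the names `FILE1`, `FILE2` and `parse` are hypothetical) and lists the keys present in both files whose second columns differ.

```python
FILE1 = """80002288    b17
97380002001 b18
97380002220 b17
97380002233 b18
80002333    b17
16501111    b04
16505044    b04
16505042    b04
97316505030  b05
16505043    b04
16505048    b04"""

FILE2 = """97366630003 a01
97380002288 b17
97380002001 b17
97380002220 b17
97380002233 b17
97380002333 b17
97316501111 b04
97316505044 b04
97316505042 b04
97316505030 b04
97316505043 b04"""


def parse(text):
    """Map first column -> second column for each non-blank line."""
    return dict(line.split() for line in text.splitlines() if line.strip())


cols1 = parse(FILE1)
cols2 = parse(FILE2)

# Keys present in both files whose second columns differ, in file 1 order.
mismatched = [key for key in cols1 if key in cols2 and cols1[key] != cols2[key]]
print(mismatched)  # ['97380002001', '97380002233', '97316505030']

# Keys listed in the question's "Desired Output".
desired = ['97380002001', '97316505030']
print([key for key in mismatched if key not in desired])  # ['97380002233']
```

The key 97380002233 has b18 in the first file and b17 in the second, so it satisfies the stated rule even though the desired output omits it.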