Home > Mobile >  How to compare 2 CSV files
How to compare 2 CSV files

Time:01-27

I have 2 CSV files:

CSV 1 - original_names.csv

Serial,Names
1,James
2,Stephen
3,Ben
4,Harry
5,Jack
6, Peter

CSV 2 - dup_names.csv

Serial,Names
1,James
2,Kate
3,Ben
4,Sara


Desired Output - new.csv

Serial,Names,flag
1,0,T
2,Kate,F
3,0,T
4,Sara,F
5,Jack,F
6,Peter,F

As you can see, the same names in both CSV will be updated to 0 if names matches to new.csv.

This is what I've tried:

import pandas as pd

df1 = pd.read_csv('original_names.csv')
df2 = pd.read_csv('dup_names.csv')

out = df1.merge(df2['names'], how='inner', on = 'names')

# some code

out.to_csv("new.csv", index=False)


Thank you for your time :)

CodePudding user response:

Do an outer join, then just add some logic here. If the 2 name columns match, put a 'T' flag in, else put 'F'. Then replace the 'names' should be 0 is 'T', else the name in the second csv. If there is no name in the second csv, fill those with the name from the first csv.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'serial':[1,2,3,4,5,6],
                     'names':['James','Stephen','Ben','Harry','Jack','Peter']})

df2 = pd.DataFrame({'serial':[1,2,3,4,],
                     'names':['James','Kate','Ben','Sara']})


out = df1.merge(df2, how='outer', on = ['serial'])

out['flag'] = np.where(out.names_x == out.names_y, 'T', 'F')
out['names'] = np.where(out.flag == 'T', 0, out.names_y)
out['names'] = out['names'].fillna(out.names_x)

out = out[['serial', 'names', 'flag']]
out.to_csv("new.csv", index=False)

Output:

print(out)
   serial  names flag
0       1      0    T
1       2   Kate    F
2       3      0    T
3       4   Sara    F
4       5   Jack    F
5       6  Peter    F

CodePudding user response:

You could use:

import pandas as pd
import numpy as np

df1 = pd.read_csv('original_names.csv')
df2 = pd.read_csv('dup_names.csv')

out = df1.merge(df2, how='left', on = 'Serial')

out['Names'] = np.where(out['Names_x'] == out['Names_y'], 
                        0, out['Names_y'])
out['Names'] = out['Names'].fillna(out['Names_x'])
out['flag'] = np.where(out['Names'] == 0, 'T', 'F')
out = out.drop(['Names_x', 'Names_y'], axis=1)

out.to_csv('new.csv', index=False)

Output:

   serial  names flag
0       1      0    T
1       2   Kate    F
2       3      0    T
3       4   Sara    F
4       5   Jack    F
5       6  Peter    F
  •  Tags:  
  • Related