Matching 2 CSV Files (According to columns)-CodePudding

So I've read several threads here and on other websites but I still couldn't figure out how to do it. It looks like a simple script but I've been stuck for a few days now.

Essentially, I'm given two input files (assumed to be CSV files). Each file has 2 columns, id and col_name. The difference between the 2 files is the naming conventions of the col_name. One file is lowercase, and the other is camelCase.

I would like to know how to match the two files by col_name and then create an output file with 4 columns, id, col_name, col_name, id.

Input1.csv

1   _id
2   rawrequest
3   rawresponse
4   products
5   _deleted
6   enterpriseid
7   source
8   transactionuid
9   type
10  isotransactiontype
11  status
12  terminalid
13  merchantid
14  merchantname
15  settlementbatchid
16  errordescription
17  referencetransactions
18  createdat
19  updatedat
20  __v

Input2.csv

101 _id
102 rawRequest
103 rawResponse
104 products
105 _deleted
106 enterpriseid
107 source
108 transactionUid
109 type
110 isoTransactionType
111 status
112 terminalId
113 merchantId
114 merchantName
115 settlementBatchId
116 errorDescription
117 referenceTransactions
118 createdAt
119 updatedAt
120 __v

Desired output:

1   _id _id 101
2   rawrequest  rawRequest  102
3   rawresponse rawResponse 103
4   products    products    104
5   _deleted    _deleted    105
6   enterpriseid    enterpriseid    106
7   source  source  107
8   transactionuid  transactionUid  108
9   type    type    109
10  isotransactiontype  isoTransactionType  110
11  status  status  111
12  terminalid  terminalId  112
13  merchantid  merchantId  113
14  merchantname    merchantName    114
15  settlementbatchid   settlementBatchId   115
16  errordescription    errorDescription    116
17  referencetransactions   referenceTransactions   117
18  createdat   createdAt   118
19  updatedat   updatedAt   119
20  __v __v 120

The code I have tried to write:

import pandas as pd

csv1 = pd.read_csv("Input1.csv")
csv2 = pd.read_csv("Input2.csv")

# Method 1
merge_data = csv1.merge(csv2, on = 'col_name')
merge_data.to_csv("output.csv", index = False)

# Method 2 
merge = pd.merge(csv1, csv2, how="outer")
merge.to_csv("output1.csv", index = False)

# Method 3
import csv

with open('Input1.csv', 'r') as csv_file1:
    csv_reader = csv.reader(csv_file1)

    with open('output2.csv', 'w') as new_file:
        csv_writer = csv.writer(new_file)

        for line in csv_reader:
            csv_writer.writerow(line)

 with open('Input2.csv', 'r') as csv_file2:
    csv_reader2 = csv.reader(csv_file2)
    
    with open('output2.csv', 'a') as new_file:
        csv_writer = csv.writer(new_file)
    
        for line in csv_reader2:
            csv_writer.writerow(line)

Output (from the code):

output.csv

id_x,col_name,id_y
1,_id,101
4,products,104
5,_deleted,105
7,source,107
9,type,109
11,status,111
20,__v,120
25,subtype,125
29,amount,129

output1.csv

id,col_name
1,_id
2,rawrequest
3,rawresponse
4,products
5,_deleted
6,enterpriseid
7,source
8,transactionuid
9,type
10,isotransactiontype
11,status
12,terminalid
13,merchantid
14,merchantname
15,settlementbatchid
16,errordescription
17,referencetransactions
18,createdat
19,updatedat
20,__v
101,_id
102,rawRequest
103,rawResponse
104,products
105,_deleted
106,enterpriseId
107,source
108,transactionUid
109,type
110,isoTransactionType
111,status
112,terminalId
113,merchantId
114,merchantName
115,settlementBatchId
116,errorDescription
117,referenceTransactions
118,createdAt
119,updatedAt
120,__v

output2.csv

id,col_name

1,_id

2,rawrequest

3,rawresponse

4,products

5,_deleted

6,enterpriseid

7,source

8,transactionuid

9,type

10,isotransactiontype

11,status

12,terminalid

13,merchantid

14,merchantname

15,settlementbatchid

16,errordescription

17,referencetransactions

18,createdat

19,updatedat

20,__v

id,col_name

101,_id

102,rawRequest

103,rawResponse

104,products

105,_deleted

106,enterpriseId

107,source

108,transactionUid

109,type

110,isoTransactionType

111,status

112,terminalId

113,merchantId

114,merchantName

115,settlementBatchId

116,errorDescription

117,referenceTransactions

118,createdAt

119,updatedAt

120,__v

CodePudding user response：

I don't see anywhere in your code where you convert the col_names into lowercase but you'll definitely need to do that to get the match as e.g. rawrequest is not equal to rawRequest.

I got your desired output as follows.

First, since you mentioned they are csvs with column names id and col_name I'm assuming your input files actually look like:

Input1.csv:

id,col_name
1,_id
2,rawrequest
3,rawresponse
4,products
5,_deleted
...

Input2.csv:

id,col_name
101,_id
102,rawRequest
103,rawResponse
104,products
105,_deleted
...

With these files saved I then did:

import pandas as pd

csv1 = pd.read_csv("Input1.csv")
csv2 = pd.read_csv("Input2.csv")

# Method 1
print(csv2.head())
# Make a copy of csv2 col_name
csv2['col_name_original'] = csv2['col_name']
# Convert csv2 col_name to lowercase
csv2['col_name'] = csv2['col_name'].apply(lambda x: x.lower())
# Reorder csv2 columns
csv2 = csv2[['col_name', 'col_name_original', 'id']]
# Merge on col_name
merge_data = csv1.merge(csv2, on='col_name')
# Rename columns of resulting datasheet
merge_data.columns = ['id_', 'col_name', 'col_name', 'id_']
# Save merged data
merge_data.to_csv("output.csv", index=False)

Output:

id_,col_name,col_name,id_
1,_id,_id,101
2,rawrequest,rawRequest,102
3,rawresponse,rawResponse,103
4,products,products,104
5,_deleted,_deleted,105

Explanation of the parts you asked about:

Lambda functions

Lambda is basically a shorthand way of defining a function, so whereas the usual way in Python to define a function to convert to lowercase would be:

def convert_to_lowercase(x):
    return x.lower()

we can instead just write lambda x: x.lower() and that defines a function there and then and passes it straight into the apply method. Lambda expressions exist in other programming languages like JavaScript too.

https://www.w3schools.com/python/python_lambda.asp

Apply

When you call this method on a DataFrame column with a function as the argument, it returns the result of applying that function to each element in the column (e.g. in this case it applies the function lambda x: x.lower() to each value in the column and overwrites the column col_name with the result row-by-row). Equivalent to a map if you are familiar with that concept.

https://www.datacamp.com/community/tutorials/pandas-apply