So I've read several threads here and on other websites but I still couldn't figure out how to do it. It looks like a simple script but I've been stuck for a few days now.
Essentially, I'm given two input files (assumed to be CSV files). Each file has 2 columns, id and col_name. The difference between the 2 files is the naming conventions of the col_name. One file is lowercase, and the other is camelCase.
I would like to know how to match the two files by col_name and then create an output file with 4 columns, id, col_name, col_name, id.
Input1.csv
1 _id
2 rawrequest
3 rawresponse
4 products
5 _deleted
6 enterpriseid
7 source
8 transactionuid
9 type
10 isotransactiontype
11 status
12 terminalid
13 merchantid
14 merchantname
15 settlementbatchid
16 errordescription
17 referencetransactions
18 createdat
19 updatedat
20 __v
Input2.csv
101 _id
102 rawRequest
103 rawResponse
104 products
105 _deleted
106 enterpriseid
107 source
108 transactionUid
109 type
110 isoTransactionType
111 status
112 terminalId
113 merchantId
114 merchantName
115 settlementBatchId
116 errorDescription
117 referenceTransactions
118 createdAt
119 updatedAt
120 __v
Desired output:
1 _id _id 101
2 rawrequest rawRequest 102
3 rawresponse rawResponse 103
4 products products 104
5 _deleted _deleted 105
6 enterpriseid enterpriseid 106
7 source source 107
8 transactionuid transactionUid 108
9 type type 109
10 isotransactiontype isoTransactionType 110
11 status status 111
12 terminalid terminalId 112
13 merchantid merchantId 113
14 merchantname merchantName 114
15 settlementbatchid settlementBatchId 115
16 errordescription errorDescription 116
17 referencetransactions referenceTransactions 117
18 createdat createdAt 118
19 updatedat updatedAt 119
20 __v __v 120
The code I have tried to write:
import pandas as pd
csv1 = pd.read_csv("Input1.csv")
csv2 = pd.read_csv("Input2.csv")
# Method 1
merge_data = csv1.merge(csv2, on = 'col_name')
merge_data.to_csv("output.csv", index = False)
# Method 2
merge = pd.merge(csv1, csv2, how="outer")
merge.to_csv("output1.csv", index = False)
# Method 3
import csv
with open('Input1.csv', 'r') as csv_file1:
csv_reader = csv.reader(csv_file1)
with open('output2.csv', 'w') as new_file:
csv_writer = csv.writer(new_file)
for line in csv_reader:
csv_writer.writerow(line)
with open('Input2.csv', 'r') as csv_file2:
csv_reader2 = csv.reader(csv_file2)
with open('output2.csv', 'a') as new_file:
csv_writer = csv.writer(new_file)
for line in csv_reader2:
csv_writer.writerow(line)
Output (from the code):
output.csv
id_x,col_name,id_y
1,_id,101
4,products,104
5,_deleted,105
7,source,107
9,type,109
11,status,111
20,__v,120
25,subtype,125
29,amount,129
output1.csv
id,col_name
1,_id
2,rawrequest
3,rawresponse
4,products
5,_deleted
6,enterpriseid
7,source
8,transactionuid
9,type
10,isotransactiontype
11,status
12,terminalid
13,merchantid
14,merchantname
15,settlementbatchid
16,errordescription
17,referencetransactions
18,createdat
19,updatedat
20,__v
101,_id
102,rawRequest
103,rawResponse
104,products
105,_deleted
106,enterpriseId
107,source
108,transactionUid
109,type
110,isoTransactionType
111,status
112,terminalId
113,merchantId
114,merchantName
115,settlementBatchId
116,errorDescription
117,referenceTransactions
118,createdAt
119,updatedAt
120,__v
output2.csv
id,col_name
1,_id
2,rawrequest
3,rawresponse
4,products
5,_deleted
6,enterpriseid
7,source
8,transactionuid
9,type
10,isotransactiontype
11,status
12,terminalid
13,merchantid
14,merchantname
15,settlementbatchid
16,errordescription
17,referencetransactions
18,createdat
19,updatedat
20,__v
id,col_name
101,_id
102,rawRequest
103,rawResponse
104,products
105,_deleted
106,enterpriseId
107,source
108,transactionUid
109,type
110,isoTransactionType
111,status
112,terminalId
113,merchantId
114,merchantName
115,settlementBatchId
116,errorDescription
117,referenceTransactions
118,createdAt
119,updatedAt
120,__v
CodePudding user response:
I don't see anywhere in your code where you convert the col_names into lowercase but you'll definitely need to do that to get the match as e.g. rawrequest is not equal to rawRequest.
I got your desired output as follows.
First, since you mentioned they are csvs with column names id and col_name I'm assuming your input files actually look like:
Input1.csv:
id,col_name
1,_id
2,rawrequest
3,rawresponse
4,products
5,_deleted
...
Input2.csv:
id,col_name
101,_id
102,rawRequest
103,rawResponse
104,products
105,_deleted
...
With these files saved I then did:
import pandas as pd
csv1 = pd.read_csv("Input1.csv")
csv2 = pd.read_csv("Input2.csv")
# Method 1
print(csv2.head())
# Make a copy of csv2 col_name
csv2['col_name_original'] = csv2['col_name']
# Convert csv2 col_name to lowercase
csv2['col_name'] = csv2['col_name'].apply(lambda x: x.lower())
# Reorder csv2 columns
csv2 = csv2[['col_name', 'col_name_original', 'id']]
# Merge on col_name
merge_data = csv1.merge(csv2, on='col_name')
# Rename columns of resulting datasheet
merge_data.columns = ['id_', 'col_name', 'col_name', 'id_']
# Save merged data
merge_data.to_csv("output.csv", index=False)
Output:
id_,col_name,col_name,id_
1,_id,_id,101
2,rawrequest,rawRequest,102
3,rawresponse,rawResponse,103
4,products,products,104
5,_deleted,_deleted,105
Explanation of the parts you asked about:
Lambda functions
Lambda is basically a shorthand way of defining a function, so whereas the usual way in Python to define a function to convert to lowercase would be:
def convert_to_lowercase(x):
return x.lower()
we can instead just write lambda x: x.lower() and that defines a function there and then and passes it straight into the apply method. Lambda expressions exist in other programming languages like JavaScript too.
https://www.w3schools.com/python/python_lambda.asp
Apply
When you call this method on a DataFrame column with a function as the argument, it returns the result of applying that function to each element in the column (e.g. in this case it applies the function lambda x: x.lower() to each value in the column and overwrites the column col_name with the result row-by-row). Equivalent to a map if you are familiar with that concept.
