Comparing data in two CSVs and creating a third CSV for matching data-CodePudding

Let's say I have file1.csv:

OU                                   Mailfile   
CORP:Jenny Smith:                    mail/246802.nsf
"CORP:John Smith:,John Smith:"       mail/123456.nsf
STORE:Mary Poppins:                  mail/789012.nsf
STORE:Tony Stark:                    mail/345678.nsf
STORE:Carmen Sandiego:               mail/901234.nsf
NEWS:Peter Parker:                   mail/567890.nsf
NEWS:Clark Kent:                     mail/654321.nsf
STORES:store123                      mail/369121.nsf

Then file2.csv:

OU                      
CORP
STORE     
NEWS

For every line in file2.csv that has 'CORP', 'STORE', or 'NEWS', I want to search through file1.csv and create a file, such as STOREall.csv, CORPall.csv, and NEWSall.csv.

So a file like STOREall.csv should have:

OU                                   Mailfile
STORE:Mary Poppins:                  mail/789012.nsf
STORE:Tony Stark:                    mail/345678.nsf
STORE:Carmen Sandiego:               mail/901234.nsf
STORES:store123                      mail/369121.nsf

CORPall.csv:

OU                                   Mailfile
CORP:Jenny Smith:                    mail/246802.nsf
CORP:John Smith:,John Smith:         mail/123456.nsf

Then NEWSall.csv

OU                                   Mailfile
NEWS:Peter Parker:                   mail/567890.nsf
NEWS:Clark Kent:                     mail/654321.nsf

If I can also have it with just OU column and not Mailfile column, that would also be good. But I think I can just do a usecols=['OU'] for that.

CodePudding user response：

You could read both files into dataframes, loop through the values in the 'OU' column of file2.csv, filter the data from file2.csv and save as individual CSV files.

import pandas as pd

data = pd.read_csv('file1.csv')

depts = pd.read_csv('file2.csv')

for dept in depts['OU']:
    df_dept = data[data['OU'].str.contains(dept)]
    df_dept['OU'].to_csv(f'{dept}all.csv', index=False)

CodePudding user response：

You could iterate over the names in file2 and use str.startswith to identify the correct rows:

file1 = pd.read_csv('file1.csv')
file2 = pd.read_csv('file2.csv')

for i in file2['OU']:
    file1[file1['OU'].str.startswith(i)].to_csv(f'{i}all.csv', index=False)

Then:

for i in file2['OU']:
    print(f'{i}:\n', pd.read_csv(f'{i}all.csv'), end='\n\n')

prints

CORP:
                              OU         Mailfile
0             CORP:Jenny Smith:  mail/246802.nsf
1  CORP:John Smith:,John Smith:  mail/123456.nsf

STORE:
                        OU         Mailfile
0     STORE:Mary Poppins:  mail/789012.nsf
1       STORE:Tony Stark:  mail/345678.nsf
2  STORE:Carmen Sandiego:  mail/901234.nsf
3         STORES:store123  mail/369121.nsf

NEWS:
                    OU         Mailfile
0  NEWS:Peter Parker:  mail/567890.nsf
1    NEWS:Clark Kent:  mail/654321.nsf