Pandas extract multiple column values based on iloc and store in list-CodePudding

I have a dataframe like as below

df1 = pd.DataFrame({'region': ['ASEAN','ANZ','INDIA','KOREA'],
'first_contact':['user1','user2','user3','user4'],'first_contact_email':['[email protected]','[email protected]','[email protected]','[email protected]'],
'second_contact':['user11','user21','user31','user41'],'second_contact_email':['[email protected]','[email protected]','[email protected]','[email protected]'],
'third_contact':['user111','user211','user311','user411'],'third_contact_email':['[email protected]','[email protected]','[email protected]','[email protected]'],
'fourth_contact':['user1111','user2111','user3111','user4111'],'fourth_contact_email':['[email protected]','[email protected]','[email protected]','[email protected]'],
'fifth_contact':['user11111','user21111','user31111','user41111'],'fifth_contact_email':['[email protected]','[email protected]',np.nan,np.nan],
'sixth_contact':['user111111','user211111','user311111','user411111'],'sixth_contact_email':['[email protected]','[email protected]',np.nan,np.nan]})

I would like to do the below

a) iterate through each row

b) get all email_id for that row

c) append it to a list

So, I tried the below

region_list = ['ASEAN','INDIA','KOREA','ANZ']

for region in region_list:
   email_list=[]
   temp_email_df = df1[df1['Region']==region]
   email_list.append(temp_email_df.iloc[0,2:,2].to_string(header=False, index=False))
   print(email_list)
   --send email code follows here

for each region, I would like to get their list of email ids and use them to send an email.

But now the issue is in extracting all the emails.

How can I do to extract email-ids from all the email columns using iloc or positions etc

I expect my output to print email_list for each region

CodePudding user response：

You can use isin method to filter the regions based on the list you have, then filter the columns that has email in the name, finally stack it which will essentially remove NaN values, then call to_list() to create list out of it:

>>> df1[df1['region'].isin(region_list)].filter(like='email').stack().to_list()

['[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]', '[email protected]']

But if you need to get the email list by region, you can group the dataframe by regions after filtering the regions of interest then follow the rest of logic

(df1[df1['region'].isin(region_list)]
 .set_index('region')
 .filter(like='email')
 .groupby(level=0)
 .apply(lambda x: x.stack().tolist())
 .reset_index())

  region                                                  0
0    ANZ  [[email protected], [email protected], user211@gm...
1  ASEAN  [[email protected], [email protected], user111@gm...
2  INDIA  [[email protected], [email protected], user311@gm...
3  KOREA  [[email protected], [email protected], user411@gm...

CodePudding user response：

Try as follows.

Use df.melt to unpivot df1 from wide to long format for all columns that contain the substring "email".
Use df.dropna with value (column that now contains all "emails") as the subset to get rid of NaN values.
Finally, use df.groupby to group "emails" per region, and then apply(list) to group the individual values in a list.
Result will be a pd.Series with the regions as index. So, to access the specific lists, you use, e.g., emails['ANZ'], and then first email: emails['ANZ'][0] (i.e. "[email protected]").

emails = df1.melt(id_vars='region', 
         value_vars=[col for col in df1.columns if 'email' in col]
         ).dropna(subset='value').groupby('region')['value'].apply(list)

print(emails)
region
ANZ      [[email protected], [email protected], user211@gm...
ASEAN    [[email protected], [email protected], user111@gm...
INDIA    [[email protected], [email protected], user311@gm...
KOREA    [[email protected], [email protected], user411@gm...
Name: value, dtype: object

# e.g. for `ANZ`:
print(emails['ANZ'])

['[email protected]', '[email protected]', '[email protected]', 
 '[email protected]', '[email protected]', '[email protected]']