Home > Mobile >  Concatenating string columns but only if not N/A in pandas
Concatenating string columns but only if not N/A in pandas

Time:01-27

I have the following dataframe that lists information on fast food road stops.

Input

first_stop   second_stop     third_stop
mcdonalds    burger king     popeyes
mcdonalds    N/A             N/A
wendys       kfc             N/A
taco bell    kfc             wendys
popeyes     kfc              panda express

I want to create a new column summary that summarizes the stops like so: Expected Output

first_stop   second_stop     third_stop        summary
mcdonalds    burger king     popeyes           mcdonalds -> burger king -> popeyes 
mcdonalds    N/A             N/A               mcdonalds
wendys       kfc             N/A               wendys -> kfc
taco bell    kfc             wendys            taco bell -> kfc -> wendys
popeyes      kfc             panda express     popeyes -> kfc -> panda express

I cannot simply concatenate the three stop columns because some have N/A values if the stop did not exist. How can i do this in pandas?

I've tried this, but obviously it won't give me what i want:

df['summary'] = df['first_stop']   '->'   df['second_stop']   '->'   df['third_stop']

CodePudding user response:

Use stack to flatten your dataframe. stack drop NaN values by default then groupby index level 0 and finally join strings.

df['summary'] = df.stack().groupby(level=0).apply(lambda x: ' -> '.join(x))
print(df)

# Output
  first_stop  second_stop     third_stop                              summary
0  mcdonalds  burger king        popeyes  mcdonalds -> burger king -> popeyes
1  mcdonalds          NaN            NaN                            mcdonalds
2     wendys          kfc            NaN                        wendys -> kfc
3  taco bell          kfc         wendys           taco bell -> kfc -> wendys
4    popeyes          kfc  panda express      popeyes -> kfc -> panda express

CodePudding user response:

If you have a large dataset you could use a classical loop that will be faster than stack groupby:

df['summary'] = df.apply(lambda s: ' -> '.join(e for e in s if not pd.isna(e)),
                         axis=1)

or to stop on the first NA:

from itertools import takewhile
df['summary'] = df.apply(lambda s: ' -> '.join(
                         takewhile(lambda x: not pd.isna(x), s)
                         ), axis=1)

Or, for this particular ' -> ' separator where the characters are not expected to be found in the words:

df['summary'] = df.fillna('').apply(' -> '.join, axis=1).str.rstrip('>- ')

NB. this is a trick, doesn't work on all separators

output:

  first_stop  second_stop     third_stop                              summary
0  mcdonalds  burger king        popeyes  mcdonalds -> burger king -> popeyes
1  mcdonalds          NaN            NaN                            mcdonalds
2     wendys          kfc            NaN                        wendys -> kfc
3  taco bell          kfc         wendys           taco bell -> kfc -> wendys
4    popeyes          kfc  panda express      popeyes -> kfc -> panda express
  •  Tags:  
  • Related