Get max number of each pandas row from object column with strings and lists-CodePudding

I have a dataframe

import pandas as pd
import numpy as np

df1 = pd.DataFrame.from_dict(
    {"col1": [0, 0, 0, 0, 0],
    "col2": ["15", [10,15,20], "30", [20, 25], np.nan]})

which looks like this

col1	col2
0	"15"
0	[10,15,20]
0	"30"
0	[20,25]
0	NaN

For col2, I need the highest value of each row, e.g. 15 for the first row and 20 for the second row, so that I end up with the following dataframe:

df2 = pd.DataFrame.from_dict(
    {"col1": [0, 0, 0, 0, 0],
    "col2": [15, 20, 30, 25, np.nan]})

which should look like this

col1	col2
0	15
0	20
0	30
0	25
0	NaN

I tried using a for-loop that checks which type col2 for each row has, and then converts str to int, applies max() to lists and leaves nan's as they are but did not succeed. This is how I did tried (although I suggest to just ignore my attempt):

col = df1["col2"]
coltypes = []

for i in col:
#get type of each row
    coltype = type(i) 
    coltypes.append(coltype)

df1["coltypes"] = coltypes

#assign value to col3 based on type
df1["col3"] = np.where(df1["coltypes"] == str, df1["col1"].astype(int), 
                      np.where(df1["coltypes"] == list, max(df1["coltypes"]), np.nan))

Giving the following error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-b8eb266d5519> in <module>
      9 
     10 df1["col3"] = np.where(df1["coltypes"] == str, df1["col1"].astype(int), 
---> 11                       np.where(df1["coltypes"] == list, max(df1["coltypes"]), np.nan))

TypeError: '>' not supported between instances of 'type' and 'type'

CodePudding user response：

Let us try explode then groupby with max

out = df1.col2.explode().groupby(level=0).max()
Out[208]: 
0     15
1     20
2     30
3     25
4    NaN
Name: col2, dtype: object

CodePudding user response：

Another approach that might be easier to understand would be using apply() with a simple function that returns the max depending on the type.

import pandas as pd
import numpy as np

df1 = pd.DataFrame.from_dict(
    {"col1": [0, 0, 0, 0, 0],
    "col2": ["15", [10,15,20], "30", [20, 25], np.nan]})

def get_max(x):
    if isinstance(x, list):
        return max(x)
    elif isinstance(x, str):
        return int(x)
    else:
        return x

df1['max'] = df1['col2'].apply(get_max)

print(df1)

Output would be:

   col1          col2   max
0     0            15  15.0
1     0  [10, 15, 20]  20.0
2     0            30  30.0
3     0      [20, 25]  25.0
4     0           NaN   NaN

CodePudding user response：

@ratsrule23,

You will find the exact same question answered here