Series [] and .loc[] sometimes return a single value, and sometimes unexpectedly a single-element Series

In the code below, I am trying to find the longest string in a DataFrame column.

Depending on the length of the column, the function below (maxstr) returns a single value for short columns (as expected) and a single-element Series for long columns (which I did not expect).

Any pointers would be appreciated.

I used the methods discussed in "Find length of longest string in Pandas dataframe column".

import numpy as np
import pandas as pd

As the data is large, I resort to displaying the information on the dataframe and series as I go along.

Read dataframe from clipboard

df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')

print(f'{type(df)=}')
print(f'{df.shape=}')
print(f'{df.dtypes=}')
print(f'{df.columns=}')

type(df)=<class 'pandas.core.frame.DataFrame'>
df.shape=(581, 6)
df.dtypes=CID           int64
TITLE        object
FIRSTNAME    object
FUNCTION     object
PHONE        object
EMAIL        object
dtype: object
df.columns=Index(['CID', 'TITLE', 'FIRSTNAME', 'FUNCTION', 'PHONE', 'EMAIL'], dtype='object')

Function to return the maximum length string equivalent in a column/series

def maxstr(ser: pd.Series):
    print(f'{type(ser)=}')

    print(f'\n{type(ser.astype(str).str.len().idxmax())=}')
    print(f'{type(ser[ser.astype(str).str.len().idxmax()])=}')

    # should return a single value and not a series
    return ser[ser.astype(str).str.len().idxmax()]

Working with a short column (n=50), I get an int (as expected):

short = df.head(50)
short_return = maxstr(short['CID'])

type(ser)=<class 'pandas.core.series.Series'>

type(ser.astype(str).str.len().idxmax())=<class 'tuple'>
type(ser[ser.astype(str).str.len().idxmax()])=<class 'numpy.int64'>

Working with a longer column from the same dataframe (same data, n=100), I get a Series (not expected):

long = df.head(100)
long_return = maxstr(long['CID'])

type(ser)=<class 'pandas.core.series.Series'>
    
type(ser.astype(str).str.len().idxmax())=<class 'tuple'>
type(ser[ser.astype(str).str.len().idxmax()])=<class 'pandas.core.series.Series'>

In both cases, we find the same int value (but one in a series, and the other as a single value)

short_return == long_return.iloc[0]

True

The int value is unique, so it occurs once in the dataframe column

value = short_return
print(f'The value: {value=}')
print(f'{sum(short["CID"] == value)=}')
print(f'{sum(long["CID"] == value)=}')

The value: value=1937
sum(short["CID"] == value)=1
sum(long["CID"] == value)=1

CodePudding user response：

In my opinion, the problem is duplicated index values: if the tuple returned by idxmax occurs more than once in the index, the selection returns all rows with that label (a Series) rather than a scalar.
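To make the effect concrete, here is a minimal sketch with made-up data (the index labels and values are hypothetical, not taken from the question). When the tuple passed to [] occurs more than once in a MultiIndex, pandas returns every matching row as a Series:

```python
import pandas as pd

# Hypothetical toy data: the label ("y", 2) appears twice in the MultiIndex.
idx = pd.MultiIndex.from_tuples([("x", 1), ("y", 2), ("y", 2)])
ser = pd.Series([10, 200, 30], index=idx)

# False here, because the index contains a duplicated label.
print(ser.index.is_unique)

# idxmax returns the label (a tuple) of the first maximum.
label = ser.idxmax()

# Label-based selection returns all rows carrying that label.
result = ser[label]
print(type(result), len(result))
```

Checking ser.index.is_unique on the real data is a quick way to see whether this applies.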

A simple way to avoid this is to use the default index. Change:

df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')

to:

df = pd.read_clipboard(sep='\t', na_values='')

so that there is no MultiIndex, only the default RangeIndex.

Check that the index is a RangeIndex:

print(df.index)
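If the data has already been loaded with a MultiIndex, a roughly equivalent route is reset_index(), which moves the index levels back into ordinary columns. This is an assumption about what fits the use case, not part of the original code, and the data below is a hypothetical stand-in for the clipboard contents:

```python
import pandas as pd

# Hypothetical stand-in for the clipboard data, with a duplicated MultiIndex label.
idx = pd.MultiIndex.from_tuples([("a", 1), ("b", 2), ("b", 2)], names=["k1", "k2"])
df = pd.DataFrame({"CID": [7, 1937, 42]}, index=idx)

# Move the index levels into columns; the result gets a default RangeIndex.
flat = df.reset_index()
print(flat.index)

ser = flat["CID"]
label = ser.astype(str).str.len().idxmax()  # an integer row label
value = ser[label]                          # a scalar, because row labels are unique
print(type(value))
```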

If you need the MultiIndex, remove the duplicated index values instead:

df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
df = df[~df.index.duplicated()]
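The same filter on a small hypothetical frame (the labels and values are invented for illustration) shows that the lookup returns a scalar once the index is unique:

```python
import pandas as pd

# Hypothetical data: the label ("b", 2) is duplicated in the MultiIndex.
idx = pd.MultiIndex.from_tuples([("a", 1), ("b", 2), ("b", 2)], names=["k1", "k2"])
df = pd.DataFrame({"CID": [7, 1937, 42]}, index=idx)

# duplicated() marks every repeat of a label except the first occurrence,
# so the negated mask keeps one row per label.
deduped = df[~df.index.duplicated()]

ser = deduped["CID"]
label = ser.astype(str).str.len().idxmax()  # tuple label of the longest string
value = ser[label]                          # scalar, since the index is now unique
print(label, value, type(value))
```

Note that this discards rows (only the first row of each duplicated label is kept), so it is only suitable when the dropped rows are not needed.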