In the code below, I am trying to find the longest string in a DataFrame column.
Depending on the length of the column, the function below (maxstr) returns a single value for short columns (as expected) and a single-element Series for long columns (which I didn't expect).
Any pointers would be appreciated.
I used the methods discussed in "Find length of longest string in Pandas dataframe column".
import numpy as np
import pandas as pd
As the data is large, I resort to displaying the information on the dataframe and series as I go along.
Read dataframe from clipboard
df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
print(f'{type(df)=}')
print(f'{df.shape=}')
print(f'{df.dtypes=}')
print(f'{df.columns=}')
type(df)=<class 'pandas.core.frame.DataFrame'>
df.shape=(581, 6)
df.dtypes=CID int64
TITLE object
FIRSTNAME object
FUNCTION object
PHONE object
EMAIL object
dtype: object
df.columns=Index(['CID', 'TITLE', 'FIRSTNAME', 'FUNCTION', 'PHONE', 'EMAIL'], dtype='object')
Function to return the maximum length string equivalent in a column/series
def maxstr(ser: pd.Series):
    print(f'{type(ser)=}')
    print(f'\n{type(ser.astype(str).str.len().idxmax())=}')
    print(f'{type(ser[ser.astype(str).str.len().idxmax()])=}')
    # should return a single value and not a series
    return ser[ser.astype(str).str.len().idxmax()]
Working with a short column (n=50), I get an int (as expected)
short = df.head(50)
short_return = maxstr(short['CID'])
type(ser)=<class 'pandas.core.series.Series'>
type(ser.astype(str).str.len().idxmax())=<class 'tuple'>
type(ser[ser.astype(str).str.len().idxmax()])=<class 'numpy.int64'>
Working with a longer column from the same DataFrame (same data, n=100), I get a Series (not expected)
long = df.head(100)
long_return = maxstr(long['CID'])
type(ser)=<class 'pandas.core.series.Series'>
type(ser.astype(str).str.len().idxmax())=<class 'tuple'>
type(ser[ser.astype(str).str.len().idxmax()])=<class 'pandas.core.series.Series'>
In both cases, we find the same int value (but one in a series, and the other as a single value)
short_return == long_return.iloc[0]
True
The int value is unique, so it occurs once in the dataframe column
value = short_return
print(f'The value: {value=}')
print(f'{sum(short["CID"] == value)=}')
print(f'{sum(long["CID"] == value)=}')
The value: value=1937
sum(short["CID"] == value)=1
sum(long["CID"] == value)=1
CodePudding user response:
In my opinion, the problem is duplicated index values: idxmax returns a tuple (a MultiIndex label), and when the index contains duplicated values, selecting with that tuple returns not a scalar but a Series of the matching rows.
A simple way to avoid this is to use the default index. Here, change:
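To make this concrete, here is a minimal sketch on made-up data (the index names k1/k2 and the values are hypothetical, not taken from the question). It only illustrates that the same label lookup can give a Series when the label is duplicated and a scalar once the index is unique:

```python
import pandas as pd

# Hypothetical toy data: the label ("a", 1) appears twice in the MultiIndex.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 1), ("b", 2)], names=["k1", "k2"]
)
ser = pd.Series([1937, 7, 5], index=index, name="CID")

# idxmax returns the label (a tuple) of the longest string representation.
label = ser.astype(str).str.len().idxmax()

# The label is duplicated, so the lookup returns every matching row.
dup_result = ser[label]
print(type(label), type(dup_result))

# With duplicated labels dropped, the same lookup yields a single value.
ser_unique = ser[~ser.index.duplicated()]
unique_result = ser_unique[label]
print(type(unique_result))
```

Checking `df.index.is_unique` on the full DataFrame and on `df.head(50)` should show whether this is what happens with the real data.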
df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
to:
df = pd.read_clipboard(sep='\t', na_values='')
so that the DataFrame has the default RangeIndex instead of a MultiIndex.
Check that the index is a RangeIndex:
print(df.index)
If you need the MultiIndex, remove the duplicated index values:
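If the DataFrame is already in memory, calling reset_index() should give a similar result without reading the clipboard again. The sketch below uses hypothetical stand-in data (the columns K1 and K2 are invented for illustration), so treat it as an assumption about the shape of the real data:

```python
import pandas as pd

# Hypothetical stand-in for the clipboard data, with a duplicated index label.
df = pd.DataFrame(
    {"K1": ["a", "a", "b"], "K2": [1, 1, 2], "CID": [1937, 7, 5]}
).set_index(["K1", "K2"])

# The MultiIndex contains duplicates.
print(df.index.is_unique)

# reset_index() moves the index levels back into columns and
# installs a default RangeIndex, whose labels are always unique.
flat = df.reset_index()
print(flat.index)

# idxmax now returns an integer label, and .loc gives back a single value.
label = flat["CID"].astype(str).str.len().idxmax()
value = flat.loc[label, "CID"]
print(type(value))
```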
df = pd.read_clipboard(sep='\t', index_col=[0, 1, 2, 3, 4], na_values='')
df = df[~df.index.duplicated()]
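Note that dropping duplicated index values also drops rows. As an alternative that is not part of the answer above, the lookup could be done by position instead of by label, which does not depend on the index at all. The helper name maxstr_positional and the toy data are hypothetical:

```python
import pandas as pd


def maxstr_positional(ser: pd.Series):
    """Return the element whose string form is longest, selected by position."""
    # Position (not label) of the longest string representation.
    pos = ser.astype(str).str.len().to_numpy().argmax()
    # iloc with an integer position always returns a single element.
    return ser.iloc[pos]


# Hypothetical toy data with a duplicated MultiIndex label.
index = pd.MultiIndex.from_tuples([("a", 1), ("a", 1), ("b", 2)])
ser = pd.Series([1937, 7, 5], index=index, name="CID")

result = maxstr_positional(ser)
print(type(result))
```

This keeps every row and the MultiIndex intact. If several values share the maximum length, the first one is returned.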
