Pandas Dataframe Integer Column Data Cleanup-CodePudding

I have a CSV file, which I read through pandas read_csv module.

There is one column, which is supposed to have numbers only, but the data has some bad values.

Some rows (very few) have "alphanumeric" strings, few rows are empty while a few others have floating point numbers. Also, for some reason, some numbers are also being read as strings.

I want to convert it in the following way:

Alphanumeric, None, empty (numpy.nan) should be converted to 0
Floating point should be typecasted to int
Integers should remain as they are
And obvs, numbers should be read as numbers only.

How should I proceed, as I have no other idea than to read each row one by one and typecast into int, in a try-except block, while assigning 0 if exception is raised.

like:

def typecast_int(n):
    try:
        return int(n)
    except:
        return 0

for idx, row in df.iterrows:
    row["number_column"] = typecast_int(row["number_column"])

But there are some issues with this approach. Firstly, iterrows is bad performance wise. And my dataframe may have upto 700k to 1M records and I have to process ~500 such CSV files. And secondly, it just doesn't feel right to do it this way.

I could do a tad better by using df.apply instead of iterrows but that is also not too different.

CodePudding user response：

From your 4 conditions, there's

df.number_column = (pd.to_numeric(df.number_column, errors="coerce")
                         .fillna(0)
                         .astype(int))

This first converts the column to be numeric values only. If errors arise (e.g., due to alphanumerics) they got "coerce"d to NaN. Then we fill those NaN's with 0 and lastly cast everything to integers.